News
Hugging Face Datasets Sprint 2020
This December, we had our largest community event ever: the Hugging Face Datasets Sprint 2020.
It all started as an internal project gathering about 15 employees to spend a week working together to add datasets to the Hugging Face Datasets Hub backing the 🤗 Datasets library.
The library provides two main features around datasets:
- One-line dataloaders for many public datasets: with a simple command like squad_dataset = load_dataset("squad"), you can download and pre-process any of the nearly 600 (and counting!) public datasets provided on the Hugging Face Datasets Hub into an efficient dataframe-like class.
- Efficient data pre-processing: simple, fast, and reproducible data pre-processing for the hub's datasets and any local dataset (CSV/JSON/text…). With simple commands like tokenized_dataset = dataset.map(lambda x: tokenizer(x)), you can efficiently tokenize an entire dataset.
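Putting the two together, loading and tokenizing a dataset end to end looks roughly like this (a minimal sketch, assuming the datasets and transformers libraries are installed; the "question" column comes from the SQuAD schema):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Download and cache the SQuAD dataset from the Hugging Face Datasets Hub
squad_dataset = load_dataset("squad")

# Tokenize the whole training split in batches
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized_dataset = squad_dataset["train"].map(
    lambda examples: tokenizer(examples["question"], truncation=True),
    batched=True,
)

print(tokenized_dataset[0]["input_ids"][:10])
```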
During this event, our goal was to reach 500 NLP datasets in the library. After internal discussion, we decided to open it up to the community, expecting a few additional people to come work with us and enjoy a bit of the "Hugging Face team spirit."
We were completely overwhelmed & surprised by our community's response: over 275 participants joined the open-source community effort, and we had close to 300 people on our dedicated Slack channel.
The contributions were numerous, and we are still digesting the flow of opened PRs, but here are some numbers and takeaways we have to share:
- The Datasets Hub is now the largest open-source collection of NLP datasets and will soon pass 600 datasets, covering a significant portion of the NLP dataset world. You can explore the datasets here.
- The community was amazing in improving the coverage in so many different languages. The datasets in the library now cover 467 languages and dialects.
- Several sub-communities emerged during the sprint, leading to some very significant support for languages like Spanish, Arabic, Turkish, Portuguese, Polish, Thai, Bulgarian, Indic, and several African languages, just to name a few.
- Each dataset is provided with an editable "Dataset card" (see an example here), which describes the content of the dataset and welcomes information about the curation process that led to its creation.
The event was an inspiring moment for both our team and the community, and we will definitely organize more community events in the future!
Hugging Face Awarded Best Demo Paper at EMNLP
We were honored to be awarded the Best Demo Paper for "Transformers: State-of-the-Art Natural Language Processing" at EMNLP, one of the top academic NLP conferences 🥰
Thank you to our wonderful team members and the fantastic community of contributors who make the library possible! 🤗🤗🤗
Some of our team also shared paper recommendations from the conference in our forum; feel free to respond with any paper suggestions or comments of your own!
🔥 Introducing private models for community and organizations 🔥
We'll share a deep dive soon; in the meantime, here's the TL;DR:
👩‍🔬 Community 👉🏻 Our new Supporter plan offers private model hosting, early beta access, and helps support our open-source efforts!
Organizations 👉🏻 Our new Lab and Startup inference plans accelerate organizations' NLP roadmaps with a complete solution to serve private models from research to production.
See the pricing page on our website for more information on how you can access these features and support our mission of solving NLP 🤗
Hugging Face + Qualcomm Partnership
We are excited to announce our partnership with Qualcomm to put cutting-edge language technology right into the palms of hundreds of millions of individuals around the world. We look forward to creating a future where anyone can communicate with any person or business around the world in their own words and in their own language.
Watch our CEO Clément Delangue discuss with Qualcomm CEO Cristiano Amon how Snapdragon 5G mobile platforms and Hugging Face will enable smartphone users to communicate faster and better, in any language.
Thanks to the whole Qualcomm team for your partnership and for inviting us to the Tech Summit virtual stage! 🔥
🤗 Transformers v4.0.0 is now out!
New features include:
- Fast Tokenizers by default
- Self-documented outputs by default
- SentencePiece becomes an optional dependency
- Model templates
- T5 and mT5
JAX/Flax + Transformers Integration
Any JAX/Flax lovers out there? Ever wanted to use 🤗 Transformers with all the awesome features of JAX? Well, you're in luck!
We've worked with the Google Flax team to enable support for BERT and RoBERTa!
The Cloud TPU VMs alpha was just announced at the NeurIPS 2020 virtual conference and is already integrated into 🤗 Transformers!
Check out our new Flax pretraining example leveraging Flax/JAX & Cloud TPUs along with the full 🤗 stack (Datasets, Tokenizers & Transformers) 🦾
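As a quick illustration, a forward pass through Flax BERT looks roughly like this (a minimal sketch, assuming jax, flax, and a recent 🤗 Transformers are installed; depending on the checkpoint you may also need from_pt=True to convert PyTorch weights):

```python
from transformers import BertTokenizerFast, FlaxBertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = FlaxBertModel.from_pretrained("bert-base-cased")

# The Flax model consumes NumPy/JAX arrays directly
inputs = tokenizer("Hugging Face loves JAX!", return_tensors="np")
outputs = model(**inputs)

last_hidden_state = outputs[0]
print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```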
mT5, including all pre-trained models, is now available in 🤗 Transformers
The powerful mT5 model, a multilingual variant of T5, is now part of Transformers, along with all of its pre-trained checkpoints.
mT5 was trained on 101 languages and yields SOTA results on many multilingual tasks.
Official paper | Official Results
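Loading mT5 follows the usual Transformers pattern; here is a minimal, hedged sketch with the google/mt5-small checkpoint (the sentencepiece extra is required for the tokenizer, and the raw pre-trained checkpoints still need fine-tuning on a downstream task before their generations are useful):

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# mT5 is a text-to-text model: encode text, generate text back
inputs = tokenizer("Hugging Face est une entreprise franco-américaine.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```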
🚨 New pre-print on avoiding dataset biases
In work led by Research Scientist Victor Sanh, we show a method to train a model to ignore dataset biases without explicitly identifying or modeling them, by learning from the errors of a "dumb" model.
Special shout-out to collaborators Thomas Wolf, Yonatan Belinkov, and Sasha Rush.
Our work builds on top of previous works (1, 2, 3) which explicitly construct a biased model (e.g. a hypothesis-only model for NLI) and use it to improve the robustness of the main model via product-of-experts.
The assumption of knowledge of the underlying dataset bias is quite restrictive: finding biases in datasets can be costly and time-consuming.
We show that we don't need such an explicit formulation of the dataset biases.
More results & analysis are in the paper, and you can see Victor's full tweet here!
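For intuition, the product-of-experts recipe combines the main model's and the weak model's predictions during training, so the main model gets little gradient signal on examples the weak model already solves via the bias. A hedged sketch of that loss in PyTorch (not the paper's exact code, just the generic product-of-experts formulation):

```python
import torch
import torch.nn.functional as F

def product_of_experts_loss(main_logits, weak_logits, labels):
    # Combine the two "experts" in log space; the weak model is treated
    # as fixed so gradients only flow through the main model.
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(weak_logits.detach(), dim=-1)
    return F.cross_entropy(combined, labels)

# Toy usage on a 3-class task with random logits
main_logits = torch.randn(4, 3, requires_grad=True)
weak_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
product_of_experts_loss(main_logits, weak_logits, labels).backward()
```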
Model Parallelism is now part of 🤗 Transformers!
We've released a utility to distribute a model over several GPUs, enabling certain large models such as GPT-2 XL to be trained on multiple GPUs.
So far it's been implemented for GPT-2 and T5. Thank you for the very clean contribution, Alexander Orona. Try it now by installing from master!
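Concretely, the supported models expose a parallelize() method; here is a minimal sketch (assuming a machine with at least two GPUs and enough total memory for GPT-2 XL):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Spread the transformer blocks across all visible GPUs; a custom
# device_map (GPU index -> list of block indices) can also be passed.
model.parallelize()

inputs = tokenizer("Model parallelism lets us", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_length=30)
print(tokenizer.decode(outputs[0]))

# Move everything back to a single device when done
model.deparallelize()
```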
🤗 Transformers is starting to work with structured data!
We just released 🤗 Transformers v4.1.1 with TAPAS, a multi-modal model from Google AI for question answering on tabular data.
Try it out through Transformers or our live Inference API widget.
TAPAS was built by Jonathan Herzig, Paweł Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos.
The model is released alongside a TableQuestionAnsweringPipeline, available in v4.1.1.
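Here is roughly what the new pipeline looks like in practice (a hedged sketch; the default TAPAS checkpoint is downloaded automatically, also requires the torch-scatter package, and expects every table cell as a string):

```python
import pandas as pd
from transformers import pipeline

table_qa = pipeline("table-question-answering")

table = pd.DataFrame({
    "Repository": ["transformers", "datasets", "tokenizers"],
    "Stars": ["36000", "7000", "4000"],  # cells must be strings
})

result = table_qa(table=table, query="How many stars does the datasets repository have?")
print(result["answer"])
```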
✨ Other highlights of this release are:
- MPNet model
- Model parallelization
- Sharded DDP using Fairscale
- Conda release
- Examples & research projects
Beam search doesn't have to be boring!
Now you can bring more variety into your beam search with Diverse Beam Search.
Huge thanks to Ayush Jain for the PR!
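Diverse beam search is exposed directly through generate(): the beams are split into groups, and groups are penalized for producing the same tokens. A hedged sketch with GPT-2 (num_beam_groups and diversity_penalty are the relevant parameters):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The best thing about open source is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=6,
    num_beam_groups=3,       # split the 6 beams into 3 groups
    diversity_penalty=1.0,   # penalize groups for repeating each other
    num_return_sequences=3,
    max_length=30,
    pad_token_id=tokenizer.eos_token_id,
)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```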
Other Model Hub Highlights
MPNet (NeurIPS 2020)
MPNet, a new pretrained language model from Microsoft that combines the advantages of masked language modeling and permuted language modeling, is now available in 🤗 Transformers along with its pretrained weights. You can load it directly from the 🤗 model hub.
Doc: https://huggingface.co/transformers/model_doc/mpnet.html
Paper: https://proceedings.neurips.cc/paper/2020/file/c3a690be93aa602ee2dc0ccab5b7b67e-Paper.pdf
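Loading it follows the usual Auto-class pattern; a minimal, hedged sketch with the microsoft/mpnet-base checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModel.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet combines masked and permuted language modeling.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```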
Efficient Models: Long-range Transformers
Long-range Transformers use improved attention mechanisms to model longer contexts while reducing memory use. These models are also available in our model hub!
Longformer (AI2), Reformer (Google), Funnel Transformer (Google/CMU)
Here is a good survey on this topic: Efficient Transformers: A Survey
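For example, Longformer handles inputs of up to 4,096 tokens out of the box; a hedged sketch with the allenai/longformer-base-4096 checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

long_text = "Long documents no longer have to be truncated to 512 tokens. " * 100
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.last_hidden_state.shape)
```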
Community
🔥 Top Contributors 🔥
This issue, we're highlighting the 275+ participants who joined us in our Datasets sprint! Check out the list of GitHub contributors here.
Want to be featured in a future newsletter? A great way to contribute is to check out these good first issues!
Tutorials
Model Deletion
Uploaded a model in the wrong organization? Want to start again from scratch? You can now delete a model directly from the 🤗 Hub, in 2 clicks!
⚠️ Beware, any deletion is a permanent action; there is no going back!
Fine-tuning a model on a token classification task
Fine-tuning a Transformers model on a token classification task is made easy with a new tutorial leveraging the Trainer API and the 🤗 Datasets and Tokenizers libraries!
Question Answering Tutorial
Having trouble with question answering tasks? We have a tutorial for you too!
Pre-processing is usually super tedious with these tasks, but it's made simple here thanks to the amazing features of the 🤗 Tokenizers library (mapping from tokens to character positions, automatically splitting long contexts into multiple segments, etc.).
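Those two features correspond to the return_offsets_mapping and return_overflowing_tokens / stride options of the fast tokenizers; a hedged sketch of the typical question-answering pre-processing call:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer by default

question = "Where does the library come from?"
context = "Hugging Face Transformers started out as pytorch-pretrained-bert. " * 50

encoded = tokenizer(
    question,
    context,
    truncation="only_second",        # only ever truncate the context, never the question
    max_length=384,
    stride=128,                      # overlap between consecutive context chunks
    return_overflowing_tokens=True,  # split long contexts into multiple segments
    return_offsets_mapping=True,     # map each token back to character positions
)

print(len(encoded["input_ids"]))         # number of segments produced
print(encoded["offset_mapping"][0][:5])  # (start_char, end_char) pairs for the first tokens
```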
Automatic Text Classification
Take a look at one of our most popular features in action in this tutorial led by our CEO, Clément. Automatic text classification is now easier than ever for software engineers, thanks to our Inference API.
Let us know if you're interested in learning more, and our CEO, Clément, and Product Director, Jeff Boudier, would be happy to brainstorm with you about how you could use it in your products or workflows.
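Under the hood, the Inference API is a plain HTTP endpoint; here is a hedged sketch of a text-classification request (the model name is just an example of a public sentiment model, and YOUR_API_TOKEN is a placeholder for your own token):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}  # placeholder token

response = requests.post(API_URL, headers=headers, json={"inputs": "This newsletter is great!"})
print(response.json())  # e.g. a list of label/score pairs
```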
Events & Talks
Hugging Face at NeurIPS
Our researchers Victor Sanh, Thomas Wolf and Alexander Rush were at NeurIPS 2020 to present their work on extreme sparsity and interact with the broader ML community. They argue that in the context of transfer learning, one should use pruning methods that consider the changes of weights during fine-tuning.
Paper: https://arxiv.org/abs/2005.07683
Code: https://huggingface.co/mvp
NLP-OSS: An Introduction to Transfer Learning in NLP and HuggingFace
In this talk, Hugging Face CSO Thomas Wolf introduces the recent breakthroughs in NLP that resulted from the combination of Transfer Learning schemes and Transformer architectures.
The second part of the talk is dedicated to an introduction of the open-source tools released by Hugging Face, in particular our Transformers, Tokenizers, and Datasets libraries and our models.
Overcoming the challenges of computational linguistics | Thomas Wolf
Thomas Wolf, the Co-Founder & Chief Scientist from Hugging Face, joins Engati CX to discuss the challenges of computational linguistics and how to overcome them.