This December, we had our largest community event ever: the Hugging Face Datasets Sprint 2020.
It all started as an internal project gathering about 15 employees to spend a week working together to add datasets to the Hugging Face Datasets Hub backing the 🤗 datasets library.
The library provides 2 main features surrounding datasets:
- One-line dataloaders for many public datasets: with a simple command like
squad_dataset = load_dataset("squad"), you can download and pre-process any of the nearly 600 (and counting!) public datasets provided on the Hugging Face Datasets Hub into an efficient, ready-to-use dataset object
- Efficient data pre-processing: simple, fast, and reproducible data pre-processing for the hub's datasets and any local dataset (CSV/JSON/text…). With simple commands like
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"])), you can efficiently tokenize an entire dataset (see the sketch below).
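Here's a rough end-to-end sketch combining both features (assuming the datasets and transformers libraries are installed; the column names below are specific to SQuAD):

```python
# A minimal sketch, not the only usage pattern: load a public dataset
# from the Hub and tokenize it with a pretrained tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

squad_dataset = load_dataset("squad")  # one-line download + preparation
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenized_dataset = squad_dataset.map(
    lambda x: tokenizer(x["question"], x["context"], truncation=True),
    batched=True,  # tokenize many examples per call for speed
)
print(tokenized_dataset["train"][0].keys())
```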
During this event, our goal was to reach 500 NLP datasets in the library. After internal discussion, we decided to open it up to the community, expecting a few additional people to come work with us and enjoy a bit of the “Hugging Face team spirit.”
We were completely overwhelmed & surprised by our community's response: over 275 participants joined the open-source community effort, and we had close to 300 people on our dedicated Slack channel.
The contributions were numerous, and we are still digesting the flow of open PRs, but here are some numbers and takeaways we'd like to share:
- The Hugging Face Datasets Hub is now the largest open-source collection of NLP datasets and will soon pass 600 datasets, covering a significant portion of the NLP dataset landscape. You can explore the datasets here.
- The community was amazing in improving coverage of so many different languages. The datasets in the library now cover 467 languages and dialects.
- Several sub-communities emerged during the sprint, leading to very significant support for languages like Spanish, Arabic, Turkish, Portuguese, Polish, Thai, Bulgarian, Indic, and several African languages, just to name a few.
- Each dataset is provided with an editable “Dataset card” (see an example here), which describes the content of the dataset and welcomes information about the curation process that led to its creation.
The event was an inspiring moment for both our team and the community, and we will definitely organize more community events in the future!
We'll share a deep dive soon; in the meantime, here's the TL;DR:
👩‍🔬 Community 👉🏻 Our new Supporter plan offers private model hosting and early beta access, and helps support our open-source efforts!
🏎 Organizations 👉🏻 Our new Lab and Startup inference plans accelerate organizations' NLP roadmaps with a complete solution to serve private models from research to production.
See the pricing page on our website for more information on how you can access these features and support our mission of solving NLP 🤗
We are excited to announce our partnership with Qualcomm to put cutting-edge language technology right into the palms of hundreds of millions of individuals around the world. We look forward to creating a future where anyone can communicate with any person or business around the world in their own words and in their own language. 🌍🌎🌏
Watch our CEO Clément Delangue discuss with Qualcomm CEO Cristiano Amon how Snapdragon 5G mobile platforms and Hugging Face will enable smartphone users to communicate faster and better — in any language.
Thanks to the whole Qualcomm team for your partnership and for inviting us to the Tech Summit virtual stage! 🙏🔥
New features include:
🚀 Fast Tokenizers by default
📖 Self-documented outputs by default
🧩 SentencePiece becomes an optional dependency
⚙️ Model templates
5️⃣ T5 and mT5
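Here's a minimal sketch of the first two items, assuming a transformers v4.x install with PyTorch (the checkpoint name is just an example):

```python
# A minimal sketch, assuming transformers v4.x and PyTorch are installed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# AutoTokenizer now returns a fast (Rust-backed) tokenizer whenever one exists.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.is_fast)  # True

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, Hugging Face!", return_tensors="pt")

# Outputs are now self-documenting objects with named attributes instead of plain tuples.
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)
```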
Any JAX/Flax lovers out there? Ever wanted to use 🤗 Transformers with all the awesome features of JAX? Well, you're in luck! 😍
We've worked with the Google Flax team to enable support for BERT and RoBERTa! 🚀
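As an illustrative sketch of what this enables (assuming a recent transformers release with Flax support plus the jax and flax packages installed):

```python
# A minimal sketch, assuming a transformers version with Flax support
# and the jax/flax packages installed.
from transformers import AutoTokenizer, FlaxBertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = FlaxBertModel.from_pretrained("bert-base-cased")

# Flax models consume NumPy arrays directly.
inputs = tokenizer("Hugging Face meets JAX!", return_tensors="np")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```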
The Cloud TPU VMs alpha was just announced at the NeurIPS 2020 virtual conference and is already integrated into 🤗 transformers! 🚀
Check out our new Flax pretraining example leveraging Flax/JAX & Cloud TPUs along with the full 🤗 stack (datasets, tokenizers & transformers) 🦾
The powerful mT5 model, a multilingual variant of T5, is now part of Transformers, including all of its pre-trained checkpoints.
mT5 was pre-trained on 101 languages and achieves SOTA results on many multilingual benchmarks.
📝 Official paper
📊 Official Results
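Loading a checkpoint is as simple as the sketch below (google/mt5-small is the smallest public checkpoint; note that the raw pre-trained models still need task-specific fine-tuning before they generate useful output):

```python
# A minimal sketch of loading an mT5 checkpoint; the raw pre-trained model
# still needs fine-tuning before it is useful on downstream tasks.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

inputs = tokenizer("Hugging Face est une entreprise franco-américaine.", return_tensors="pt")
generated = model.generate(**inputs, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```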
In work led by Research Scientist Victor Sanh, we show a method to train a model to ignore dataset biases, without explicitly identifying or modeling them, by learning from the errors of a “dumb” model.
Special shout-out to collaborators Thomas Wolf, Yonatan Belinkov, and Sasha Rush.
Our work builds on top of previous works (1, 2, 3) which explicitly construct a biased model (e.g. a hypothesis-only model for NLI) and use it to improve the robustness of the main model via product-of-experts.
The assumption of knowledge of the underlying dataset bias is quite restrictive: finding biases in datasets can be costly and time-consuming.
We show that we don't need such an explicit formulation of the dataset biases.
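As a rough illustration of the product-of-experts idea these works rely on (a schematic PyTorch sketch, not the exact training code from the paper):

```python
# A schematic sketch of product-of-experts debiasing, not the paper's exact code:
# the main model is trained on the product of its distribution with a frozen
# biased/weak model's distribution, so examples the weak model already gets
# right contribute little gradient signal.
import torch.nn.functional as F

def product_of_experts_loss(main_logits, weak_logits, labels):
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(weak_logits.detach(), dim=-1)
    # cross_entropy renormalizes the combined log-scores, i.e. the product of experts.
    return F.cross_entropy(combined, labels)
```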
More results & analysis are in the paper, and you can find Victor's full tweet here!
We've released a utility to distribute a single model across several GPUs, enabling certain large models such as GPT-2 XL to be trained on multi-GPU setups.
So far it's been implemented for GPT-2 and T5. Thank you for the very clean contribution, Alexander Orona. Try it now by installing from master!
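A rough sketch of how it's used (assuming a multi-GPU machine and a transformers install from master at the time of this release):

```python
# A rough sketch, assuming multiple GPUs and a transformers install from master.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Spread the transformer blocks across all visible GPUs.
model.parallelize()

inputs = tokenizer("Model parallelism lets us", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_length=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

model.deparallelize()  # move everything back onto a single device
```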
We just released 🤗 Transformers v4.1.1 with TAPAS, a multi-modal model for question answering on tabular data from Google AI.
Try it out through transformers or with our live inference API widget.
TAPAS was built by Jonathan Herzig, Paweł Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos.
The model is released alongside a TableQuestionAnsweringPipeline, available in v4.1.1.
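A minimal sketch of the new pipeline (the toy table below is just an example; pandas is required, TAPAS additionally needs the torch-scatter package, and table values must be strings):

```python
# A minimal sketch of the table question-answering pipeline.
import pandas as pd
from transformers import pipeline

table = pd.DataFrame({
    "Repository": ["transformers", "datasets", "tokenizers"],
    "Stars": ["36000", "4800", "3900"],
})

tqa = pipeline("table-question-answering")  # downloads the default TAPAS checkpoint
result = tqa(table=table, query="How many stars does the datasets repository have?")
print(result["answer"])
```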
✨ Other highlights of this release are:
- MPNet model
- Model parallelization
- Sharded DDP using Fairscale
- Conda release
- Examples & research projects