News
Hugging Face Datasets Sprint 2020
This December, we had our largest community event ever: the Hugging Face Datasets Sprint 2020.
It all started as an internal project gathering about 15 employees to spend a week working together to add datasets to the Hugging Face Datasets Hub backing the 🤗 Datasets library.
The library provides two main features around datasets:
- One-line dataloaders for many public datasets: with a simple command like squad_dataset = load_dataset("squad"), you can download and pre-process any of the nearly 600 (and counting!) public datasets provided on the Hugging Face Datasets Hub into an efficient dataframe-like class.
- Efficient data pre-processing: simple, fast, and reproducible data pre-processing for the hub's datasets and any local dataset (CSV/JSON/text…). With simple commands like tokenized_dataset = dataset.map(lambda x: tokenizer(x)), you can efficiently tokenize an entire dataset.
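Putting the two together, loading and tokenizing a dataset end to end looks roughly like this (a minimal sketch, assuming the datasets and transformers libraries are installed; the "question" column comes from the SQuAD schema):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Download and cache the SQuAD dataset from the Hugging Face Datasets Hub
squad_dataset = load_dataset("squad")

# Tokenize the whole training split in batches
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized_dataset = squad_dataset["train"].map(
    lambda examples: tokenizer(examples["question"], truncation=True),
    batched=True,
)

print(tokenized_dataset[0]["input_ids"][:10])
```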
During this event, our goal was to reach 500 NLP datasets in the library. After internal discussion, we decided to open it up to the community, expecting a few additional people to come work with us and enjoy a bit of the "Hugging Face team spirit."
We were completely overwhelmed & surprised by our community's response: over 275 participants joined the open-source community effort, and we had close to 300 people on our dedicated Slack channel.
The contributions were numerous, and we are still digesting the flow of opened PRs, but here are some numbers and takeaways we have to share:
- The Datasets Hub is now the largest open-source collection of NLP datasets and will soon pass 600 datasets, covering a significant portion of the NLP dataset world. You can explore the datasets here.
- The community was amazing in improving the coverage in so many different languages. The datasets in the library now cover 467 languages and dialects.
- Several sub-communities emerged during the sprint, leading to some very significant support for languages like Spanish, Arabic, Turkish, Portuguese, Polish, Thai, Bulgarian, Indic, and several African languages, just to name a few.
- Each dataset is provided with an editable "Dataset card" (see an example here), which describes the content of the dataset and welcomes information about the curation process that led to its creation.
The event was an inspiring moment for both our team and the community, and we will definitely organize more community events in the future!
Hugging Face Awarded Best Demo Paper at EMNLP
We were honored to be awarded the Best Demo Paper for "Transformers: State-of-the-Art Natural Language Processing" at EMNLP, one of the top academic NLP conferences 🥰
Thank you to our wonderful team members and the fantastic community of contributors who make the library possible! 🤗🤗🤗
Some of our team also shared paper recommendations from the conference in our forum; feel free to respond with any paper suggestions or comments of your own!
🔥 Introducing private models for community and organizations 🔥
We'll share a deep dive soon; in the meantime, here's the TL;DR:
👩‍🔬 Community 👉🏻 Our new Supporter plan offers private model hosting, early beta access, and helps support our open-source efforts!
Organizations 👉🏻 Our new Lab and Startup inference plans accelerate organizations' NLP roadmaps with a complete solution to serve private models from research to production.
See the pricing page on our website for more information on how you can access these features and support our mission of solving NLP 🤗
Hugging Face + Qualcomm Partnership
We are excited to announce our partnership with Qualcomm to put cutting-edge language technology right into the palms of hundreds of millions of individuals around the world. We look forward to creating a future where anyone can communicate with any person or business around the world in their own words and in their own language.
Watch our CEO Clément Delangue discuss with Qualcomm CEO Cristiano Amon how Snapdragon 5G mobile platforms and Hugging Face will enable smartphone users to communicate faster and better, in any language.
Thanks to the whole Qualcomm team for your partnership and for inviting us to the Tech Summit virtual stage! 🔥
🤗 Transformers v4.0.0 is now out!
New features include:
- Fast Tokenizers by default
- Self-documented outputs by default
- SentencePiece becomes an optional dependency
- Model templates
- T5 and mT5
JAX/Flax + Transformers Integration
Any JAX/Flax lovers out there? Ever wanted to use 🤗 Transformers with all the awesome features of JAX? Well, you're in luck!
We've worked with the Google Flax team to enable support for BERT and RoBERTa!
The Cloud TPU VMs alpha was just announced at the NeurIPS 2020 virtual conference and is already integrated into 🤗 Transformers!
Check out our new Flax pretraining example leveraging Flax/JAX & Cloud TPUs along with the full 🤗 stack (Datasets, Tokenizers & Transformers) 🦾
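As a quick illustration, a forward pass through Flax BERT looks roughly like this (a minimal sketch, assuming jax, flax, and a recent 🤗 Transformers are installed; depending on the checkpoint you may also need from_pt=True to convert PyTorch weights):

```python
from transformers import BertTokenizerFast, FlaxBertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = FlaxBertModel.from_pretrained("bert-base-cased")

# The Flax model consumes NumPy/JAX arrays directly
inputs = tokenizer("Hugging Face loves JAX!", return_tensors="np")
outputs = model(**inputs)

last_hidden_state = outputs[0]
print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```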
mT5, including all pre-trained models, is now available in 🤗 Transformers
The powerful mT5 model, a multilingual variant of T5, is now part of Transformers, along with all of its pre-trained checkpoints.
mT5 was trained on 101 languages and yields SOTA results on many multilingual tasks.
Official paper | Official Results
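Loading mT5 follows the usual Transformers pattern; here is a minimal, hedged sketch with the google/mt5-small checkpoint (the sentencepiece extra is required for the tokenizer, and the raw pre-trained checkpoints still need fine-tuning on a downstream task before their generations are useful):

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# mT5 is a text-to-text model: encode text, generate text back
inputs = tokenizer("Hugging Face est une entreprise franco-américaine.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```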
🚨 New pre-print on avoiding dataset biases
In work led by Research Scientist Victor Sanh, we show a method to train a model to ignore dataset biases without explicitly identifying or modeling them, by learning from the errors of a "dumb" model.
Special shout-out to collaborators Thomas Wolf, Yonatan Belinkov, and Sasha Rush.
Our work builds on top of previous works (1, 2, 3) which explicitly construct a biased model (e.g. a hypothesis-only model for NLI) and use it to improve the robustness of the main model via product-of-experts.
The assumption of knowledge of the underlying dataset bias is quite restrictive: finding biases in datasets can be costly and time-consuming.
We show that we don't need such an explicit formulation of the dataset biases.
More results & analysis are in the paper, and you can see Victor's full tweet here!
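For intuition, the product-of-experts recipe combines the main model's and the weak model's predictions during training, so the main model gets little gradient signal on examples the weak model already solves via the bias. A hedged sketch of that loss in PyTorch (not the paper's exact code, just the generic product-of-experts formulation):

```python
import torch
import torch.nn.functional as F

def product_of_experts_loss(main_logits, weak_logits, labels):
    # Combine the two "experts" in log space; the weak model is treated
    # as fixed so gradients only flow through the main model.
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(weak_logits.detach(), dim=-1)
    return F.cross_entropy(combined, labels)

# Toy usage on a 3-class task with random logits
main_logits = torch.randn(4, 3, requires_grad=True)
weak_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
product_of_experts_loss(main_logits, weak_logits, labels).backward()
```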
Model Parallelism is now part of 🤗 Transformers!
We've released a utility to distribute a model over several GPUs, enabling certain large models such as GPT-2 XL to be trained on multiple GPUs.
So far it's been implemented for GPT-2 and T5. Thank you for the very clean contribution, Alexander Orona. Try it now by installing from master!
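Concretely, the supported models expose a parallelize() method; here is a minimal sketch (assuming a machine with at least two GPUs and enough total memory for GPT-2 XL):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Spread the transformer blocks across all visible GPUs; a custom
# device_map (GPU index -> list of block indices) can also be passed.
model.parallelize()

inputs = tokenizer("Model parallelism lets us", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_length=30)
print(tokenizer.decode(outputs[0]))

# Move everything back to a single device when done
model.deparallelize()
```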
🤗 Transformers is starting to work with structured data!
We just released 🤗 Transformers v4.1.1 with TAPAS, a multi-modal model from Google AI for question answering on tabular data.
Try it out through Transformers or our live Inference API widget.
TAPAS was built by Jonathan Herzig, Paweł Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos.
The model is released alongside a TableQuestionAnsweringPipeline, available in v4.1.1.
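Here is roughly what the new pipeline looks like in practice (a hedged sketch; the default TAPAS checkpoint is downloaded automatically, also requires the torch-scatter package, and expects every table cell as a string):

```python
import pandas as pd
from transformers import pipeline

table_qa = pipeline("table-question-answering")

table = pd.DataFrame({
    "Repository": ["transformers", "datasets", "tokenizers"],
    "Stars": ["36000", "7000", "4000"],  # cells must be strings
})

result = table_qa(table=table, query="How many stars does the datasets repository have?")
print(result["answer"])
```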
✨ Other highlights of this release are:
- MPNet model
- Model parallelization
- Sharded DDP using Fairscale
- Conda release
- Examples & research projects
Beam search doesn't have to be boring!
Now you can bring more variety into your beam search with Diverse Beam Search.
Huge thanks to Ayush Jain for the PR!
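Diverse beam search is exposed directly through generate(): the beams are split into groups, and groups are penalized for producing the same tokens. A hedged sketch with GPT-2 (num_beam_groups and diversity_penalty are the relevant parameters):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The best thing about open source is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=6,
    num_beam_groups=3,       # split the 6 beams into 3 groups
    diversity_penalty=1.0,   # penalize groups for repeating each other
    num_return_sequences=3,
    max_length=30,
    pad_token_id=tokenizer.eos_token_id,
)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```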
Other Model Hub Highlights
MPNet (NeurIPS 2020)
MPNet, a new pretrained language model from Microsoft that combines the advantages of masked language modeling and permuted language modeling, is now available in 🤗 Transformers along with its pretrained weights. You can load it directly from the 🤗 model hub.
Doc: https://huggingface.co/transformers/model_doc/mpnet.html
Paper: https://proceedings.neurips.cc/paper/2020/file/c3a690be93aa602ee2dc0ccab5b7b67e-Paper.pdf
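Loading it follows the usual Auto-class pattern; a minimal, hedged sketch with the microsoft/mpnet-base checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModel.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet combines masked and permuted language modeling.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```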
Efficient Models: Long-range Transformers
Long-range Transformers use improved attention mechanisms to model longer contexts while reducing memory use. These models are also available in our model hub!
Longformer (AI2), Reformer (Google), Funnel Transformer (Google/CMU)
Here is a good survey on this topic: Efficient Transformers: A Survey
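For example, Longformer handles inputs of up to 4,096 tokens out of the box; a hedged sketch with the allenai/longformer-base-4096 checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

long_text = "Long documents no longer have to be truncated to 512 tokens. " * 100
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.last_hidden_state.shape)
```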
Community
🔥 Top Contributors 🔥
This issue, we're highlighting the 275+ participants who joined us in our Datasets sprint! Check out the list of GitHub contributors here.
Want to be featured in a future newsletter? A great way to contribute is to check out these good first issues!
Tutorials
Model Deletion
Uploaded a model in the wrong organization? Want to start again from scratch? You can now delete a model directly from the 🤗 Hub, in 2 clicks!
⚠️ Beware, any deletion is a permanent action; there is no going back!
Fine-tuning a model on a token classification task
Fine-tuning a Transformers model on a token classification task is made easy with a new tutorial leveraging the Trainer API and the 🤗 Datasets and Tokenizers libraries!
Question Answering Tutorial
Having trouble with question answering tasks? We have a tutorial for you too!
Pre-processing is usually super tedious with these tasks, but it's made simple here thanks to the amazing features of the 🤗 Tokenizers library (mapping from tokens to character positions, automatically splitting long contexts into multiple segments, etc.).
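Those two features correspond to the return_offsets_mapping and return_overflowing_tokens / stride options of the fast tokenizers; a hedged sketch of the typical question-answering pre-processing call:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer by default

question = "Where does the library come from?"
context = "Hugging Face Transformers started out as pytorch-pretrained-bert. " * 50

encoded = tokenizer(
    question,
    context,
    truncation="only_second",        # only ever truncate the context, never the question
    max_length=384,
    stride=128,                      # overlap between consecutive context chunks
    return_overflowing_tokens=True,  # split long contexts into multiple segments
    return_offsets_mapping=True,     # map each token back to character positions
)

print(len(encoded["input_ids"]))         # number of segments produced
print(encoded["offset_mapping"][0][:5])  # (start_char, end_char) pairs for the first tokens
```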
Automatic Text Classification
Take a look at one of our most popular features in action in this tutorial led by our CEO, Clément. Automatic text classification is now easier than ever for software engineers, thanks to our Inference API.
Let us know if you're interested in learning more, and our CEO, Clément, and Product Director, Jeff Boudier, would be happy to brainstorm with you about how you could use it in your products or workflows.
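Under the hood, the Inference API is a plain HTTP endpoint; here is a hedged sketch of a text-classification request (the model name is just an example of a public sentiment model, and YOUR_API_TOKEN is a placeholder for your own token):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}  # placeholder token

response = requests.post(API_URL, headers=headers, json={"inputs": "This newsletter is great!"})
print(response.json())  # e.g. a list of label/score pairs
```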
Events & Talks
Hugging Face at NeurIPS
Our researchers Victor Sanh, Thomas Wolf and Alexander Rush were at NeurIPS 2020 to present their work on extreme sparsity and interact with the broader ML community. They argue that in the context of transfer learning, one should use pruning methods that consider the changes of weights during fine-tuning.
Paper: https://arxiv.org/abs/2005.07683
Code: https://huggingface.co/mvp
NLP-OSS: An Introduction to Transfer Learning in NLP and HuggingFace
In this talk, Hugging Face CSO Thomas Wolf introduces the recent breakthroughs in NLP that resulted from the combination of Transfer Learning schemes and Transformer architectures.
The second part of the talk is dedicated to an introduction of the open-source tools released by Hugging Face, in particular our Transformers, Tokenizers, and Datasets libraries and our models.
Overcoming the challenges of computational linguistics | Thomas Wolf
Thomas Wolf, the Co-Founder & Chief Scientist from Hugging Face, joins Engati CX to discuss the challenges of computational linguistics and how to overcome them.