News

Transformers v3.5.0
Model Versioning
The new release of transformers
brings a complete overhaul of the weight-sharing system, introducing a brand new feature: model versioning, built on git and git-lfs, a git extension for versioning large files.
This version introduces the concept of revisions, allowing weights to be accessed with a given identifier: a tag, branch, or commit hash. It is accompanied by a rework of the model hub's files user interface, which now shows the history of each file (tokenizer, configuration, model weights) as well as diffs between versions.
Find more about this in the original discussion and in the original PR.
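In practice, the revision is passed straight to `from_pretrained`. Here's a minimal sketch (using the default `main` branch for illustration; a tag or commit hash works the same way):

```python
from transformers import AutoModel, AutoTokenizer

# Pin the downloaded files to a specific revision: a branch name, tag, or commit hash.
# "main" is the default branch; swap in a tag or a full commit hash to freeze an exact version.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", revision="main")
model = AutoModel.from_pretrained("bert-base-uncased", revision="main")
```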
TensorFlow Encoder-Decoder Models
Following the port of the BART encoder-decoder model to TensorFlow, four more models have now been ported (see the usage sketch after the list):
- mBART - Multilingual BART
- MarianMT - Language pair translations
- Pegasus - State-of-the-art summarization
- BlenderBot - End-to-end neural dialog system
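Here's a minimal sketch using the TensorFlow MarianMT port for English-to-German translation (pass `from_pt=True` to `from_pretrained` if a checkpoint only ships PyTorch weights):

```python
from transformers import AutoTokenizer, TFMarianMTModel

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFMarianMTModel.from_pretrained(model_name)

# Tokenize, generate, and decode the translation.
batch = tokenizer(["I love open-source machine learning!"], return_tensors="tf")
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```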
New and Updated Scripts
We've added new examples focusing on how to leverage the 🤗 Datasets library and the Trainer API. These scripts are meant to be easy-to-customize examples, with plenty of comments explaining the various steps.
Several common tasks are already covered, with more to come soon!

New Addition to our API: Zero Shot Text Classification
Our API now includes a brand new pipeline: zero-shot text classification 🤗
This feature lets you classify sequences into any class names you specify, out of the box and without any additional training, in just a few lines of code! 🚀
You can try it out right on our website, or read about how it works in our blog post on Zero Shot Learning in Modern NLP.
You can also use this pipeline directly within the Transformers library. Check out our demo colab notebook.
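Here's a minimal sketch of the pipeline in action (the sequence and candidate labels are just illustrative):

```python
from transformers import pipeline

# Build the zero-shot classification pipeline (downloads a default NLI model).
classifier = pipeline("zero-shot-classification")

result = classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "business", "entertainment"],
)
# Labels come back sorted by score, highest first.
print(result["labels"][0], result["scores"][0])
```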

Movement Pruning: Adaptive Sparsity
We're excited to share our latest work on extreme pruning in the context of transfer learning 🧀 led by Research Scientist Victor Sanh in collaboration with CSO Thomas Wolf and VP of Science Sasha Rush!
The pruned models retain 95% of the original performance with only ~5% of the encoder weights remaining 🦾 We will present this work at NeurIPS 2020 in early December!

Pre-trained Summarization Distillation
DistilBERT is one of the most popular models on the Hugging Face model hub, but there wasn’t a clear equivalent for Seq2Seq models. Now there is!
We're happy to introduce our paper on “Pre-trained Summarization Distillation” written by Research Engineer Sam Shleifer and VP of Science Sasha Rush.
We found that the simple method of copying layers from teacher to student and then fine-tuning the student model serves as a strong baseline on the CNN/DM dataset.
On the XSUM dataset, it is competitive with more expensive methods such as pseudo-labeling and knowledge distillation.
- All the pre-trained distilled models
- Docs for distilling new models
- Forum thread on the topic
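As a quick sketch, one of the released distilled checkpoints ("sshleifer/distilbart-cnn-12-6", a 12-encoder-layer / 6-decoder-layer student trained on CNN/DM) can be dropped straight into the summarization pipeline:

```python
from transformers import pipeline

# Load a distilled summarization checkpoint into the high-level pipeline.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = "Replace this placeholder with the article you want to summarize ..."
print(summarizer(article, max_length=60, min_length=20))
```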
🚀 Model Hub Highlights 🚀
Efficient Models: Dynamic Acceleration 🚀
In addition to the static acceleration models we introduced in the previous newsletter, there are also many dynamic models that you can tweak to fit your needs! DeeBERT stops inference early once its intermediate predictions are confident enough. PABEE employs an “early stopping” mechanism for inference.
DynaBERT can flexibly adjust its size and latency by selecting an adaptive width and depth. Try them out!
Language Spotlight: Japanese
Japanese (日本語, Nihongo) is an East Asian language spoken by about 128 million people, primarily in Japan, where it is the national language. It is often considered a “language isolate” with no established genetic relationship to other languages. Check out our Japanese models here.
ProphetNet 🔮
ProphetNet is a new pretrained seq2seq model from Microsoft Research, and we just added it to 🤗 Transformers! Instead of predicting only the next token, ProphetNet predicts the next n-gram, giving it a better view into the "future." Both English and multilingual pretrained models are available on the 🤗 Model Hub.
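A minimal loading sketch, assuming the English checkpoint below (fine-tuned variants for downstream tasks also live on the Hub):

```python
from transformers import ProphetNetForConditionalGeneration, ProphetNetTokenizer

# Load the English ProphetNet checkpoint; a multilingual XLM-ProphetNet variant is also available.
model_name = "microsoft/prophetnet-large-uncased"
tokenizer = ProphetNetTokenizer.from_pretrained(model_name)
model = ProphetNetForConditionalGeneration.from_pretrained(model_name)

# Generate from the pretrained model (for real tasks you'd fine-tune it first).
inputs = tokenizer("hugging face adds prophetnet to transformers .", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"],
                         attention_mask=inputs["attention_mask"], max_length=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```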
🔦Dataset Spotlight 🔦
Our team is working hard alongside the community to add incredible NLP datasets to our 🤗 Datasets library. Here's a spotlight of a few of our recent additions.
CommonGen
One of the key metrics for success in creating world-changing text generation is a model's ability to use "common sense" 🧠
A group of researchers from the University of Southern California has introduced a new benchmark called CommonGen, already included in the 🤗 Datasets library, that measures a model's ability to connect a set of related concepts using commonsense knowledge.
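A minimal loading sketch, assuming the dataset identifier `common_gen`:

```python
from datasets import load_dataset

# Each example pairs a set of concepts with a reference sentence that connects them.
common_gen = load_dataset("common_gen")
print(common_gen["train"][0])
```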
Amazon Reviews
Amazon Reviews is one of the largest canonical datasets for sentiment analysis in NLP. The dataset contains the text of over 130 million customer reviews of Amazon products, along with metadata including the 1-5 star rating, the number of “helpful” votes, and the product ID and category ⭐️⭐️⭐️⭐️⭐️
If you’re interested in enormous datasets with continuous sentiment ratings and plentiful metadata, check this one out 🔢
XNLI Training Set
We already had the canonical test and validation sets, and now we’ve added the recently translated training sets as well. XNLI is a cross-lingual version of the MultiNLI dataset with translations in 15 languages 🌏
This data can be used to evaluate cross-lingual models or even train a multilingual zero-shot text classifier like this one. Check it out!
Quail QA: Multi-domain Reading Comprehension
This brand new dataset includes passages, questions, and answers for reading comprehension. The domains include news 📰, blogs 🖥, fiction 📖, and user stories 📝, with hundreds of examples in each category. Domain diversity mitigates possible overlap between the test data and the training data of the large pre-trained models that current SOTA systems are built on. Try it out!
Original Chinese Natural Language Inference (OCNLI)
OCNLI is a corpus for Chinese natural language inference, collected by closely following the procedures of MNLI but with enhanced strategies aimed at more challenging inference pairs. Unlike some other non-English NLI datasets, OCNLI does not use human- or machine-translated examples: all texts are original Chinese. The dataset is included as part of the clue dataset. Check out the paper here.
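A minimal loading sketch, assuming the `ocnli` configuration of the `clue` dataset:

```python
from datasets import load_dataset

# OCNLI ships as one configuration of the CLUE benchmark collection.
ocnli = load_dataset("clue", "ocnli")
print(ocnli["train"][0])
```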
Starting Integration of Transformers with JAX 🚀
We are collaborating with Google's Flax team to integrate the Flax functional API on top of all the very cool features of JAX. While this collaboration with the Google Flax team is only just beginning, you should already be able to use BERT & RoBERTa models right away.
This integration will come with the same unified API you see in our PyTorch and TensorFlow models. Expect more to come in the following weeks and do not hesitate to let us know if you would like to see other models supported 🔥
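A rough sketch of what early usage could look like, assuming the experimental `FlaxBertModel` class exposed by this first release of the integration:

```python
from transformers import BertTokenizerFast, FlaxBertModel

# Load the Flax/JAX version of BERT; the API mirrors the PyTorch and TensorFlow classes.
# If Flax weights are not yet available for a checkpoint, from_pt=True converts the PyTorch ones.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = FlaxBertModel.from_pretrained("bert-base-cased")

# NumPy arrays feed directly into the Flax model.
inputs = tokenizer("Hello, JAX!", return_tensors="np")
outputs = model(**inputs)
```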
🤫 Sneak Peek
Hugging Face Transformers v4.0.0 is on the horizon with some cool new features on the way! All details are here if you want to get prepared.
Community
Porting fairseq wmt19 translation system to transformers
In this guest blog post, Hugging Face contributor Stas Bekman documents how the fairseq wmt19 translation system was ported to transformers.
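The ported checkpoints can be used like any other translation model in the library; a minimal sketch:

```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

# One of the four ported WMT19 checkpoints (en-de, de-en, en-ru, ru-en).
model_name = "facebook/wmt19-en-de"
tokenizer = FSMTTokenizer.from_pretrained(model_name)
model = FSMTForConditionalGeneration.from_pretrained(model_name)

inputs = tokenizer("Machine learning is great, isn't it?", return_tensors="pt")
outputs = model.generate(inputs["input_ids"])
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```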

Comet ❤️ Hugging Face Integration
We partnered with Comet in this new integration. In this blog post, Comet shows you how to get started with auto-logging model metrics and parameters to Comet from the Hugging Face transformers library.
Also in partnership with Comet, our Research Scientist Victor Sanh participated in an online panel to answer the question, "How do top AI researchers from Google, Stanford and Hugging Face approach new ML problems?"
🔥Top Contributors 🔥
Every newsletter, we'll be highlighting some top contributors to the Hugging Face library!
This week's top contributors:
- Weizhen Qi - Added ProphetNet Model.
- Stas Bekman - Multiple fixes and improvements. 🔥All-time top contributor🔥
- Moussa Kamal Eddine - Added Barthez model, a French BART model.
- Robert Mroczkowski - Added HerBERT, a Polish BERT model.
- Ratthachat Chatpatanasiri - Added TFDPR, the TensorFlow version of DPR.
- Jonathan Chang - Added batch generation for GPT2.
- Yossi Synett - Added cross-attention output for all encoder-decoder models.
- Vlad - Integrated MLflow with the Trainer.
- Guillaume Filion - Refactored Longformer attention output.
Want to be featured? A great way to contribute is to check out these good first issues!
Tutorials
Fine-tuning a Model on a Text Classification Task
Training a transformer model for text classification has never been easier. Pick a model checkpoint from the 🤗 Transformers library and a dataset from the 🤗 Datasets library, then fine-tune your model on the task with the built-in Trainer!
This notebook example by Research Engineer Sylvain Gugger uses the awesome 🤗 Datasets library to load the data quickly and have all the preprocessing done in just one line.
Then you can finish with a hyperparameter search to best tune your model!
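Here's a condensed sketch of that workflow (the MRPC task and hyperparameters are illustrative, not the notebook's exact settings):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load a GLUE task and a pre-trained checkpoint.
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def preprocess(examples):
    # Tokenize the sentence pairs; fixed-length padding keeps the default collator happy.
    return tokenizer(examples["sentence1"], examples["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

encoded = raw_datasets.map(preprocess, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

args = TrainingArguments("mrpc-finetuned", per_device_train_batch_size=16,
                         num_train_epochs=3, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"], eval_dataset=encoded["validation"])
trainer.train()
```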

#HuggingTweets Video Tutorial
In our latest video tutorial, our CEO, Clément Delangue, walks you through fine-tuning a language model with #huggingtweets (courtesy of 🤖 Boris Dayma). He'll also show you how to use the model hub and cover zero-shot classification & open-domain question answering!
If you're not subscribed to our YouTube channel, make sure you subscribe today so you don't miss a new video!
Fine-Tuning a Language Model
In this notebook by Research Engineer Sylvain Gugger, you'll learn how to fine-tune one of the 🤗 Transformers models on a language modeling task. We cover two types of language modeling:
- Causal language modeling: the model has to predict the next token in the sentence (so the labels are the same as the inputs, shifted to the right). To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token i when trying to predict token i+1 in the sentence.
- Masked language modeling: the model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the masked tokens to predict their values.
You can easily load and preprocess the dataset for each one of those tasks and use the Trainer API to fine-tune a model on it.
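One way to see the difference in code is the `mlm` flag of the library's `DataCollatorForLanguageModeling`; a small sketch (not necessarily the notebook's exact setup):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Causal LM: labels are the inputs themselves; the model shifts them internally,
# and the causal attention mask hides token i+1 while it is being predicted.
causal_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
causal_collator = DataCollatorForLanguageModeling(tokenizer=causal_tokenizer, mlm=False)

# Masked LM: randomly mask about 15% of the tokens and ask the model to recover them.
masked_tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
masked_collator = DataCollatorForLanguageModeling(tokenizer=masked_tokenizer,
                                                  mlm=True, mlm_probability=0.15)
```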

Hyperparameter Search with Transformers and Ray Tune
We were so excited to team up with Ray and Anyscale to provide a simple yet powerful integration in transformers for hyperparameter tuning. Ray Tune is a popular Python library for hyperparameter tuning that provides many state-of-the-art algorithms out of the box, along with integrations with best-in-class tooling such as Weights & Biases and TensorBoard.
To demonstrate this new Hugging Face + Ray Tune integration, we leverage the Hugging Face Datasets library to fine-tune BERT on MRPC.
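A minimal sketch of the integration, assuming `encoded` is the tokenized MRPC dataset from the text-classification sketch above and that Ray Tune is installed (`pip install "ray[tune]"`):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # The search instantiates a fresh model for every trial.
    return AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# `encoded` comes from the earlier text-classification sketch.
trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments("ray-tune-search", per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)

# Ray Tune drives the search; n_trials controls how many configurations are sampled.
best_run = trainer.hyperparameter_search(backend="ray", n_trials=10)
print(best_run.hyperparameters)
```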

Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models
Encoder-Decoder models don't need costly pre-training to yield state-of-the-art results on seq2seq tasks!
This blog post written by Machine Learning Engineer Patrick von Platen shows you how to leverage pre-trained BERT-like checkpoints for Encoder-Decoder models to save money in training.
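The core idea can be sketched with the library's `EncoderDecoderModel`, which warm-starts a seq2seq model from two pre-trained checkpoints:

```python
from transformers import BertTokenizerFast, EncoderDecoderModel

# Warm-start both the encoder and the decoder from a pre-trained BERT checkpoint.
# The decoder's cross-attention weights are newly initialized and learned during
# fine-tuning on the downstream seq2seq task (e.g. summarization).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
```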
Events & Talks

Hugging Face Presents at Machine Learning Tokyo
Machine Learning Tokyo welcomed Thomas Wolf, Co-founder and Chief Science Officer at HuggingFace, for a talk on "An introduction to transfer learning in NLP and HuggingFace".
To start, Thomas introduces the recent breakthroughs in NLP that resulted from the combination of transfer learning schemes and Transformer architectures. The second part of the talk is dedicated to an introduction to the open-source tools released by Hugging Face, in particular the Transformers, Tokenizers, and Datasets libraries and models. Join the discussion on our forum!

Hugging Face presents at Chai Time Data Science
In this video, host of Chai Time Data Science, Sanyam Bhutani, interviews Hugging Face CSO, Thomas Wolf.
They talk about Thomas's journey into the field, from his work in many different areas to how he followed his passions and eventually landed in NLP and the world of transformers.
Another great video if you're looking to learn more about Hugging Face!