The new release of transformers brings a complete overhaul of the system for sharing model weights, introducing a brand new feature: model versioning, based on the git versioning system and git-lfs, a git extension for versioning large files.
This version introduces the concept of revisions, allowing weights to be accessed with a given identifier: a tag, a branch, or a commit hash. This is accompanied by a rework of the model hub's file user interface, showcasing the history of files (tokenizer, configuration, model files) as well as the diffs between them.
Find out more about this in the original discussion and in the original PR.
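As a quick, hedged illustration of how this looks from the library side (the checkpoint name and revision value are chosen purely for illustration), the `revision` argument of `from_pretrained` can pin a download to a branch, tag, or commit hash:

```python
# Minimal sketch: pinning a checkpoint to a specific revision on the Model Hub.
# The checkpoint name and revision value here are illustrative assumptions.
from transformers import AutoModel, AutoTokenizer

# `revision` accepts a branch name, a tag, or a commit hash
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", revision="main")
model = AutoModel.from_pretrained("bert-base-uncased", revision="main")
```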
TensorFlow Encoder-Decoder Models
Following the port of the BART encoder-decoder model to TensorFlow, four more now join it:
- mBART - Multilingual BART
- MarianMT - Language pair translations
- Pegasus - State-of-the-art Summarization
- BlenderBot - end-to-end neural dialog system
New and Updated Scripts
We have added new example scripts focusing on how to leverage the 🤗 Datasets library and the Trainer API. These scripts are meant to be easy to customize, with lots of comments explaining the various steps.
The following tasks are now covered:
With more to come soon!
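To give a flavour of the pattern these scripts follow, here is a minimal, hedged sketch of fine-tuning with the Trainer API on a dataset loaded from the 🤗 Datasets library; the checkpoint and dataset names are illustrative and not taken from the example scripts themselves:

```python
# Minimal sketch: fine-tuning a sequence classifier with 🤗 Datasets + Trainer.
# Checkpoint and dataset names are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load and tokenize the dataset; columns the model doesn't use are dropped by the Trainer.
dataset = load_dataset("imdb")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output", num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```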
Our API now includes a brand new pipeline: zero-shot text classification 🤗
This feature lets you classify sequences into the specified class names out-of-the-box without any additional training in a few lines of code! 🚀
You can try it out right on our website, or read about how it works in our blog post on Zero Shot Learning in Modern NLP.
You can also use this pipeline directly within the Transformers library. Check out our demo colab notebook.
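For reference, here is what a minimal call looks like within the library itself; the example text and candidate labels are illustrative, and the pipeline falls back to its default NLI checkpoint:

```python
# Minimal sketch: zero-shot text classification with the Transformers pipeline.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "The team scored in the final minute to win the championship.",
    candidate_labels=["sports", "politics", "cooking"],
)
print(result["labels"][0])  # highest-scoring label
```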
Excited to share our latest work on extreme pruning in the context of transfer learning 🧀, led by Scientist Victor Sanh in collaboration with CSO Thomas Wolf and VP of Science Sasha Rush!
95% of the original performance with only ~5% of the encoder weights remaining 🦾
We will present this work at NeurIPS 2020 in early December!
Efficient Models: Dynamic Acceleration 🚀
In addition to the statically accelerated models we introduced in the previous newsletter, there are also many dynamic models that you can tweak to work as you wish! DeeBERT exits inference early once an intermediate classifier is confident enough in its prediction, while PABEE employs a patience-based "early stopping" mechanism for inference.
DynaBERT can flexibly adjust its size and latency by selecting an adaptive width and depth. Try them out!
Language Spotlight: Japanese
Japanese (日本語, Nihongo) is an East Asian language spoken by about 128 million people, primarily in Japan, where it is the national language. It is often considered a "language isolate" with no genetic relationship to other languages. Check out our Japanese models here.
ProphetNet is a new pretrained seq2seq model from Microsoft Research, and we just added it to 🤗 Transformers! Instead of predicting only the next token, ProphetNet predicts the next n-gram, giving it a better view into the "future." Both English and multilingual pretrained models are available on the 🤗 Model Hub.
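As a hedged sketch of how the model can be loaded (the English checkpoint name microsoft/prophetnet-large-uncased is one of the released models on the Hub; the prompt is illustrative):

```python
# Minimal sketch: loading ProphetNet and generating from a short prompt.
from transformers import ProphetNetForConditionalGeneration, ProphetNetTokenizer

name = "microsoft/prophetnet-large-uncased"
tokenizer = ProphetNetTokenizer.from_pretrained(name)
model = ProphetNetForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("Microsoft Research released ProphetNet.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```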
Our team is working hard alongside the community to add incredible NLP datasets to our 🤗 Datasets library. Here's a spotlight on a few of our recent additions.
One of the key metrics for success in creating world-changing text generation is a model's ability to use "common sense" 🧠
A group of researchers from the University of Southern California has introduced a new benchmark called CommonGen, already included in the 🤗 Datasets library, that measures a model's ability to connect a set of related concepts using commonsense knowledge.
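If you want to poke at it, here is a hedged sketch of loading it from 🤗 Datasets (assuming the common_gen dataset identifier):

```python
# Minimal sketch: loading the CommonGen benchmark from 🤗 Datasets.
from datasets import load_dataset

common_gen = load_dataset("common_gen")
print(common_gen["train"][0])  # a set of concepts plus a reference sentence
```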
Amazon Reviews is one of the largest canonical datasets for sentiment analysis in NLP. The dataset contains the text for over 130 million customer reviews of Amazon products with annotated data including the 1-5 star rating, the number of “helpful” votes, and the product ID and category ⭐️⭐️⭐️⭐️⭐️
If you’re interested in enormous datasets with continuous sentiment ratings and plentiful metadata, check this one out 🔢
XNLI Training Set
We already had the canonical test and validation sets, and now we’ve added the recently translated training sets as well. XNLI is a cross-lingual version of the MultiNLI dataset with translations in 15 languages 🌏
This data can be used to evaluate cross-lingual models or even train a multilingual zero-shot text classifier like this one. Check it out!
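As a hedged sketch of that second use case, a multilingual NLI checkpoint from the Hub (joeddav/xlm-roberta-large-xnli, assumed here for illustration) can be dropped straight into the zero-shot pipeline:

```python
# Minimal sketch: multilingual zero-shot classification with an XNLI checkpoint.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification", model="joeddav/xlm-roberta-large-xnli"
)
result = classifier(
    "¿A quién vas a votar en 2020?",  # Spanish input
    candidate_labels=["Europa", "salud pública", "política"],
)
print(result["labels"][0])
```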
Quail QA: Multi-domain Reading Comprehension
This brand new dataset includes passages, questions, and answers for reading comprehension. The domains include news 📰, blogs 🖥, fiction 📖, and user stories 📝, with hundreds of examples in each category. Domain diversity mitigates the issue of possible overlap between the training and test data of the large pre-trained models that current SOTA systems are based on. Try it out!
Original Chinese Natural Language Inference (OCNLI)
OCNLI is a corpus for Chinese Natural Language Inference, collected by closely following the procedures of MNLI but with enhanced strategies aimed at more challenging inference pairs. Unlike some other non-English NLI datasets, OCNLI does not use human- or machine-translated examples: all texts are original Chinese. The dataset is included as part of the clue dataset. Check out the paper here.
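A hedged sketch of loading it (assuming the clue dataset identifier with the ocnli configuration):

```python
# Minimal sketch: loading OCNLI through the clue dataset in 🤗 Datasets.
from datasets import load_dataset

ocnli = load_dataset("clue", "ocnli")
print(ocnli["train"][0])  # a sentence pair and its inference label
```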
Starting Integration of Transformers with JAX 🚀
We are collaborating with Google's Flax team to integrate the Flax functional API on top of all the very cool features of JAX. While this collaboration is just beginning, you should already be able to use BERT & RoBERTa models right away.
This integration will come with the same unified API you see in our PyTorch and TensorFlow models. Expect more to come in the following weeks, and do not hesitate to let us know if you would like to see other models supported 🔥
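As a hedged sketch of what this looks like today (treat the class name and call pattern as assumptions about a work-in-progress integration rather than a finished API):

```python
# Minimal sketch: running a BERT checkpoint through the Flax/JAX integration.
from transformers import BertTokenizer, FlaxBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = FlaxBertModel.from_pretrained("bert-base-cased")

inputs = tokenizer("Hello, JAX!", return_tensors="np")  # NumPy arrays for Flax
outputs = model(**inputs)
```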
Hugging Face Transformers v4.0.0 is on the horizon with some cool new features on the way! All details are here if you want to get prepared.