News
Transformers gets a new release: v3.1.0
This new version is the first PyPI release to feature:
- The PEGASUS models, the current state of the art in summarization
- DPR, for open-domain Q&A research
- mBART, a multilingual encoder-decoder model trained using the BART objective
Alongside the three new models, we are also releasing a long-awaited feature: “named outputs”. By passing `return_dict=True`, model outputs can now be accessed as named values as well as by index.
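As a minimal sketch (using a tiny randomly initialized BERT so it runs without downloading a checkpoint; with a pretrained model the usage is identical):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly initialized model; a real use case would call
# BertModel.from_pretrained(...) instead.
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
)
model = BertModel(config)

input_ids = torch.randint(0, 100, (1, 8))
outputs = model(input_ids, return_dict=True)

# The same tensor is reachable by name and by index.
assert torch.equal(outputs.last_hidden_state, outputs[0])
```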
Two new pipelines are added with version 3.1.0:
- A zero-shot pipeline, for classifying sequences into specified labels without any additional training needed
- A dialogue pipeline, for a conversation between model & user
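The zero-shot pipeline, for example, can be used roughly like this (the default checkpoint is downloaded on first use; the labels here are just illustrative):

```python
from transformers import pipeline

# Classify a sequence into arbitrary labels without any fine-tuning.
classifier = pipeline("zero-shot-classification")
result = classifier(
    "Transformers gets a new release: v3.1.0",
    candidate_labels=["software", "cooking", "sports"],
)
# result["labels"] is sorted by score, highest first.
top_label = result["labels"][0]
```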
Our work continues on several aspects of the library: simpler documentation, better TensorFlow support, and new encoder-decoder architectures. Find the full release notes here.
Release of 🤗Datasets v1.0
After a summer of hard work, we are releasing 🤗Datasets v1.0: the first stable version of our datasets and metrics library (known as “nlp” in its beta versions).
This library started as a way to simplify datasets/metrics access for researchers & teachers, and soon became a test bed for efficient and fast data loading & processing.
This v1.0 release brings many interesting features including strong speed improvements, efficient indexing capabilities, multi-modality for image and text datasets as well as many reproducibility and traceability improvements.
Notable new features:
- Pickle support
- Save and load datasets to/from disk
- Multiprocessing in map and filter
- Multi-dimensional array support for multi-modal datasets
- Faster tokenization
- Faster shuffle/shard/select methods via indices mappings
- Faster download and processing
- Indexed datasets for hybrid models (REALM/RAG/MARGE)
Many new datasets including:
- IWSLT 2017
- CommonGen Dataset
- CLUE Benchmark (11 datasets)
- The KILT knowledge source and tasks
- DailyDialog
- DoQA dataset (ACL 2020)
- reuters21578
- HANS
- MLSUM
- Guardian authorship
- web_questions
- MS MARCO
Full Changelog can be found here.
Install with `pip install datasets`
Tutorials, documentation, and details can be found in the GitHub repository at https://github.com/huggingface/datasets
We would like to give a huge thank-you to the amazing community of early contributors and supporters of the "nlp" beta for their help and contributions, in particular: Stefan Schweter, Thomas Hudson, Jared Nielsen, Jack Morris, Bharat Raghunathan, Richard Wang, Leandro von Werra, Yoav Artzi, Alessandro Suglia, Mohit Bansal, Antonio V Mendoza, Gustavo Aguilar and all the other 54 early contributors!
🚀 Model Hub Highlights 🚀
The number of models in our model hub has surpassed 3,000! Huge thanks to the hundreds of organizations and users that make this possible! 🎉
Biology + 🤗Transformers
We are glad and honored to play a role in fighting the pandemic. Led by Ahmed Elnaggar, the Rostlab at the Technical University of Munich has shared its protein models on the Hugging Face model hub. We hope these models can help scientists better understand proteins and facilitate the development of cures for diseases.
Language Spotlight: Spanish
¡Hola! Did you know that Spanish is the world's second-most spoken native language? Check out the amazing Spanish models from our model hub! Shout out to Manuel Romero, who shared so many of these awesome models!
Faster and smaller quantized NLP with Hugging Face and ONNX Runtime
Looking to serve transformers models but want to stay on CPU? Check out our newest collaboration with ONNX Runtime, led by ML Engineer Morgan Funtowicz!
Transformers models can now run at the speed of light on commodity CPU servers thanks to quantization support. You can now quantize and export Hugging Face transformers models with a single command line and leverage all the performance benefits of ONNX Runtime.
We also released a brand new documentation page to highlight the possibilities offered by ONNX/ONNX Runtime and how you can leverage both projects from the 🤗transformers repository.
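For instance, exporting and quantizing a checkpoint can look roughly like this (a sketch based on the `convert_graph_to_onnx` module shipped with transformers; the checkpoint and output path are examples):

```shell
# Export bert-base-cased to ONNX and also produce a quantized copy.
python -m transformers.convert_graph_to_onnx \
    --framework pt \
    --model bert-base-cased \
    --quantize \
    onnx/bert-base-cased.onnx
```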
First Release of Fast Block Sparse Matrices for Pytorch
Our new PyTorch CUDA extension provides a drop-in replacement for torch.nn.Linear using block sparse matrices. This functionality saves parameters, memory and time proportional to the sparsity level.
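Conceptually, a block sparse linear layer behaves like a dense one whose weight is zero outside a fixed set of blocks. A plain-PyTorch sketch of the idea (illustration only: it masks a dense weight, whereas the actual CUDA extension stores just the nonzero blocks, which is where the memory and speed savings come from):

```python
import torch

class MaskedBlockLinear(torch.nn.Module):
    """Illustrative only: a dense weight masked to a fixed block pattern."""

    def __init__(self, in_features, out_features, block=16, density=0.5):
        super().__init__()
        self.dense = torch.nn.Linear(in_features, out_features)
        # Randomly keep ~density of the (out/block) x (in/block) blocks.
        keep = torch.rand(out_features // block, in_features // block) < density
        mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
        self.register_buffer("mask", mask.float())

    def forward(self, x):
        return torch.nn.functional.linear(
            x, self.dense.weight * self.mask, self.dense.bias
        )

layer = MaskedBlockLinear(64, 64)
out = layer(torch.randn(2, 64))
```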
This functionality has been promised by OpenAI, but so far remains unfulfilled. We felt someone needed to fill this gap.
The current version supports only fixed sparsity patterns, but stay tuned: the next release will include tools to optimize sparsity patterns, which have a major impact on final network precision. Future releases will also incorporate a newer version of CUTLASS, the powerful NVIDIA tool behind ultra-fast CUDA kernels, and will support the new Ampere sparse functionality.
Of course, sparsity advantages can easily be combined with other methods like distillation and quantization to enable networks which are both much smaller and faster.
Hyper-parameter search being integrated in Trainer
You can now use optuna or Ray Tune for hyperparameter search very easily inside Trainer (support for TensorFlow is coming very soon). Just use the brand-new method `Trainer.hyperparameter_search` (and its documentation). This topic on the forum shows a full usage example and explains how to customize the objective being optimized or the search space.
Seq2Seq Generation Improvements
A sneaky bug was fixed that improves generation and fine-tuning performance for BART, Marian, mBART and PEGASUS. If you have a trained sequence-to-sequence model, you may get a nice surprise if you rerun evaluation 🙃
Create Your Own Hugging Face Organization
Now you can join the ranks of Allen AI, Microsoft, Facebook AI, Google AI, Musixmatch and dozens of others across the world.
Create or join an organization to use & share NLP models!
Zero-shot Classification Support in 100 Languages 🌍
The 🤗 Model Hub now includes a cross-lingual model that can be used with our recent zero-shot-classification pipeline, with support for up to 100 languages. Check out this Colab notebook for examples or the model page for more information.
Community
🔥Top Contributors 🔥
Every newsletter, we'll be highlighting some top contributors to the Hugging Face library!
This week's top contributors:
- Stas Bekman - Added the FairSeq machine translation model and carried out multiple great code refactorings throughout the library.
- Antonio V Mendoza - Added the dual-stream language-vision model “LXMERT”.
- Suraj Patil - Added “Text2TextGenerationPipeline” and answered many issues, especially those related to Seq2Seq.
- Kai Frick - Added checkpointing to Ray hyperparameter optimization and improved the training logger.
- Jin Young Son - Fixed TF Trainer issues related to XLA and improved the TextDataset class.
- Boris Dayma - Added configurable padding to the text generation pipeline.
- Manuel Romero - Added multiple T5 and Electra models to the model hub.
Want to be featured? A great way to contribute is to check out these good first issues!
HuggingTweets - Train a model to generate tweets
Thanks to Boris Dayma, you can now use huggingtweets to share your own tweet generator with everyone!
Models are trained with Hugging Face and tracked with Weights & Biases along with their datasets.
Share your favorite #huggingtweets with us on Twitter!
Events & Talks
Hugging Face on The AI Podcast
Research engineer Sam Shleifer spoke with The AI Podcast host, Noah Kravitz, about Hugging Face NLP technology. You can listen to the whole episode for free on Spotify.
Hugging Face at Software Freedom Day
Research Engineer Joe Davison will be speaking at Software Freedom Day (Sept. 19-20). Register here to see him and other experts talk about free and open-source software.
Hugging Face at Ray Summit
CSO Thomas Wolf will be speaking at Ray Summit (Sept. 30 - Oct. 1). Learn more about scalable machine learning and Python from Thomas and other industry experts. You can register here.
Have Ideas or Feedback for the Next Issue?
Email newsletter@huggingface.co. We would love your feedback and support!