News
Transformers gets a new release: v3.1.0
This new version is the first PyPI release to feature:
- The PEGASUS models, the current state of the art in summarization
- DPR, for open-domain Q&A research
- mBART, a multilingual encoder-decoder model trained using the BART objective
Alongside the three new models, we are also releasing a long-awaited feature: “named outputs”. By passing `return_dict=True`, model outputs can now be accessed as named values as well as by index.
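As a minimal sketch (using a tiny randomly initialized BERT so it runs without downloading a checkpoint; with a pretrained model the usage is identical):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly initialized model; a real use case would call
# BertModel.from_pretrained(...) instead.
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
)
model = BertModel(config)

input_ids = torch.randint(0, 100, (1, 8))
outputs = model(input_ids, return_dict=True)

# The same tensor is reachable by name and by index.
assert torch.equal(outputs.last_hidden_state, outputs[0])
```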
Two new pipelines are added with version 3.1.0:
- A zero-shot pipeline, for classifying sequences into specified labels without any additional training needed
- A dialogue pipeline, for a conversation between model & user
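The zero-shot pipeline, for example, can be used roughly like this (the default checkpoint is downloaded on first use; the labels here are just illustrative):

```python
from transformers import pipeline

# Classify a sequence into arbitrary labels without any fine-tuning.
classifier = pipeline("zero-shot-classification")
result = classifier(
    "Transformers gets a new release: v3.1.0",
    candidate_labels=["software", "cooking", "sports"],
)
# result["labels"] is sorted by score, highest first.
top_label = result["labels"][0]
```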
Our work continues on several aspects of the library: simpler documentation, better TensorFlow support, and new encoder-decoder architectures. Find the full release notes here.
Release of 🤗Datasets v1.0
After a summer of hard work, we are releasing 🤗Datasets v1.0: the first stable version of our datasets and metrics library (known as “nlp” in its beta versions).
This library started as a way to simplify datasets/metrics access for researchers & teachers, and soon became a test bed for efficient and fast data loading & processing.
This v1.0 release brings many interesting features including strong speed improvements, efficient indexing capabilities, multi-modality for image and text datasets as well as many reproducibility and traceability improvements.
Notable new features:
- Pickle support
- Save and load datasets to/from disk
- Multiprocessing in map and filter
- Multi-dimensional array support for multi-modal datasets
- Faster tokenization
- Faster shuffle/shard/select methods via indices mappings
- Faster download and processing
- Indexed datasets for hybrid models (REALM/RAG/MARGE)
Many new datasets including:
- IWSLT 2017
- CommonGen Dataset
- CLUE Benchmark (11 datasets)
- The KILT knowledge source and tasks
- DailyDialog
- DoQA dataset (ACL 2020)
- reuters21578
- HANS
- MLSUM
- Guardian authorship
- web_questions
- MS MARCO
Full Changelog can be found here.
Install with `pip install datasets`
Tutorials, documentation, and details can be found in the GitHub repository at https://github.com/huggingface/datasets
We would like to give a huge thank-you to the amazing community of early contributors and supporters of the "nlp" beta for their help and contributions, in particular: Stefan Schweter, Thomas Hudson, Jared Nielsen, Jack Morris, Bharat Raghunathan, Richard Wang, Leandro von Werra, Yoav Artzi, Alessandro Suglia, Mohit Bansal, Antonio V Mendoza, Gustavo Aguilar and all the other 54 early contributors!
🚀 Model Hub Highlights 🚀
The number of models in our model hub has surpassed 3,000! Huge thanks to the hundreds of organizations and users that make this possible! 🎉
Biology + 🤗Transformers
We are glad and honored to play a role in fighting the pandemic. Led by Ahmed Elnaggar, the Rostlab at the Technical University of Munich has shared its protein models on the Hugging Face model hub. We hope these models can help scientists better understand proteins and facilitate the development of cures for diseases.
Language Spotlight: Spanish
¡Hola! Did you know that Spanish is the world's second-most spoken native language? Check out the amazing Spanish models from our model hub! Shout out to Manuel Romero, who shared so many of these awesome models!
Faster and smaller quantized NLP with Hugging Face and ONNX Runtime
Looking to serve transformers models but want to stay on CPU? Check out our newest collaboration with ONNX Runtime, led by ML Engineer Morgan Funtowicz!
Transformers models can now run at the speed of light on commodity CPU servers thanks to quantization support. You can now quantize and export Hugging Face transformers models with a single command line and leverage all the performance benefits of ONNX Runtime.
We also released a brand new documentation page to highlight the possibilities offered by ONNX/ONNX Runtime and how you can leverage both projects from the 🤗transformers repository.
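For instance, exporting and quantizing a checkpoint can look roughly like this (a sketch based on the `convert_graph_to_onnx` module shipped with transformers; the checkpoint and output path are examples):

```shell
# Export bert-base-cased to ONNX and also produce a quantized copy.
python -m transformers.convert_graph_to_onnx \
    --framework pt \
    --model bert-base-cased \
    --quantize \
    onnx/bert-base-cased.onnx
```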
First Release of Fast Block Sparse Matrices for Pytorch
Our new PyTorch CUDA extension provides a drop-in replacement for torch.nn.Linear using block sparse matrices. This functionality saves parameters, memory and time proportional to the sparsity level.
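Conceptually, a block sparse linear layer behaves like a dense one whose weight is zero outside a fixed set of blocks. A plain-PyTorch sketch of the idea (illustration only: it masks a dense weight, whereas the actual CUDA extension stores just the nonzero blocks, which is where the memory and speed savings come from):

```python
import torch

class MaskedBlockLinear(torch.nn.Module):
    """Illustrative only: a dense weight masked to a fixed block pattern."""

    def __init__(self, in_features, out_features, block=16, density=0.5):
        super().__init__()
        self.dense = torch.nn.Linear(in_features, out_features)
        # Randomly keep ~density of the (out/block) x (in/block) blocks.
        keep = torch.rand(out_features // block, in_features // block) < density
        mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
        self.register_buffer("mask", mask.float())

    def forward(self, x):
        return torch.nn.functional.linear(
            x, self.dense.weight * self.mask, self.dense.bias
        )

layer = MaskedBlockLinear(64, 64)
out = layer(torch.randn(2, 64))
```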
This functionality has been promised by OpenAI, but so far remains unfulfilled. We felt someone needed to fill this gap.
The current version supports only fixed sparsity patterns, but stay tuned: the next release will include tools to optimize sparsity patterns, which have a major impact on final network precision. Future releases will also incorporate a newer version of CUTLASS, the powerful NVIDIA tool behind ultra-fast CUDA kernels, and will support the new Ampere sparse functionality.
Of course, sparsity advantages can easily be combined with other methods like distillation and quantization to enable networks which are both much smaller and faster.
Hyper-parameter search being integrated in Trainer
You can now use optuna or Ray Tune for hyperparameter search very easily inside Trainer (support for TensorFlow is coming very soon). Just use the brand-new method `Trainer.hyperparameter_search` (and its documentation). This topic on the forum shows a full usage example and explains how to customize the objective being optimized or the search space.
Seq2Seq Generation Improvements
A sneaky bug was fixed that improves generation and fine-tuning performance for BART, Marian, mBART and PEGASUS. If you have a trained sequence-to-sequence model, you may get a nice surprise if you rerun evaluation 🙃
Create Your Own Hugging Face Organization
Now you can join the ranks of Allen AI, Microsoft, Facebook AI, Google AI, Musixmatch and dozens of others across the world.
Create or join an organization to use & share NLP models!
Zero-shot Classification Support in 100 Languages 🌍
The 🤗 Model Hub now includes a cross-lingual model that can be used with our recent zero-shot-classification pipeline, with support for up to 100 languages. Check out this Colab notebook for examples or the model page for more information.
Community
🔥Top Contributors 🔥
Every newsletter, we'll be highlighting some top contributors to the Hugging Face library!
This week's top contributors:
- Stas Bekman - Added the FairSeq machine translation model and carried out multiple great code refactorings throughout the library.
- Antonio V Mendoza - Added the dual-stream language-vision model “LXMERT”.
- Suraj Patil - Added “Text2TextGenerationPipeline” and answered many issues, especially those related to Seq2Seq.
- Kai Frick - Added checkpointing to Ray hyperparameter optimization and improved the training logger.
- Jin Young Son - Fixed TF Trainer issues related to XLA and improved the TextDataset class.
- Boris Dayma - Added configurable padding to the text generation pipeline.
- Manuel Romero - Added multiple T5 and Electra models to the model hub.
Want to be featured? A great way to contribute is to check out these good first issues!
HuggingTweets - Train a model to generate tweets
Thanks to Boris Dayma, you can now use huggingtweets to share your own tweet generator with everyone!
Models are trained with Hugging Face and tracked with Weights & Biases along with their datasets.
Share your favorite #huggingtweets with us on Twitter!
Events & Talks
Hugging Face on The AI Podcast
Research engineer Sam Shleifer spoke with The AI Podcast host, Noah Kravitz, about Hugging Face NLP technology. You can listen to the whole episode for free on Spotify.
Hugging Face at Software Freedom Day
Research Engineer Joe Davison will be speaking at Software Freedom Day (Sept. 19-20). Register here to see him and other experts talk about free and open-source software.
Hugging Face at Ray Summit
CSO Thomas Wolf will be speaking at Ray Summit (Sept. 30 - Oct. 1). Learn more about scalable machine learning and Python from Thomas and other industry experts. You can register here.
Have Ideas or Feedback for the Next Issue?
Email newsletter@huggingface.co. We would love your feedback and support!