Issue 7 February 9th 2021

News

New Year, New Website!

Our vision for the future of machine learning is one step closer to reality thanks to the 1,000+ researchers & open-source contributors, thousands of companies & the fantastic Hugging Face team! Last month, we announced the launch of the latest version of huggingface.co and we couldn't be more proud.

🔥 Play live with >10 billion parameter models for tasks including translation, NER, zero-shot classification, and more. You can use any of these models instantly in production with our hosted API or join the 500 organizations using our hub to host/share your own models & datasets.

🤯 Also, the Hub is now open to all models, not just transformers!

🤗 Open-source, open-sharing, open-science for the win.

Hugging Face Needs Your Help!

⭐️⭐️⭐️Transformers just passed 40K (4️⃣0️⃣0️⃣0️⃣0️⃣) GitHub stars! ⭐️⭐️⭐️

🤗 Our libraries are all about the community and we need your input to define the direction of the next 40k stars 🌟

Please take 5 minutes for a short survey and help us craft the future of the library.

v1.2 of the 🤗 Datasets library is now available!

🐍 611 datasets you can download in one line of Python
🗣 467 languages covered, 99 with at least 10 datasets
🚀 efficient pre-processing to free you from memory constraints

All the new datasets from the 2020 Datasets sprint are now available in the 🤗Datasets library via pip install! This includes 450 new datasets, bringing the library to more than 600 datasets that are all available to be downloaded and used within a single framework. The result showcases the incredible community that came together for this effort and we want to thank you all again – we could not have done it without you!

As discussed above, our brand new website provides an incredibly convenient way to search through these datasets and filter them by language, task, size, and more.

Stay tuned for our upcoming Datasets 2.0 release 🤗

🚨Transformers is expanding to Speech!🚨

Hugging Face Transformers v4.3.0 is out, and we are excited to welcome Facebook AI's Wav2Vec2 as the first Automatic Speech Recognition model in our library!

🗣 You can now transcribe audio files directly on the 🤗 hub!

Fit More and Train Faster with ZeRO via DeepSpeed and FairScale

🔎Fine-tuning a 3 billion parameter model on a single GPU?

It's now possible in 🤗 Transformers, thanks to DeepSpeed & FairScale integrations!

Shout out to team members Stas Bekman & Sylvain Gugger for the seamless integration & blog post, and huge thanks to the Microsoft and Facebook AI teams for their support!
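For a feel of what enabling ZeRO involves, here is a minimal sketch of a DeepSpeed configuration written out from Python. The file name and the specific values are assumptions chosen for illustration, not tuned recommendations from the blog post.

```python
import json
from pathlib import Path

# Illustrative DeepSpeed settings. The values are assumptions for this sketch.
ds_config = {
    "fp16": {"enabled": True},  # train in mixed precision
    "zero_optimization": {
        "stage": 2,  # ZeRO stage 2: partition optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# Hypothetical file name; any path works as long as the training run can read it.
config_path = Path("ds_config.json")
config_path.write_text(json.dumps(ds_config, indent=2))

print(config_path.read_text())
```

The resulting file can then be handed to the Trainer through its deepspeed argument (for example --deepspeed ds_config.json on the command line) when the script is started with the deepspeed launcher.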

💎 Introducing the GEM benchmark

Models that can classify text are great, but how good are we actually at generating language?

💎 GEM, a living benchmark for Natural Language Generation (NLG), will help answer this question by contrasting models and evaluation methods in several languages.

We're super proud to help set this up along with a fantastic team of collaborators spearheaded by Sebastian Gehrmann!

If you want to contribute and get started, all data is available through 🤗 Datasets:

from datasets import load_dataset
load_dataset("gem", script_version="master")

📓Doc 📝Tutorial

🚀 Encoder-Decoder models are going long-range in 🤗 Transformers!

We just released 🤗 Transformers v4.2.0 with Longformer Encoder-Decoder.

Summarize up to 16K tokens either with the 🤗 Transformers pipeline or via our inference API.

📓We also prepared two notebooks for LED:

1️⃣ How to evaluate LED on arXiv with Datasets showing its state-of-the-art performance

2️⃣ How to finetune LED on 8K tokens on a single GPU with the 🤗 Transformers Seq2SeqTrainer & 🤗 Datasets

Tokenizers v0.10

It’s now easier than ever to train a tokenizer using any sort of in-memory data. This also works with the 611 datasets available in 🤗 Datasets! This newest release includes:

New tools to help you visualize how your tokenizer works (Thanks to https://twitter.com/thetalperry)
Ability to train word-level tokenizers
Many bug fixes and experience improvements
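
As a rough illustration of training from in-memory data, the sketch below trains a word-level tokenizer on a small Python list. The toy corpus and the batch_iterator helper are made up for this example; the same pattern should apply to text columns pulled from 🤗 Datasets.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Toy in-memory corpus (made up for illustration).
corpus = [
    "hugging face builds open source tools",
    "open source tools help the community",
    "the community trains tokenizers",
]


def batch_iterator(batch_size=2):
    """Yield the corpus in small batches of strings (hypothetical helper)."""
    for start in range(0, len(corpus), batch_size):
        yield corpus[start : start + batch_size]


# Word-level model: every whitespace-separated word becomes one token.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["[UNK]"])

# Train directly from the iterator; no files on disk are needed.
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(corpus))

encoding = tokenizer.encode("open source llamas")
print(encoding.tokens)
```

Words that never appeared in the training data, such as "llamas" here, fall back to the unknown token.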

Support for Amazon SageMaker's new data parallelism library

We're excited to announce that in collaboration with Amazon Web Services (AWS), we have added support for Amazon SageMaker's new data parallelism library in our latest release (4.3.0).

When executing a script with Trainer using Amazon SageMaker and enabling SageMaker's data parallelism library, Trainer will automatically use the smdistributed library. All maintained examples have been tested with this functionality.
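
For reference, enabling the data parallelism library happens on the SageMaker side rather than in the training script. The sketch below shows the shape of the distribution setting we understand SageMaker Python SDK estimators to accept; the entry point, instance type, and instance count are assumptions chosen for illustration.

```python
# Assumed shape of the `distribution` argument for SageMaker Python SDK estimators.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# Hypothetical estimator settings; adjust them to your own account and script.
estimator_kwargs = {
    "entry_point": "run_glue.py",        # illustrative choice of example script
    "instance_type": "ml.p3.16xlarge",   # assumed multi-GPU instance type
    "instance_count": 2,                 # assumed number of instances
    "distribution": distribution,
}

for key, value in estimator_kwargs.items():
    print(f"{key}: {value}")
```

With a setting like this passed to the estimator, the script itself needs no changes, since Trainer picks up smdistributed on its own.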

Read the release notes to learn more!

🏅 Transformers Featured by Papers with Code as Most Viewed Library of 2020

Hugging Face was honored to be featured by Papers with Code as one of the Trending Libraries for 2020!

Check out the article featuring 🤗 Transformers and other great libraries including PyTorch Image Models, Detectron2, InsightFace, and more!

🚀 Model Hub Highlights 🚀

ARBERT and MARBERT

Introducing ARBERT and MARBERT, two powerful transformer-based language models for Arabic courtesy of The University of British Columbia Deep Learning & NLP Lab!

📝 Paper

🦄Pegasus for Financial Summarization

This new PEGASUS-based model for financial summarization was trained on a novel financial dataset of 2,000 financial and economic articles from the Bloomberg LP website, covering categories such as stock, markets, currencies, rate and cryptocurrencies.

📝 Paper

🔊Our Inference Widget now supports audio models!! (in beta)

Try it yourself:

Community

We're Hiring!

Looking for a new role or know someone who is? Hugging Face is hiring! We have new roles open for:

🤗🤓📚 Reading Group

We debuted our new Hugging Face Reads series with an inaugural post on current methods in sparsity and pruning.

We are changing our reading group format to focus on a theme each month instead of a paper every week. For each iteration, we will be choosing 4 relevant papers, having an internal discussion, and publishing a blog post comprising an introduction to the topic, overview of the papers, an outline of the common trends, and some follow-up questions we’re interested in. We’ll also let you know how the topic relates to our Open Source and Research efforts at Hugging Face!

You can find the first one on Sparsity and Pruning. And stay tuned for a new HFR blog post on Long Range Dependencies in transformer models this month!

🤗Transformers passed its 10,000th issue/PR!

📈🔥 10,000 GitHub issues and pull requests in what, two years? It's so exciting to see the community so invested in this project and to see the power of the open source and Hugging Face AI community!

🔥Top Contributors 🔥

In every issue of our newsletter, we'll be highlighting some top contributors to the Hugging Face library!

This week's top contributors:

Niels Rogge - Led the integration of TAPAS and improved LayoutLM.
Amog Kamsetty - Added distributed retriever implementation built on Ray including a blog post.
KaiTao Song - Added MPNet.
Stefan Schweter - Added Bort.
Ayush Jain - Added Diverse Beam Search.
Daniel Stancl - Added head masking functionality to multiple models.
Ratthachat Chatpatanasiri - Fixed multiple issues related to RAG.
Yusuke Mori - Refactored the “reorder_cache” functionality for all decoder-only models.
Guillaume Becquin - Did multiple fixes for ProphetNet.

Want to be featured? A great way to contribute is to check out these good first issues!

🔥 Hot Hugging Face Forum Topics 🔥

Tutorials

Faster TensorFlow Models in Hugging Face Transformers

❓ Ever wanted to deploy the fast TensorFlow models from Hugging Face Transformers in TensorFlow Serving?

The TF2 version of BERT in 🤗 Transformers is 2x faster than the original implementation 🤯 ...AND it's faster than the official Google version 🚀

Thanks to ML Engineer Julien Plu for this detailed blog post!

📝 Tutorial

📓 Notebook

How we sped up transformer inference 100x for 🤗 API customers

In this blog post by ML Engineer Nicolas Patry, learn how the Hugging Face team built this 100x performance gain into our Accelerated Inference API. 🚀

With our Plans for Organizations, companies upload and share models privately within their organization, and integrate them easily within their applications, supporting millions of requests a day with best-in-class inference time.

Want to feel the speed? Start your 7-day free trial!

🔎 Get a better understanding of your generative models with 🤗 Transformers!

With just a few lines of code, you can now output generation scores, attention weights, and hidden states.

If you run the code snippet by Simon Brandeis on Google Colab, you'll need to install sentencepiece beforehand for it to work.

Simply add this cell before the snippet: !pip install sentencepiece
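
To give a feel for the feature, here is a minimal sketch that asks generate for per-step scores. This is not the snippet from the Colab: it assumes a tiny, randomly initialized GPT-2 so it runs without downloading any weights, and the config sizes are arbitrary.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)

# Tiny random model (an assumption for this sketch, not a pretrained checkpoint).
config = GPT2Config(
    vocab_size=100,
    n_positions=32,
    n_embd=32,
    n_layer=2,
    n_head=2,
    bos_token_id=None,
    eos_token_id=None,  # no end-of-sequence token, so generation runs to max_length
)
model = GPT2LMHeadModel(config).eval()

input_ids = torch.tensor([[1, 2, 3]])
outputs = model.generate(
    input_ids,
    attention_mask=torch.ones_like(input_ids),
    max_length=8,                  # 3 prompt tokens + 5 generated tokens
    do_sample=False,               # greedy decoding
    pad_token_id=0,
    return_dict_in_generate=True,  # return a structured output instead of a tensor
    output_scores=True,            # keep the scores for every generation step
)

print(outputs.sequences.shape)                       # prompt plus generated tokens
print(len(outputs.scores), outputs.scores[0].shape)  # one score tensor per step
```

Attention weights and hidden states can be requested in the same call with output_attentions=True and output_hidden_states=True.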

Optimize Hugging Face Models with Weights & Biases

A new tutorial by 🤖 Boris Dayma shows you how to optimize Hugging Face models with Weights & Biases - no extra line of code required 🥳

This tutorial contains a few little-known tips, such as auto-logging of models, gradients, parameter histograms, and more!
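
As a small sketch of how that auto-logging can be switched on, the Weights & Biases integration in Transformers reads a few environment variables. The project name below is hypothetical, and the exact set of accepted values may differ between library versions.

```python
import os

# Set these before the Trainer is created so the integration picks them up.
os.environ["WANDB_PROJECT"] = "my-demo-project"  # hypothetical project name
os.environ["WANDB_LOG_MODEL"] = "true"           # also upload the trained model
os.environ["WANDB_WATCH"] = "all"                # log gradients and parameter histograms

for name in ("WANDB_PROJECT", "WANDB_LOG_MODEL", "WANDB_WATCH"):
    print(name, "=", os.environ[name])
```

In a notebook, the same variables can be set in a cell at the top, leaving the training code itself untouched.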