News
🤗 Welcome to the Hugging Face Newsletter! 🤗
Every few weeks, we'll be updating you on the latest happenings at Hugging Face. Make sure to subscribe and share with all NLP lovers to get the latest updates on releases, readings, research, and more!
Have an idea for the newsletter? Email newsletter@huggingface.co
🚀 Model Hub Highlights 🚀
Open-Source Machine Translation
Did you know that you can translate between many languages with open-source 🤗 Transformers and great models from Helsinki-NLP? Say goodbye to Google Translate and have a taste of open-source!
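To give a flavor, here is a minimal sketch of translation with one of the Helsinki-NLP Marian checkpoints; the checkpoint name "Helsinki-NLP/opus-mt-en-de" (English to German) is just one example among the many available language pairs:

```python
# Minimal sketch: open-source translation with a Helsinki-NLP Marian checkpoint.
# "Helsinki-NLP/opus-mt-en-de" (English -> German) is one of many language pairs on the Model Hub.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["Hugging Face is based in New York City."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```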
Easy Sentence Embedding
Multiple state-of-the-art sentence embedding models are now available in the 🤗 Model Hub, including Sentence Transformers from UKPLab and DeCLUTR from U of T. You can use these models to generate rich representations for sentences without further fine-tuning!
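As a rough sketch of how you might pull embeddings out of one of these checkpoints with plain 🤗 Transformers, here is mean pooling over a Sentence Transformers model; the checkpoint name "sentence-transformers/bert-base-nli-mean-tokens" and the pooling step are illustrative assumptions:

```python
# Sketch: sentence embeddings via mean pooling over token embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/bert-base-nli-mean-tokens"  # one of the UKPLab checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["Open-source NLP is fun.", "Transformers make it easy."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs)[0]  # (batch, seq_len, hidden)

# Mean-pool over tokens, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embeddings.shape)
```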
BERT for Code
Recently, BERT learned programming after hours! CodeBERT (bi-modal/MLM) from Microsoft and CodeBERTa from Hugging Face both shed light on the intersection of natural language and programming languages.
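For instance, here is a hedged sketch of masked-token prediction on code with the CodeBERTa checkpoint (the model id "huggingface/CodeBERTa-small-v1" is assumed here):

```python
# Sketch: ask CodeBERTa to fill in a masked token in a Python snippet.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="huggingface/CodeBERTa-small-v1")
print(fill_mask("def add(a, b): return a <mask> b"))
```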
🤗 Datasets and Metrics Library Heading Toward First Non-Beta Release
Led by ML Intern Quentin Lhoest (@qlhoest) and CSO Thomas Wolf (@Thom_Wolf), the 🤗 team has been hard at work on our newest library focusing on datasets and metrics.
Features:
- One-line access to 150+ datasets and metrics, and it's very easy to add new datasets/metrics to the hub (see the sketch at the end of this section)
- Loading a 17GB+ dataset like English Wikipedia takes only 9MB of RAM, and you can iterate over the data at 2-3 Gbit/s
- Blazing-fast and reproducible data processing
- Deep integration with numpy/pandas/pytorch/tensorflow
New this summer:
- Brand new documentation
- More tutorials to showcase simplicity of use
Short roadmap for the 1.0.0 release:
- Deep integration and focus on knowledge-based models such as RAG/REALM/ORQA/MARGE/knn-LM using indexed datasets
- Additional speed improvements (multiprocessing, instant shuffling)
- Support for multi-modal datasets
- A final, community-voted name for the library: "🤗 datasets" (a rename arriving with the 1.0.0 non-beta release)
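As a minimal sketch of the one-line access described above (assuming the post-rename `datasets` package and its `load_dataset`/`load_metric` entry points):

```python
# Sketch: one-line access to a dataset and a metric.
from datasets import load_dataset, load_metric

squad = load_dataset("squad", split="train")  # downloaded once, then memory-mapped from disk
metric = load_metric("squad")

print(len(squad))
print(squad[0]["question"])
```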
Longformer has been ported to TF 2
All Longformer models are now available in TensorFlow 2.0 in addition to PyTorch, thanks to Machine Learning Engineer Patrick (@PatrickPlaten). Longformer – or Long Document Transformer – uses an attention mechanism that scales linearly with sequence length, enabling it to process much longer sequences than is possible with the standard Transformer.
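A rough sketch of loading the ported model in TensorFlow (the `TFLongformerModel` class and the "allenai/longformer-base-4096" checkpoint are assumed here):

```python
# Sketch: running Longformer in TensorFlow 2 on a long input.
from transformers import LongformerTokenizer, TFLongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = TFLongformerModel.from_pretrained("allenai/longformer-base-4096")

# Inputs up to 4096 tokens are fine thanks to the linear attention pattern.
text = "A very long document. " * 500
inputs = tokenizer(text, return_tensors="tf", truncation=True, max_length=4096)
outputs = model(inputs)
```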
Built-in Pipeline for Zero-Shot Text Classification
No Training? No Problem! The 🤗 Transformers master branch now includes an experimental pipeline for zero-shot text classification (slated for the next release), thanks to Research Engineer Joe Davison (@joeddav). This pipeline lets you classify text into a set of provided labels using a pre-trained model, without any fine-tuning.
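A minimal sketch of how the experimental pipeline can be used (the default underlying checkpoint will be downloaded on first use):

```python
# Sketch: zero-shot text classification with candidate labels of your choosing.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "Transformers are taking over natural language processing.",
    candidate_labels=["technology", "sports", "politics"],
)
print(result["labels"][0])  # highest-scoring label
```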
Introducing Benchmarks to 🤗 Transformers
With ever-larger language models, accurately measuring a model's computational cost has become crucial.
To make benchmarking language models as easy as possible, we are very excited to introduce Benchmarks to 🤗 Transformers.
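A hedged sketch of what a benchmark run can look like with the PyTorch utilities (`PyTorchBenchmark`/`PyTorchBenchmarkArguments`; TensorFlow equivalents exist as well):

```python
# Sketch: measure inference speed and memory for a model at a few configurations.
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(
    models=["bert-base-uncased"],
    batch_sizes=[8],
    sequence_lengths=[128, 512],
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()  # prints and returns speed/memory results
```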
🚨New Model Alert: Pegasus 🚨
We're also excited to release 12 new state-of-the-art summarization checkpoints: Pegasus (from Google) is a seq2seq transformer pretrained specifically for summarization, and it achieves SOTA scores on all 12 of the datasets these checkpoints were fine-tuned on.
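A rough sketch of summarizing text with one of the new checkpoints (the model id "google/pegasus-xsum" is assumed as an example of the fine-tuned checkpoints):

```python
# Sketch: abstractive summarization with a Pegasus checkpoint.
from transformers import pipeline

summarizer = pipeline("summarization", model="google/pegasus-xsum")
article = (
    "Pegasus is a seq2seq transformer pretrained with a summarization-specific "
    "objective, and fine-tuned checkpoints are now available for a dozen datasets."
)
print(summarizer(article, max_length=32)[0]["summary_text"])
```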
Updated Transformers Documentation
Research Engineer Sylvain Gugger (@GuggerSylvain) has reorganized the documentation of Transformers into five sections:
- Getting started
- How to use Transformers
- Advanced guides
- Research
- Package reference
New beginner tutorials have been added: a quick tour, a guide on how to preprocess your data, and a training tutorial. For quick comparisons of various models or tokenizers, there is a new model summary and a new tokenizer summary.
All tutorials can now be opened in a Colab notebook with a simple click on the corresponding icon at the top left!
Community
Join the Hugging Face Forum
Created by Research Engineer Sylvain Gugger (@GuggerSylvain), the Hugging Face forum is for anyone looking to share thoughts and ask questions about Hugging Face and NLP in general.
Some of the 🔥 topics covered in the last few weeks:
Check out the Discussion on our Latest #ScienceTuesday
In our latest Science Tuesday discussion, Hugging Face Research Engineer, Sam Shleifer (@sam_shleifer), read Pre-training via Paraphrasing (MARGE) and asked some interesting questions.
Join in on the discussion and suggest interesting papers that we should cover next!
🤗 Transformers Survey Results
We sent out a survey on 🤗 Transformers a few weeks ago and received over 800 detailed feedback responses and more than 50,000 words of open answers 🤯
CSO Thomas Wolf (@Thom_Wolf) read them all and wrote a summary on the Hugging Face discourse forum.
🔥Top Contributors 🔥
Every newsletter, we'll be highlighting some top contributors to the Hugging Face library! This week's top contributors:
- Suraj Patil - Added the MBart model and various new features for Seq2Seq models.
- Guillaume - Added the ConversationalPipeline and improved handling of bad token ids during generation.
- Stas Bekman - Fixed many inconsistencies in the documentation and cleaned up some tests.
- Manuel Romero - Added multiple T5 models to the model hub, notably models for Question Generation.
- Pradhy729 - Added feed-forward chunking and support for IterableDataset in Trainer.
Tutorials
New Tutorial on Fine-Tuning 🤗Transformers with Your Own Data
Want to fine-tune a model on your own custom data but are unsure how to get started? Research Engineer Joe Davison (@joeddav) wrote up a new tutorial on fine-tuning Transformers with your own data. The tutorial walks through downloading a dataset, preprocessing and tokenizing it, and preparing it for training with either TensorFlow or PyTorch. Examples include sequence classification, NER, and question answering.
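As a taste of the workflow the tutorial walks through, here is a condensed sketch of fine-tuning a sequence classifier on your own texts with the Trainer API; the texts, labels, and hyperparameters below are placeholders:

```python
# Sketch: tokenize your own labeled texts and fine-tune a classifier with Trainer.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["I loved this movie!", "This was a waste of time."]  # your own data here
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)

class MyDataset(torch.utils.data.Dataset):
    """Wraps tokenized encodings and labels as a PyTorch dataset."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
training_args = TrainingArguments(output_dir="./results", num_train_epochs=1, per_device_train_batch_size=2)
trainer = Trainer(model=model, args=training_args, train_dataset=MyDataset(encodings, labels))
trainer.train()
```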