Our vision for the future of machine learning is one step closer to reality thanks to the 1,000+ researchers & open-source contributors, thousands of companies & the fantastic Hugging Face team! Last month, we announced the launch of the latest version of huggingface.co and we couldn't be more proud.
🔥 Play live with models of more than 10 billion parameters for tasks including translation, NER, zero-shot classification, and more. You can use any of these models instantly in production with our hosted API, or join the 500 organizations using our Hub to host and share your own models & datasets.
🤯 Also, the Hub is now open to all models, not just transformers!
🤗 Open-source, open-sharing, open-science for the win.
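For example, here is a minimal sketch of calling the hosted Inference API from Python for zero-shot classification (the model id and token below are placeholders; the endpoint follows the api-inference.huggingface.co/models/<model_id> pattern):

import requests

API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-mnli"  # placeholder model id
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}  # your personal API token

# Zero-shot classification: send the text plus the candidate labels
payload = {
    "inputs": "Hugging Face just released a new version of its website!",
    "parameters": {"candidate_labels": ["machine learning", "sports", "cooking"]},
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())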
⭐️⭐️⭐️ Transformers just passed 40K (4️⃣0️⃣0️⃣0️⃣0️⃣) GitHub stars! ⭐️⭐️⭐️
🤗 Our libraries are all about the community and we need your input to define the direction of the next 40k stars 🌟
Please take 5 minutes for a short survey and help us craft the future of the library.
🐍 611 datasets you can download in one line of Python
🗣 467 languages covered, 99 with at least 10 datasets
🚀 efficient pre-processing to free you from memory constraints
All the new datasets from the 2020 Datasets sprint are now available in the 🤗 Datasets library via pip install! This includes 450 new datasets, bringing the library to more than 600 datasets that are all available to be downloaded and used within a single framework. The result showcases the incredible community that came together for this effort and we want to thank you all again – we could not have done it without you!
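As a quick sketch of what that looks like in practice (using SQuAD here; any of the 611 dataset identifiers works the same way):

from datasets import load_dataset

# One line to download and cache a dataset; it is backed by Apache Arrow and
# memory-mapped from disk rather than loaded fully into RAM
squad = load_dataset("squad")

# Pre-processing with .map() runs in batches and writes results back to disk,
# which is what keeps large corpora within memory constraints
squad = squad.map(
    lambda batch: {"question_len": [len(q) for q in batch["question"]]},
    batched=True,
)
print(squad["train"][0])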
As discussed above, our brand new website provides an incredibly convenient way to search through these datasets and filter them by language, task, size, and more.
Stay tuned for our upcoming Datasets 2.0 release 🤗
🔎 Fine-tuning a 3 billion parameter model on a single GPU?
It's now possible in 🤗 Transformers, thanks to the DeepSpeed & FairScale integrations!
Shout out to team members Stas Bekman & Sylvain Gugger for the seamless integration & blog post, and huge thanks to the Microsoft and Facebook AI teams for their support!
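As a rough sketch of how the integration is used (the config values here are illustrative only; see the blog post for tested settings), you point the Trainer at a DeepSpeed ZeRO config file and launch the script with the deepspeed launcher:

import json
from transformers import TrainingArguments

# Illustrative DeepSpeed ZeRO stage 2 config with fp16; a real run needs the
# full set of fields described in the DeepSpeed and Transformers docs
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": 1,
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

# TrainingArguments takes the path to the config; the script is then launched
# with e.g. `deepspeed your_training_script.py --deepspeed ds_config.json`
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_config.json",
)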
Models that can classify text are great, but how good are we actually at generating language?
💎 GEM, a living benchmark for Natural Language Generation (NLG), will help answer this question by comparing models and evaluation methods across several languages.
We're super proud to help set this up along with a fantastic team of collaborators spearheaded by Sebastian Gehrmann!
If you want to contribute and get started, all data is available through 🤗 Datasets:
from datasets import load_dataset
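From there, a minimal sketch of loading one task (assuming the GEM data is published on the hub under the "gem" identifier, with per-task configurations such as "common_gen"):

# Pick a GEM task via its configuration name
common_gen = load_dataset("gem", "common_gen")
print(common_gen["train"][0])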
It's now easier than ever to train a tokenizer using any sort of in-memory data. This obviously works with the 611 datasets available in 🤗 Datasets! This newest release includes:
- New tools to help you visualize how your tokenizer works (Thanks to https://twitter.com/thetalperry)
- Ability to train word-level tokenizers
- Many bug fixes and experience improvements
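Putting the first two together, here is a rough sketch of training a word-level tokenizer from in-memory data (the vocabulary size and special tokens are illustrative):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Any iterator over strings works: a plain list, a generator, or a text
# column from a 🤗 Datasets dataset
texts = ["Hello world!", "Training a tokenizer from memory", "No files needed"]

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordLevelTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(texts, trainer=trainer)

print(tokenizer.encode("Hello from memory").tokens)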
We're excited to announce that in collaboration with Amazon Web Services (AWS), we have added support for Amazon SageMaker's new data parallelism library in our latest release (4.3.0).
When executing a script with the Trainer on Amazon SageMaker with SageMaker's data parallelism library enabled, the Trainer will automatically use the smdistributed library. All maintained examples have been tested with this functionality.
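As a sketch of what launching such a job can look like with the SageMaker Python SDK (the entry point, role, and instance choices below are placeholders), the library is enabled through the estimator's distribution argument:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                # your training script using Trainer (placeholder)
    role="YOUR_SAGEMAKER_EXECUTION_ROLE",  # placeholder IAM role
    instance_count=2,
    instance_type="ml.p3.16xlarge",        # an instance type supported by smdistributed
    framework_version="1.6.0",
    py_version="py36",
    # Enable SageMaker's data parallelism library; Trainer picks it up automatically
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()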
Read the release notes to learn more!