<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# NLLB
**DISCLAIMER:** The default behaviour of the tokenizer has recently been fixed (and thus changed)!

The previous version added `[self.eos_token_id, self.cur_lang_code]` at the end of the token sequence for both source and target tokenization. This is wrong, as the NLLB paper notes (page 48, 6.1.1. Model Architecture):

*Note that we prefix the source sequence with the source language, as opposed to the target
language as previously done in several works (Arivazhagan et al., 2019; Johnson et al.,
2017). This is primarily because we prioritize optimizing zero-shot performance of our
model on any pair of 200 languages at a minor cost to supervised performance.*
Previous behaviour:

```python
>>> from transformers import NllbTokenizer

>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[13374, 1398, 4260, 4039, 248130, 2, 256047]

>>> # 2: '</s>'
>>> # 256047: 'eng_Latn'
```
New behaviour:

```python
>>> from transformers import NllbTokenizer

>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[256047, 13374, 1398, 4260, 4039, 248130, 2]
```
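To double-check where the special tokens now land, you can map the ids back to tokens with the tokenizer's `convert_ids_to_tokens` method. This short snippet is only an illustration added here, not part of the original example:

```python
>>> from transformers import NllbTokenizer

>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

>>> # With the new behaviour, the source language code ("eng_Latn") should appear
>>> # first and the "</s>" token last.
>>> tokenizer.convert_ids_to_tokens(tokenizer("How was your day?").input_ids)
```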
Enabling the old behaviour can be done as follows:

```python
>>> from transformers import NllbTokenizer

>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
```
For more details, feel free to check the linked [PR](https://github.com/huggingface/transformers/pull/22313) and [Issue](https://github.com/huggingface/transformers/issues/19943).
## Overview of NLLB

The NLLB model was presented in [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by Marta R. Costa-jussà, James Cross, Onur Çelebi,
Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula,
Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews,
Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers,
Safiyyah Saleem, Holger Schwenk, and Jeff Wang.

The abstract of the paper is the following:
*Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today.
However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the
200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by
first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed
at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of
Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training
improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using
a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.*
This implementation contains the dense models available on release.

**The sparse model NLLB-MoE (Mixture of Experts) is now available! More details [here](nllb-moe)**

This model was contributed by [Lysandre](https://huggingface.co/lysandre). The authors' code can be found [here](https://github.com/facebookresearch/fairseq/tree/nllb).
## Generating with NLLB

While generating the target text, set `forced_bos_token_id` to the target language id. The following
example shows how to translate English to French using the *facebook/nllb-200-distilled-600M* model.

Note that we're using the BCP-47 code for French, `fra_Latn`. See [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200)
for the list of all BCP-47 codes in the Flores 200 dataset.
```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

>>> article = "UN Chief says there is no military solution in Syria"
>>> inputs = tokenizer(article, return_tensors="pt")

>>> translated_tokens = model.generate(
...     **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"], max_length=30
... )
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
Le chef de l'ONU dit qu'il n'y a pas de solution militaire en Syrie
```
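The same translation can also be run through the higher-level `pipeline` API, which handles `src_lang`, `tgt_lang`, and the forced BOS token internally. This is a minimal sketch rather than part of the original example; the exact wording of the output may differ slightly from the snippet above:

```python
>>> from transformers import pipeline

>>> # The translation pipeline accepts NLLB language codes via src_lang/tgt_lang
>>> # and sets the forced BOS token for the target language under the hood.
>>> translator = pipeline(
...     "translation",
...     model="facebook/nllb-200-distilled-600M",
...     src_lang="eng_Latn",
...     tgt_lang="fra_Latn",
...     max_length=30,
... )
>>> translator("UN Chief says there is no military solution in Syria")
```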
### Generating from any other language than English

English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language,
you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization.

See the example below for a translation from Romanian to German:
```py
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained(
...     "facebook/nllb-200-distilled-600M", use_auth_token=True, src_lang="ron_Latn"
... )
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", use_auth_token=True)

>>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
>>> inputs = tokenizer(article, return_tensors="pt")

>>> translated_tokens = model.generate(
...     **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30
... )
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
UN-Chef sagt, es gibt keine militärische Lösung in Syrien
```
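Because the target language is chosen only at generation time, a single encoded input can be translated into several languages by changing the code passed to `forced_bos_token_id`. The following loop is an illustrative sketch; the selection of target codes is arbitrary and not part of the original documentation:

```py
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

>>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
>>> inputs = tokenizer(article, return_tensors="pt")

>>> # Arbitrary example targets: German, French, Spanish.
>>> for lang in ["deu_Latn", "fra_Latn", "spa_Latn"]:
...     translated_tokens = model.generate(
...         **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(lang), max_length=30
...     )
...     print(lang, tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```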
## Documentation resources

- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)

## NllbTokenizer

[[autodoc]] NllbTokenizer
    - build_inputs_with_special_tokens

## NllbTokenizerFast
[[autodoc]] NllbTokenizerFast | |