base_model: facebook/nllb-200-1.3B
model-index:
  - name: translate-nllb-1.3b-salt
    results: []
datasets:
  - Sunbird/salt

Model details

This machine translation model translates single sentences between any pair of the following languages:

| ISO 639-3 | Language name |
|-----------|---------------|
| eng       | English       |
| ach       | Acholi        |
| lgg       | Lugbara       |
| lug       | Luganda       |
| nyn       | Runyankole    |
| teo       | Ateso         |
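
As a quick illustration of what "any pair" covers, the sketch below enumerates the translation directions implied by the table above. It assumes a direction means an ordered pair of two distinct languages; the names `LANGUAGES` and `DIRECTIONS` are made up for this example and are not part of the model's API.

```python
from itertools import permutations

# ISO 639-3 codes from the language table above.
LANGUAGES = ["eng", "ach", "lgg", "lug", "nyn", "teo"]

# Assumption: a direction is an ordered (source, target) pair of distinct languages.
DIRECTIONS = list(permutations(LANGUAGES, 2))

print(len(DIRECTIONS))  # 6 languages * 5 possible targets each = 30
for source, target in DIRECTIONS[:3]:
    print(f"{source} -> {target}")
```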

It was trained on the SALT dataset and a variety of additional external data resources, including back-translated news articles, FLORES-200, MT560 and LAFAND-MT. The base model was facebook/nllb-200-1.3B, with tokens adapted to add support for languages not originally included.

Usage example

import torch
import transformers

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = transformers.NllbTokenizer.from_pretrained(
    'Sunbird/translate-nllb-1.3b-salt')
model = transformers.M2M100ForConditionalGeneration.from_pretrained(
    'Sunbird/translate-nllb-1.3b-salt')

text = 'Where is the hospital?'
source_language = 'eng'
target_language = 'lug'

language_tokens = {
    'eng': 256047,
    'ach': 256111,
    'lgg': 256008,
    'lug': 256110,
    'nyn': 256002,
    'teo': 256006,
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = tokenizer(text, return_tensors="pt").to(device)
# Overwrite the first input token with the source language token.
inputs['input_ids'][0][0] = language_tokens[source_language]
# Force generation to start with the target language token.
translated_tokens = model.to(device).generate(
    **inputs,
    forced_bos_token_id=language_tokens[target_language],
    max_length=100,
    num_beams=5,
)

result = tokenizer.batch_decode(
    translated_tokens, skip_special_tokens=True)[0]
# Eddwaliro liri ludda wa?
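
Because the example indexes `language_tokens` directly, an unsupported code would surface as a bare `KeyError`. The sketch below shows one possible way to validate both codes up front. `resolve_language_tokens` is a hypothetical helper written for this illustration, not something shipped with the model; the token IDs are copied from the usage example above.

```python
# Token IDs copied from the usage example above.
LANGUAGE_TOKENS = {
    'eng': 256047,
    'ach': 256111,
    'lgg': 256008,
    'lug': 256110,
    'nyn': 256002,
    'teo': 256006,
}


def resolve_language_tokens(source_language, target_language):
    """Return (source_token_id, target_token_id) for two supported codes.

    Hypothetical helper: raises ValueError with the list of supported
    codes instead of letting a KeyError escape.
    """
    for code in (source_language, target_language):
        if code not in LANGUAGE_TOKENS:
            supported = ', '.join(sorted(LANGUAGE_TOKENS))
            raise ValueError(
                f'Unsupported language code {code!r}; expected one of: {supported}')
    return LANGUAGE_TOKENS[source_language], LANGUAGE_TOKENS[target_language]


source_token, target_token = resolve_language_tokens('eng', 'lug')
print(source_token, target_token)
```

The two returned IDs could then stand in for `language_tokens[source_language]` and `language_tokens[target_language]` in the example above.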

Evaluation metrics

Results on salt-dev:

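| Source language | Target language | BLEU   |
|-----------------|-----------------|--------|
| ach             | eng             | 28.371 |
| lgg             | eng             | 30.45  |
| lug             | eng             | 41.978 |
| nyn             | eng             | 32.296 |
| teo             | eng             | 30.422 |
| eng             | ach             | 20.972 |
| eng             | lgg             | 22.362 |
| eng             | lug             | 30.359 |
| eng             | nyn             | 15.305 |
| eng             | teo             | 21.391 |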
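
To compare the two translation directions at a glance, the scores above can be averaged per direction. This is only a summary of the numbers already reported on salt-dev, not an additional evaluation; the unweighted mean and the variable names are choices made for this sketch.

```python
from statistics import mean

# BLEU scores on salt-dev, copied from the results above: (source, target) -> BLEU.
BLEU = {
    ("ach", "eng"): 28.371,
    ("lgg", "eng"): 30.45,
    ("lug", "eng"): 41.978,
    ("nyn", "eng"): 32.296,
    ("teo", "eng"): 30.422,
    ("eng", "ach"): 20.972,
    ("eng", "lgg"): 22.362,
    ("eng", "lug"): 30.359,
    ("eng", "nyn"): 15.305,
    ("eng", "teo"): 21.391,
}

# Unweighted mean per direction (an assumption: each language pair counts equally).
into_english = mean(score for (_, target), score in BLEU.items() if target == "eng")
from_english = mean(score for (source, _), score in BLEU.items() if source == "eng")

print(f"into English: {into_english:.2f}")  # 32.70
print(f"from English: {from_english:.2f}")  # 22.08
```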