---
base_model: facebook/nllb-200-1.3B
model-index:
- name: translate-nllb-1.3b-salt
  results: []
datasets:
- Sunbird/salt
---

# Model details

This machine translation model translates single sentences between any pair of the following languages:

| ISO 639-3 | Language name |
| --- | --- |
| eng | English |
| ach | Acholi |
| lgg | Lugbara |
| lug | Luganda |
| nyn | Runyankole |
| teo | Ateso |

It was trained on the [SALT](http://huggingface.co/datasets/Sunbird/salt) dataset and a variety of
additional external data resources, including back-translated news articles, FLORES-200, MT560 and LAFAND-MT.
The base model was [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B),
with the tokenizer adapted to add support for languages not originally included.
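
As a quick sanity check of the adapted vocabulary, the added language codes can be inspected directly from the tokenizer. This is a minimal sketch; the token IDs are the ones listed in the usage example below.

```python
import transformers

tokenizer = transformers.NllbTokenizer.from_pretrained(
    'Sunbird/translate-nllb-1.3b-salt')

# Map each language-token ID back to its token string to confirm
# which language code it represents.
for token_id in [256047, 256111, 256008, 256110, 256002, 256006]:
    print(token_id, tokenizer.convert_ids_to_tokens(token_id))
```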

# Usage example

```python
import torch
import transformers

tokenizer = transformers.NllbTokenizer.from_pretrained(
    'Sunbird/translate-nllb-1.3b-salt')
model = transformers.M2M100ForConditionalGeneration.from_pretrained(
    'Sunbird/translate-nllb-1.3b-salt')

text = 'Where is the hospital?'
source_language = 'eng'
target_language = 'lug'

# Token IDs of the language codes in the adapted vocabulary.
language_tokens = {
    'eng': 256047,
    'ach': 256111,
    'lgg': 256008,
    'lug': 256110,
    'nyn': 256002,
    'teo': 256006,
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = tokenizer(text, return_tensors="pt").to(device)
# Overwrite the leading language token with the source language ID.
inputs['input_ids'][0][0] = language_tokens[source_language]
translated_tokens = model.to(device).generate(
    **inputs,
    forced_bos_token_id=language_tokens[target_language],
    max_length=100,
    num_beams=5,
)

result = tokenizer.batch_decode(
    translated_tokens, skip_special_tokens=True)[0]
print(result)
# Eddwaliro liri ludda wa?
```
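
For repeated use, the steps above can be wrapped in a small helper. This is a sketch that reuses the `tokenizer`, `model`, `device` and `language_tokens` defined in the example, not part of any official API.

```python
def translate(text, source_language, target_language):
    """Translate a single sentence between any two supported languages."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    inputs['input_ids'][0][0] = language_tokens[source_language]
    translated_tokens = model.to(device).generate(
        **inputs,
        forced_bos_token_id=language_tokens[target_language],
        max_length=100,
        num_beams=5,
    )
    return tokenizer.batch_decode(
        translated_tokens, skip_special_tokens=True)[0]

print(translate('Where is the hospital?', 'eng', 'ach'))
```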

# Evaluation metrics

Results on salt-dev:

| Source language | Target language | BLEU |
| --- | --- | --- |
| ach | eng | 28.371 |
| lgg | eng | 30.45 |
| lug | eng | 41.978 |
| nyn | eng | 32.296 |
| teo | eng | 30.422 |
| eng | ach | 20.972 |
| eng | lgg | 22.362 |
| eng | lug | 30.359 |
| eng | nyn | 15.305 |
| eng | teo | 21.391 |
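
Scores like these can in principle be reproduced with a corpus-level BLEU implementation such as sacrebleu. The sketch below reuses the `translate` helper from above; the dataset configuration and field names are hypothetical (check the SALT dataset card for the actual schema), and this is not the exact script used to produce the table.

```python
import datasets
import sacrebleu

# Hypothetical split and field names; a config name may also be required.
salt_dev = datasets.load_dataset('Sunbird/salt', split='dev')

hypotheses = [translate(row['lug'], 'lug', 'eng') for row in salt_dev]
references = [[row['eng'] for row in salt_dev]]
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```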