LLaMAX committed 5378f8f (verified) · Parent: 66cfc84

Update README.md

Files changed (1): README.md (+91 −87)
---
tags:
- Multilingual
---

### Model Sources

- **Paper**: LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
- **Link**: https://arxiv.org/pdf/2407.05975
- **Repository**: https://github.com/CONE-MT/LLaMAX/

### Model Description

LLaMAX is a language model with powerful multilingual capabilities, achieved without loss of instruction-following capability.

We collected extensive training data in 102 languages for continued pre-training of Llama2, and leveraged the English instruction fine-tuning dataset Alpaca to fine-tune its instruction-following capabilities.

### 🔥 Effortless Multilingual Translation with a Simple Prompt

LLaMAX supports translation between more than 100 languages, surpassing the performance of similarly scaled LLMs.

```python
def Prompt_template(query, src_language, trg_language):
    instruction = f'Translate the following sentences from {src_language} to {trg_language}.'
    prompt = (
        'Below is an instruction that describes a task, paired with an input that provides further context. '
        'Write a response that appropriately completes the request.\n'
        f'### Instruction:\n{instruction}\n'
        f'### Input:\n{query}\n### Response:'
    )
    return prompt
```

Then run the following code to perform the translation:

```python
from transformers import AutoTokenizer, LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

query = "你好,今天是个好日子"
prompt = Prompt_template(query, 'Chinese', 'English')
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 30 new tokens and decode only the generated part, not the prompt.
generate_ids = model.generate(inputs.input_ids, max_new_tokens=30)
tokenizer.batch_decode(generate_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# => "Hello, today is a good day"
```
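
Because the prompt template only parametrizes the language names, the same setup extends to any pair from the Supported Languages list below. A minimal sketch, not from the model card; the target languages here are an arbitrary choice:

```python
# Hypothetical usage sketch: reuse Prompt_template to translate one query
# into several target languages. Language names follow the Supported
# Languages list; the particular targets are an arbitrary choice.
for trg_language in ['German', 'French', 'Japanese']:
    prompt = Prompt_template(query, 'Chinese', trg_language)
    inputs = tokenizer(prompt, return_tensors="pt")
    generate_ids = model.generate(inputs.input_ids, max_new_tokens=64)
    translation = tokenizer.batch_decode(
        generate_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0]
    print(f'{trg_language}: {translation}')
```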

### 🔥 Excellent Translation Performance

LLaMAX3-8B-Alpaca achieves an average spBLEU score improvement of over **5 points** compared to the LLaMA3-8B-Alpaca model on the Flores-101 dataset.

| System | Size | en-X (COMET) | en-X (BLEU) | zh-X (COMET) | zh-X (BLEU) | de-X (COMET) | de-X (BLEU) | ne-X (COMET) | ne-X (BLEU) | ar-X (COMET) | ar-X (BLEU) | az-X (COMET) | az-X (BLEU) | ceb-X (COMET) | ceb-X (BLEU) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA3-8B-Alpaca | 8B | 67.97 | 17.23 | 64.65 | 10.14 | 64.67 | 13.62 | 62.95 | 7.96 | 63.45 | 11.27 | 60.61 | 6.98 | 55.26 | 8.52 |
| LLaMAX3-8B-Alpaca | 8B | 75.52 | 22.77 | 73.16 | 14.43 | 73.47 | 18.95 | 75.13 | 15.32 | 72.29 | 16.42 | 72.06 | 12.41 | 68.88 | 15.85 |

| System | Size | X-en (COMET) | X-en (BLEU) | X-zh (COMET) | X-zh (BLEU) | X-de (COMET) | X-de (BLEU) | X-ne (COMET) | X-ne (BLEU) | X-ar (COMET) | X-ar (BLEU) | X-az (COMET) | X-az (BLEU) | X-ceb (COMET) | X-ceb (BLEU) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA3-8B-Alpaca | 8B | 77.43 | 26.55 | 73.56 | 13.17 | 71.59 | 16.82 | 46.56 | 3.83 | 66.49 | 10.20 | 58.30 | 4.81 | 52.68 | 4.18 |
| LLaMAX3-8B-Alpaca | 8B | 81.28 | 31.85 | 78.34 | 16.46 | 76.23 | 20.64 | 65.83 | 14.16 | 75.84 | 15.45 | 70.61 | 9.32 | 63.35 | 12.66 |
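
For context, spBLEU is BLEU computed over the Flores SentencePiece tokenization, which sacrebleu exposes as the `flores101` tokenizer. A minimal sketch of how such a score can be computed; the hypothesis and reference strings are placeholders, not data from the tables above:

```python
# Hypothetical evaluation sketch: spBLEU via sacrebleu's Flores-101
# SentencePiece tokenizer. Replace the placeholder lists with real
# model outputs and gold translations.
import sacrebleu

hypotheses = ["Hello, today is a good day"]      # model outputs, one per sentence
references = [["Hello, today is a good day."]]   # one reference stream
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")
print(score.score)
```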

### Supported Languages

Afrikaans (af), Amharic (am), Arabic (ar), Armenian (hy), Assamese (as), Asturian (ast), Azerbaijani (az), Belarusian (be), Bengali (bn), Bosnian (bs), Bulgarian (bg), Burmese (my), Catalan (ca), Cebuano (ceb), Chinese Simpl (zho), Chinese Trad (zho), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Filipino (tl), Finnish (fi), French (fr), Fulah (ff), Galician (gl), Ganda (lg), Georgian (ka), German (de), Greek (el), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Igbo (ig), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Javanese (jv), Kabuverdianu (kea), Kamba (kam), Kannada (kn), Kazakh (kk), Khmer (km), Korean (ko), Kyrgyz (ky), Lao (lo), Latvian (lv), Lingala (ln), Lithuanian (lt), Luo (luo), Luxembourgish (lb), Macedonian (mk), Malay (ms), Malayalam (ml), Maltese (mt), Maori (mi), Marathi (mr), Mongolian (mn), Nepali (ne), Northern Sotho (ns), Norwegian (no), Nyanja (ny), Occitan (oc), Oriya (or), Oromo (om), Pashto (ps), Persian (fa), Polish (pl), Portuguese (pt), Punjabi (pa), Romanian (ro), Russian (ru), Serbian (sr), Shona (sn), Sindhi (sd), Slovak (sk), Slovenian (sl), Somali (so), Sorani Kurdish (ku), Spanish (es), Swahili (sw), Swedish (sv), Tajik (tg), Tamil (ta), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Umbundu (umb), Urdu (ur), Uzbek (uz), Vietnamese (vi), Welsh (cy), Wolof (wo), Xhosa (xh), Yoruba (yo), Zulu (zu)

### Model Index

We provide multiple versions of the LLaMAX model; the model links are as follows:

| Model | LLaMAX | LLaMAX-Alpaca |
|---------|--------------------------------------------------|----------------------------------------------------------|
| Llama-2 | [Link](https://huggingface.co/LLaMAX/LLaMAX2-7B) | [Link](https://huggingface.co/LLaMAX/LLaMAX2-7B-Alpaca) |
| Llama-3 | [Link](https://huggingface.co/LLaMAX/LLaMAX3-8B) | [Link](https://huggingface.co/LLaMAX/LLaMAX3-8B-Alpaca) |
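
These checkpoints load directly from the Hub. A minimal sketch, assuming the repository id from the table above in place of the `PATH_TO_CONVERTED_*` placeholders used earlier:

```python
# Hypothetical loading sketch: pull the Llama-2-based instruction-tuned
# checkpoint listed in the Model Index straight from the Hugging Face Hub.
from transformers import AutoTokenizer, LlamaForCausalLM

model_id = "LLaMAX/LLaMAX2-7B-Alpaca"
model = LlamaForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```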

### Citation

If our model helps your work, please cite this paper:

```bibtex
@misc{lu2024llamaxscalinglinguistichorizons,
      title={LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages},
      author={Yinquan Lu and Wenhao Zhu and Lei Li and Yu Qiao and Fei Yuan},
      year={2024},
      eprint={2407.05975},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.05975},
}
```