Update README.md
README.md
```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="projecte-aina/mt-aina-gl-ca", revision="main")

tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokenized = tokenizer.tokenize("Benvido ao proxecto Ilenia.")

translator = ctranslate2.Translator(model_dir)
translated = translator.translate_batch([tokenized[0]])
print(tokenizer.detokenize(translated[0][0]['tokens']))
```
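For reference, the `detokenize` call reverses SentencePiece subword segmentation. A minimal sketch of the convention (with hypothetical tokens, not this model's actual vocabulary; pyonmttok's real implementation handles more cases):

```python
def sp_detokenize(pieces):
    # SentencePiece marks word boundaries with U+2581 ("▁");
    # joining the pieces and replacing the marker with a space
    # restores the surface text.
    return "".join(pieces).replace("\u2581", " ").strip()

print(sp_detokenize(["\u2581Benvingut", "\u2581al", "\u2581pro", "jecte", "."]))
# → "Benvingut al projecte."
```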
### Variables and metrics
We use the BLEU score for evaluation on the following test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX).
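As a rough illustration of the metric (a simplified sketch, not the scorer used for these results; published BLEU numbers are typically computed with a tool such as sacrebleu, with smoothing and standardized tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams of a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    # Simplified single-reference BLEU: geometric mean of modified
    # n-gram precisions (n = 1..max_n) times a brevity penalty.
    hyp, ref = hypothesis.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_prec_sum += math.log(overlap / total)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(log_prec_sum / max_n)

print(round(bleu("el gat seu a la catifa", "el gat seu a la catifa"), 1))
# → 100.0 (identical sentences score the maximum)
```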
### Evaluation results
Below are the evaluation results for machine translation from Galician to Catalan, compared to [Google Translate](https://translate.google.com/), [M2M100 1.2B](https://huggingface.co/facebook/m2m100_1.2B), [NLLB-200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):

| Test set           | Google Translate | M2M100 1.2B | NLLB-200 1.3B | NLLB-200 3.3B | mt-aina-gl-ca |
|--------------------|------------------|-------------|---------------|---------------|---------------|
| Flores 101 devtest | **36,4**         | 32,6        | 22,3          | 34,3          | 32,4          |
| TaCon              | 48,4             | 56,5        | 32,2          | 54,1          | **58,2**      |
| NTREX              | **34,7**         | 34,0        | 20,4          | 34,2          | 33,7          |
| Average            | 39,0             | 41,0        | 25,0          | 40,9          | **41,4**      |
## Additional information

### Author
Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.

### Contact information
For further information, send an email to <langtech@bsc.es>.

### Copyright
Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023).

### Licensing information
This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding
This work was funded by SEDIA within the framework of the ILENIA project.

### Disclaimer
<details>
<summary>Click to expand</summary>