---
license: mit
language:
- en
- eu
metrics:
- BLEU
- TER
tags:
- text2text-generation
- open-nmt
- pytorch
---

# Itzune v1.9 EN -> EU machine translation Argos model

This model was trained using the [argos-train](https://github.com/argosopentech/argos-train) training scripts on 11,542,706 English-Basque parallel sentences extracted from datasets obtained directly from the [OPUS project](https://opus.nlpl.eu/).
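
Models trained with argos-train are distributed as `.argosmodel` packages and can be used through the [Argos Translate](https://github.com/argosopentech/argos-translate) Python API. A minimal usage sketch follows; the package file name `translate-en_eu-1_9.argosmodel` is an illustrative assumption, not a published artifact name:

```python
import argostranslate.package
import argostranslate.translate

# Install the packaged model (the file name here is an assumption).
argostranslate.package.install_from_path("translate-en_eu-1_9.argosmodel")

# Translate from English ("en") to Basque ("eu") with the installed pair.
print(argostranslate.translate.translate("The weather is nice today.", "en", "eu"))
```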

## Model description

- **Developed by:** argostranslate
- **Model type:** translation
- **Model version:** v1.9
- **Source Language:** English
- **Target Language:** Basque
- **License:** MIT

## Training Data

The English-Basque parallel sentences were collected from the following datasets; the listed counts are before cleaning (a sketch of a typical cleaning pass follows the table):

| Dataset              | Sentences before cleaning |
|----------------------|--------------------------:|
| CCMatrix v1          |                 7,788,871 |
| OpenSubtitles v2018  |                   805,780 |
| XLEnt v1.2           |                   800,631 |
| GNOME v1             |                   652,298 |
| HPLT v1.1            |                   610,694 |
| EhuHac v1            |                   585,210 |
| WikiMatrix v1        |                   119,480 |
| KDE4 v2              |                   100,160 |
| wikimedia v20230407  |                    60,990 |
| bible-uedin v1       |                    15,893 |
| Tatoeba v2023-04-12  |                     2,070 |
| Wiktionary           |                       629 |
| **Total**            |            **11,542,706** |
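
The cleaning procedure itself is not documented on this card, so the following is only a minimal sketch of the kind of filtering commonly applied to OPUS-style parallel corpora (deduplication plus empty-side, length, and length-ratio checks). All thresholds are illustrative assumptions:

```python
def clean_parallel(pairs, max_tokens=200, max_ratio=3.0):
    """Dedupe and filter (english, basque) sentence pairs.

    Thresholds are illustrative assumptions, not the recipe
    actually used for this model.
    """
    seen = set()
    kept = []
    for en, eu in pairs:
        en, eu = en.strip(), eu.strip()
        if not en or not eu:
            continue  # drop pairs with an empty side
        if len(en.split()) > max_tokens or len(eu.split()) > max_tokens:
            continue  # drop overly long sentences
        longer, shorter = max(len(en), len(eu)), min(len(en), len(eu))
        if longer / max(shorter, 1) > max_ratio:
            continue  # drop pairs with an implausible length ratio
        if (en, eu) in seen:
            continue  # drop exact duplicates
        seen.add((en, eu))
        kept.append((en, eu))
    return kept


# Exact duplicates collapse to a single pair:
print(clean_parallel([("Hello.", "Kaixo."), ("Hello.", "Kaixo.")]))
```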

### Evaluation results

Below are the evaluation results for English-to-Basque translation, compared with [Google Translate](https://translate.google.com/), [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [mt-hitz-en-eu](https://huggingface.co/HiTZ/mt-hitz-en-eu):

#### BLEU scores (higher is better)

| Test set           | Google Translate | NLLB 3.3B | mt-hitz-en-eu | Itzune 1.9 |
|--------------------|------------------|-----------|---------------|------------|
| Flores 200 devtest | **20.5**         | 13.3      | 19.2          | 17.0       |
| TaCON              | **12.1**         | 9.4       | 8.8           | -          |
| NTREX              | **15.7**         | 8.0       | 14.5          | -          |
| Average            | **16.1**         | 10.2      | 14.2          | -          |

#### TER scores (lower is better)

| Test set           | Google Translate | NLLB 3.3B | mt-hitz-en-eu | Itzune 1.9 |
|--------------------|------------------|-----------|---------------|------------|
| Flores 200 devtest | **59.5**         | 70.4      | 65.0          | 70.1       |
| TaCON              | **69.5**         | 75.3      | 76.8          | -          |
| NTREX              | **65.8**         | 81.6      | 66.7          | -          |
| Average            | **64.9**         | 75.8      | 68.2          | -          |
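
The tooling used to produce these scores is not stated on this card. The snippet below is a minimal sketch of how BLEU and TER can be computed with [sacreBLEU](https://github.com/mjpost/sacrebleu); the choice of sacreBLEU is an assumption, and the `hypotheses`/`references` lists are placeholders:

```python
from sacrebleu.metrics import BLEU, TER

# Placeholder data: system outputs and one reference stream.
hypotheses = ["Gaur eguraldi ona dago."]
references = [["Gaur eguraldi ederra dago."]]  # list of reference streams

bleu = BLEU()
ter = TER()

print(bleu.corpus_score(hypotheses, references))  # BLEU: higher is better
print(ter.corpus_score(hypotheses, references))   # TER: lower is better
```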