license: apache-2.0
language:
- en
- eu
metrics:
- BLEU
- TER
HiTZ Center’s English-Basque machine translation model
Model description
This model was trained from scratch using Marian NMT on a combination of English-Basque datasets totalling 20,523,431 sentence pairs. Of these, 9,033,998 sentence pairs were parallel data collected from the web, while the remaining 11,489,433 were synthetic parallel data created with Google Translate. The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.
- Developed by: HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
- Model type: translation
- Source Language: English
- Target Language: Basque
- License: apache-2.0
Intended uses and limitations
You can use this model for machine translation from English to Basque.
At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import AutoModelForSeq2SeqLM, MarianTokenizer
# Load the tokenizer and the model from the Hugging Face Hub
model_name = "HiTZ/mt-hitz-en-eu"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Tokenize a batch of English sentences and generate the Basque translations
src_text = ["this is a test"]
batch = tokenizer(src_text, return_tensors="pt", padding=True)
translated = model.generate(**batch)
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
Training Details
Training Data
The English-Basque data collected from the web was a combination of the following datasets:
Dataset | Sentences before cleaning |
---|---|
CCMatrix v1 | 7,788,871 |
EhuHac | 585,210 |
Ehuskaratuak | 482,259 |
Elhuyar | 1,176,529 |
HPLT | 4,546,563 |
OpenSubtitles | 805,780 |
PaCO_2012 | 109,524 |
PaCO_2013 | 48,892 |
WikiMatrix | 119,480 |
Total | 15,653,108 |
The 11,489,433 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EU parallel corpora into English using the ES-EN translator from Google Translate.
Training Procedure
Preprocessing
After concatenation, all datasets are cleaned and deduplicated using the bifixer and bicleaner tools (Ramírez-Sánchez et al., 2020). Any sentence pair with a classification score below 0.5 is removed. The filtered corpus is composed of 9,033,998 parallel sentences.
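The thresholding and deduplication step can be illustrated with a minimal sketch. This is not the bifixer or bicleaner implementation: it assumes the classifier scores are already available as `(source, target, score)` tuples, and the names `filter_pairs` and `ScoredPair` are hypothetical. Deduplication here is exact-match only, which is simpler than what the real tools do.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record type: (English sentence, Basque sentence, classifier score)
ScoredPair = Tuple[str, str, float]


def filter_pairs(pairs: Iterable[ScoredPair],
                 threshold: float = 0.5) -> Iterator[Tuple[str, str]]:
    """Drop pairs scoring below `threshold` and skip exact duplicates."""
    seen = set()
    for src, tgt, score in pairs:
        if score < threshold:
            continue  # score below 0.5: the pair is removed
        key = (src, tgt)
        if key in seen:
            continue  # exact duplicate of a pair that was already kept
        seen.add(key)
        yield src, tgt
```

Calling `list(filter_pairs(scored_pairs))` on an iterable of scored pairs returns the surviving `(source, target)` pairs in their original order.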
Tokenization
All data is tokenized using sentencepiece, with a 32,000-token sentencepiece model learned from the combination of all filtered training data. The sentencepiece model is included.
Evaluation
Variables and metrics
We use the BLEU and TER scores to evaluate the model on three test sets: Flores-200, TaCon and NTREX.
Evaluation results
Below are the evaluation results on the machine translation from English to Basque compared to Google Translate and NLLB 200 3.3B:
#### BLEU scores
Test set | Google Translate | NLLB 200 3.3B | mt-hitz-en-eu |
---|---|---|---|
Flores 200 devtest | 20.5 | 13.3 | 19.2 |
TaCON | 12.1 | 9.4 | 8.8 |
NTREX | 15.7 | 8.0 | 14.5 |
Average | 16.1 | 10.2 | 14.2 |
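The Average row in the BLEU table is consistent with an unweighted mean over the three test sets, rounded to one decimal. The following sketch reproduces it from the per-test-set scores above. The assumption that the authors averaged this way, and the helper name `macro_average`, are ours.

```python
from typing import Dict, List

# BLEU scores in the order: Flores 200 devtest, TaCON, NTREX
bleu_scores: Dict[str, List[float]] = {
    "Google Translate": [20.5, 12.1, 15.7],
    "NLLB 200 3.3B": [13.3, 9.4, 8.0],
    "mt-hitz-en-eu": [19.2, 8.8, 14.5],
}


def macro_average(scores: List[float]) -> float:
    """Unweighted mean over test sets, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


averages = {system: macro_average(s) for system, s in bleu_scores.items()}
print(averages)
```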
#### TER scores
Test set | Google Translate | NLLB 200 3.3B | mt-hitz-en-eu |
---|---|---|---|
Flores 200 devtest | 59.5 | 70.4 | 65.0 |
TaCON | 69.5 | 75.3 | 76.8 |
NTREX | 65.8 | 81.6 | 66.7 |
Average | 64.9 | 75.8 | 68.2 |
Additional information
Author
HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
Contact information
For further information, send an email to [email protected]
Licensing information
This work is licensed under the Apache License, Version 2.0.
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (funded by the EU, NextGenerationEU) within the framework of the ILENIA project, with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334.