---
metrics:
- bleu
- chrf
base_model:
- facebook/nllb-200-distilled-600M
---
# Model Card for TRIBBLE - Translating Iberian Languages Based on Limited E-resources
![TRIBBLE Model](tribble.png)
## Model Description
[TRIBBLE](https://www2.statmt.org/wmt24/pdf/2024.wmt-1.94.pdf) is a machine translation model specifically fine-tuned for **low-resource Iberian languages** as part of the **WMT24 Shared Task**. It translates from **Spanish (spa_Latn)** to **Aragonese (arg_Latn)**, **Asturian (ast_Latn)**, and **Aranese (arn_Latn)**, providing an essential tool for these endangered languages within the Romance language family.
The model builds on **distilled NLLB-200** with **600M parameters**, integrating additional tokens for **Aragonese** and **Aranese** to extend the multilingual translation capabilities of the original NLLB-200 model.
### Model Details
- **Architecture**: Distilled NLLB-200 (600M parameters)
- **Training Data**: Processed subsets of OPUS and PILAR corpora, alongside bilingual and monolingual data sources.
- **Control Tokens**: `arg_Latn` for Aragonese and `arn_Latn` for Aranese, initialized with the `spa_Latn` and `oci_Latn` embeddings, respectively, based on linguistic proximity (see the sketch after this list).
- **Optimization**: Fine-tuned with the Adafactor optimizer and a custom data-processing pipeline.
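As a rough sketch of how the two control tokens can be added on top of the base checkpoint, assuming the standard `transformers` API (the token names `arg_Latn`/`arn_Latn` and the source embeddings `spa_Latn`/`oci_Latn` come from this card; the rest is illustrative, not the paper's exact training code):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Register the two new language control tokens and grow the embedding matrix.
tokenizer.add_tokens(["arg_Latn", "arn_Latn"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Warm-start each new token from a linguistically close language already in NLLB-200.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    # Aragonese <- Spanish
    emb[tokenizer.convert_tokens_to_ids("arg_Latn")] = emb[
        tokenizer.convert_tokens_to_ids("spa_Latn")
    ].clone()
    # Aranese <- Occitan
    emb[tokenizer.convert_tokens_to_ids("arn_Latn")] = emb[
        tokenizer.convert_tokens_to_ids("oci_Latn")
    ].clone()
```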
## Intended Use
This model is intended for **translation tasks** involving low-resource Iberian languages:
- Translating from **Spanish to Aragonese, Asturian, and Aranese**.
- Applications in cultural preservation, language research, and digital inclusion for endangered languages.
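A minimal inference sketch following the standard NLLB-200 recipe, where `forced_bos_token_id` selects the target language; the repository id `igorktech/tribble-600m` is assumed from this model page:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "igorktech/tribble-600m"  # assumed repo id for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("El sol sale por el este.", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the target-language control token: arg_Latn (Aragonese),
    # ast_Latn (Asturian), or arn_Latn (Aranese).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("arg_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```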
## Evaluation
TRIBBLE was evaluated using BLEU and chrF metrics on the WMT24 devtest set:
| Language Direction | Baseline (Apertium) | TRIBBLE (Constrained) |
|----------------------------|---------------------|------------------------|
| Spanish → Aragonese (BLEU) | 61.1 | 49.2 |
| Spanish → Aragonese (chrF) | 79.3 | 73.6 |
| Spanish → Aranese (BLEU) | 28.8 | 23.9 |
| Spanish → Aranese (chrF) | 49.4 | 46.1 |
| Spanish → Asturian (BLEU) | 17.0 | **17.9** |
| Spanish → Asturian (chrF) | 50.8 | 50.5 |
While the **Apertium** baseline generally outperformed TRIBBLE, the model surpassed it on BLEU for **Asturian** (17.9 vs. 17.0) with near-identical chrF. The constrained setting highlights TRIBBLE's potential for low-resource translation with efficient use of limited data.
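Scores of this kind can be reproduced with `sacrebleu`; below is a sketch with toy strings (not the official WMT24 evaluation pipeline):

```python
import sacrebleu

# Toy example: one hypothesis and one reference stream, one segment each.
hypotheses = ["O sol sale por l'este."]
references = [["O sol sale por l'este."]]  # outer list = reference streams

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")
```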
## Citation
If you use TRIBBLE in your work, please cite:
```bibtex
@InProceedings{kuzmin-EtAl:2024:WMT,
author = {Kuzmin, Igor and Przybyła, Piotr and McGill, Euan and Saggion, Horacio},
title = {TRIBBLE - TRanslating IBerian languages Based on Limited E-resources},
booktitle = {Proceedings of the Ninth Conference on Machine Translation},
month = {November},
year = {2024},
address = {Miami, Florida, USA},
publisher = {Association for Computational Linguistics},
pages = {955--959},
abstract = {In this short overview paper, we describe our system submission for the language pairs Spanish to Aragonese (spa-arg), Spanish to Aranese (spa-arn), and Spanish to Asturian (spa-ast). We train a unified model for all language pairs in the constrained scenario. In addition, we add two language control tokens for Aragonese and Aranese Occitan, as there is already one present for Asturian. We take the distilled NLLB-200 model with 600M parameters and extend special tokens with 2 tokens that denote target languages (arn\_Latn, arg\_Latn) because Asturian was already presented in NLLB-200 model. We adapt the model by training on a special regime of data augmentation with both monolingual and bilingual training data for the language pairs in this challenge.},
url = {https://www2.statmt.org/wmt24/pdf/2024.wmt-1.94.pdf}
}
```