tribble-600m / README.md
metadata
metrics:
  - bleu
  - chrf
base_model:
  - facebook/nllb-200-distilled-600M

Model Card for TRIBBLE - Translating Iberian Languages Based on Limited E-resources


Model Description

TRIBBLE is a machine translation model fine-tuned for low-resource Iberian languages as part of the WMT24 Shared Task. It translates from Spanish (spa_Latn) into Aragonese (arg_Latn), Asturian (ast_Latn), and Aranese (arn_Latn), three endangered Romance languages.

The model builds on the distilled NLLB-200 model with 600M parameters. It adds language tokens for Aragonese and Aranese, which the original NLLB-200 model lacks; Asturian already has one.

Model Details

  • Architecture: Distilled NLLB-200 (600M parameters)
  • Training Data: Processed subsets of OPUS and PILAR corpora, alongside bilingual and monolingual data sources.
  • Control Tokens: arg_Latn for Aragonese and arn_Latn for Aranese, initialized with spa_Latn and oci_Latn embeddings, respectively, based on linguistic proximity.
  • Optimization: Fine-tuned with the Adafactor optimizer and a custom data processing pipeline.
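
The control-token initialization listed above can be illustrated with a toy sketch. This is not the training code: the three-entry vocabulary, the 3-dimensional vectors, and the helper name add_language_token are all hypothetical stand-ins for the real NLLB-200 tokenizer and embedding matrix. Only the pairings (arg_Latn from spa_Latn, arn_Latn from oci_Latn) come from this card.

```python
# Toy illustration only: a tiny vocabulary and 3-dimensional vectors stand in
# for the real NLLB-200 tokenizer and embedding matrix (both are assumptions).

def add_language_token(vocab, embeddings, new_token, donor_token):
    """Append new_token to vocab and give it a copy of donor_token's embedding."""
    if new_token in vocab:
        raise ValueError(f"{new_token} is already in the vocabulary")
    donor_row = embeddings[vocab[donor_token]]
    vocab[new_token] = len(embeddings)
    # Copy the row so that later updates to the new token do not alias the donor.
    embeddings.append(list(donor_row))
    return vocab[new_token]


vocab = {"spa_Latn": 0, "oci_Latn": 1, "ast_Latn": 2}
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

# Pairings from the model card: Aragonese <- Spanish, Aranese <- Occitan.
arg_id = add_language_token(vocab, embeddings, "arg_Latn", "spa_Latn")
arn_id = add_language_token(vocab, embeddings, "arn_Latn", "oci_Latn")
```

Starting each new token from a closely related language's embedding gives fine-tuning a meaningful starting point, which is the motivation the card gives ("based on linguistic proximity").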

Intended Use

This model is intended for translation tasks involving low-resource Iberian languages:

  • Translating from Spanish to Aragonese, Asturian, and Aranese.
  • Applications in cultural preservation, language research, and digital inclusion for endangered languages.
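
A minimal usage sketch with the Hugging Face transformers library follows. It rests on several assumptions that this card does not confirm: that the checkpoint is published under the repository ID igorktech/tribble-600m, that it loads through the standard NLLB classes, and that the published tokenizer already contains the arg_Latn and arn_Latn tokens. The helper name translate and the constant names are hypothetical.

```python
# Assumed repository ID; adjust it if the checkpoint is hosted elsewhere.
MODEL_ID = "igorktech/tribble-600m"

# Target language codes named in this model card.
TARGET_LANGS = {
    "arg_Latn": "Aragonese",
    "ast_Latn": "Asturian",
    "arn_Latn": "Aranese",
}


def translate(text, target_lang, model_id=MODEL_ID, max_new_tokens=128):
    """Translate Spanish text into one of the supported target languages."""
    if target_lang not in TARGET_LANGS:
        raise ValueError(f"Unsupported target language: {target_lang}")

    # Imported lazily so the constants above can be used without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="spa_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(text, return_tensors="pt")
    # NLLB-style models select the output language through the first decoder token.
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling translate("Buenos días.", "ast_Latn") downloads the checkpoint on first use. For repeated calls, load the tokenizer and model once and reuse them rather than reloading inside the function.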

Evaluation

TRIBBLE was evaluated using BLEU and chrF metrics on the WMT24 devtest set:

Language Direction            Baseline (Apertium)   TRIBBLE (Constrained)
Spanish → Aragonese (BLEU)    61.1                  49.2
Spanish → Aragonese (chrF)    79.3                  73.6
Spanish → Aranese (BLEU)      28.8                  23.9
Spanish → Aranese (chrF)      49.4                  46.1
Spanish → Asturian (BLEU)     17.0                  17.9
Spanish → Asturian (chrF)     50.8                  50.5

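Apertium outperformed TRIBBLE on Aragonese and Aranese in both metrics. On Asturian the two systems are close: TRIBBLE scored slightly higher in BLEU (17.9 vs. 17.0) and slightly lower in chrF (50.5 vs. 50.8). TRIBBLE was trained in the constrained setting, which suggests potential for low-resource translation with efficient data use.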
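
The gaps between the two systems can be recomputed from the devtest scores reported in this section. The snippet below is only arithmetic on those numbers; the variable names are illustrative.

```python
# (language, metric) -> (Apertium baseline, TRIBBLE), copied from the evaluation results.
scores = {
    ("Aragonese", "BLEU"): (61.1, 49.2),
    ("Aragonese", "chrF"): (79.3, 73.6),
    ("Aranese", "BLEU"): (28.8, 23.9),
    ("Aranese", "chrF"): (49.4, 46.1),
    ("Asturian", "BLEU"): (17.0, 17.9),
    ("Asturian", "chrF"): (50.8, 50.5),
}

# Positive delta: TRIBBLE scores higher than the Apertium baseline.
deltas = {
    key: round(tribble - apertium, 1)
    for key, (apertium, tribble) in scores.items()
}

for (language, metric), delta in deltas.items():
    print(f"Spanish -> {language} ({metric}): {delta:+.1f}")
```

The only positive delta is Asturian BLEU (+0.9). The largest gap is Aragonese BLEU (-11.9).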

Citation

If you use TRIBBLE in your work, please cite:

@InProceedings{kuzmin-EtAl:2024:WMT,
author = {Kuzmin, Igor and Przybyła, Piotr and McGill, Euan and Saggion, Horacio},
title = {TRIBBLE - TRanslating IBerian languages Based on Limited E-resources},
booktitle = {Proceedings of the Ninth Conference on Machine Translation},
month = {November},
year = {2024},
address = {Miami, Florida, USA},
publisher = {Association for Computational Linguistics},
pages = {955--959},
abstract = {In this short overview paper, we describe our system submission for the language pairs Spanish to Aragonese (spa-arg), Spanish to Aranese (spa-arn), and Spanish to Asturian (spa-ast). We train a unified model for all language pairs in the constrained scenario. In addition, we add two language control tokens for Aragonese and Aranese Occitan, as there is already one present for Asturian. We take the distilled NLLB-200 model with 600M parameters and extend special tokens with 2 tokens that denote target languages (arn\_Latn, arg\_Latn) because Asturian was already presented in NLLB-200 model. We adapt the model by training on a special regime of data augmentation with both monolingual and bilingual training data for the language pairs in this challenge.},
url = {https://www2.statmt.org/wmt24/pdf/2024.wmt-1.94.pdf}
}