
opus-mt-tc-bible-big-deu_eng_fra_por_spa-afa

Model Details

Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to Afro-Asiatic languages (afa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS and training pipelines use the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2024-05-29
  • License: Apache-2.0
  • Language(s):
    • Source Language(s): deu eng fra por spa
    • Target Language(s): aar acm afb amh apc ara arc arq arz bcw byn cop daa dsh gde gnd hau hbo heb hig irk jpa kab ker kqp ktb kxc lln lme meq mfh mfi mfk mif mlt mpg mqb muy oar orm pbi phn rif sgw shi shy som sur syc syr taq thv tig tir tmc tmh tmr ttr tzm wal xed zgh
    • Valid Target Language Labels: >>aal<< >>aar<< >>aas<< >>acm<< >>afb<< >>agj<< >>ahg<< >>aij<< >>aiw<< >>ajw<< >>akk<< >>alw<< >>amh<< >>amw<< >>anc<< >>ank<< >>apc<< >>ara<< >>arc<< >>arq<< >>arv<< >>arz<< >>auj<< >>auo<< >>awn<< >>bbt<< >>bcq<< >>bcw<< >>bcy<< >>bde<< >>bdm<< >>bdn<< >>bds<< >>bej<< >>bhm<< >>bhn<< >>bhs<< >>bid<< >>bjf<< >>bji<< >>bnl<< >>bob<< >>bol<< >>bsw<< >>bta<< >>btf<< >>bux<< >>bva<< >>bvf<< >>bvh<< >>bvw<< >>bwo<< >>bwr<< >>bxe<< >>bxq<< >>byn<< >>cie<< >>ckl<< >>ckq<< >>cky<< >>cla<< >>cnu<< >>cop<< >>cop_Copt<< >>cuv<< >>daa<< >>dal<< >>dbb<< >>dbp<< >>dbq<< >>dbr<< >>dgh<< >>dim<< >>dkx<< >>dlk<< >>dme<< >>dot<< >>dox<< >>doz<< >>drs<< >>dsh<< >>dwa<< >>egy<< >>elo<< >>fie<< >>fkk<< >>fli<< >>gab<< >>gde<< >>gdf<< >>gdk<< >>gdl<< >>gdq<< >>gdu<< >>gea<< >>gek<< >>gew<< >>gex<< >>gez<< >>gft<< >>gha<< >>gho<< >>gid<< >>gis<< >>giz<< >>gji<< >>glo<< >>glw<< >>gnc<< >>gnd<< >>gou<< >>gow<< >>gqa<< >>grd<< >>grr<< >>gru<< >>gwd<< >>gwn<< >>har<< >>hau<< >>hau_Latn<< >>hbb<< >>hbo<< >>hbo_Hebr<< >>hdy<< >>heb<< >>hed<< >>hia<< >>hig<< >>hna<< >>hod<< >>hoh<< >>hrt<< >>hss<< >>huy<< >>hwo<< >>hya<< >>inm<< >>ior<< >>irk<< >>jaf<< >>jbe<< >>jbn<< >>jeu<< >>jia<< >>jie<< >>jii<< >>jim<< >>jmb<< >>jmi<< >>jnj<< >>jpa<< >>jpa_Hebr<< >>jrb<< >>juu<< >>kab<< >>kai<< >>kbz<< >>kcn<< >>kcs<< >>ker<< >>kil<< >>kkr<< >>kks<< >>kna<< >>kof<< >>kot<< >>kpa<< >>kqd<< >>kqp<< >>kqx<< >>ksq<< >>ktb<< >>ktc<< >>kuh<< >>kul<< >>kvf<< >>kvi<< >>kvj<< >>kwl<< >>kxc<< >>ldd<< >>lhs<< >>liq<< >>lln<< >>lme<< >>lsd<< >>maf<< >>mcn<< >>mcw<< >>mdx<< >>meq<< >>mes<< >>mew<< >>mey<< >>mfh<< >>mfi<< >>mfj<< >>mfk<< >>mfl<< >>mfm<< >>mid<< >>mif<< >>mje<< >>mjs<< >>mkf<< >>mlj<< >>mlr<< >>mlt<< >>mlw<< >>mmf<< >>mmy<< >>mou<< >>moz<< >>mpg<< >>mpi<< >>mpk<< >>mqb<< >>mrt<< >>mse<< >>msv<< >>mtl<< >>mub<< >>mug<< >>muj<< >>muu<< >>muy<< >>mvh<< >>mvz<< >>mxf<< >>mxu<< >>mys<< >>myz<< >>mzb<< >>nbh<< >>ndm<< >>ngi<< >>ngs<< >>ngw<< >>ngx<< >>nja<< >>nmi<< >>nnc<< >>nnn<< >>noz<< >>nxm<< >>oar<< >>oar_Hebr<< >>oar_Syrc<< >>orm<< >>oua<< >>pbi<< >>pcw<< >>phn<< >>phn_Phnx<< >>pip<< >>piy<< >>plj<< >>pqa<< >>rel<< >>rif<< >>rif_Latn<< >>rzh<< >>saa<< >>sam<< >>say<< >>scw<< >>sds<< >>sgw<< >>she<< >>shi<< >>shi_Latn<< >>shv<< >>shy<< >>shy_Latn<< >>sid<< >>sir<< >>siz<< >>sjs<< >>smp<< >>sok<< >>som<< >>sor<< >>sqr<< >>sqt<< >>ssn<< >>ssy<< >>stv<< >>sur<< >>swn<< >>swq<< >>swy<< >>syc<< >>syk<< >>syn<< >>syr<< >>tak<< >>tal<< >>tan<< >>taq<< >>tax<< >>tdk<< >>tez<< >>tgd<< >>thv<< >>tia<< >>tig<< >>tir<< >>tjo<< >>tmc<< >>tmh<< >>tmr<< >>tmr_Hebr<< >>tng<< >>tqq<< >>trg<< >>trj<< >>tru<< >>tsb<< >>tsh<< >>ttr<< >>twc<< >>tzm<< >>tzm_Latn<< >>tzm_Tfng<< >>ubi<< >>udl<< >>uga<< >>vem<< >>wal<< >>wbj<< >>wji<< >>wka<< >>wle<< >>xaa<< >>xan<< >>xeb<< >>xed<< >>xhd<< >>xmd<< >>xmj<< >>xna<< >>xpu<< >>xqt<< >>xsa<< >>ymm<< >>zah<< >>zay<< >>zaz<< >>zen<< >>zgh<< >>zim<< >>ziz<< >>zns<< >>zrn<< >>zua<< >>zuy<< >>zwa<<
  • Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (id = a valid target language ID), e.g. >>aar<<.
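
A minimal sketch (not part of the original card) of how a target-language label can be prepended to a source sentence, including a check that the label actually exists in the tokenizer vocabulary; the label and example sentence are illustrative:

from transformers import MarianTokenizer

# hypothetical check: confirm the target-language token is known before using it
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-afa"
tokenizer = MarianTokenizer.from_pretrained(model_name)

label = ">>kab<<"
assert label in tokenizer.get_vocab(), f"unknown target-language token: {label}"

# the label is simply prepended to the source sentence
src_text = [f"{label} Tu seras parmi nous demain."]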

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>kab<< Tu seras parmi nous demain.",
    ">>heb<< Let's get out of here while we can."
]

model_name = "pytorch-models/opus-mt-tc-bible-big-deu_eng_fra_por_spa-afa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Azekka ad tiliḍ yid-i
#     בוא נצא מכאן כל עוד אנחנו יכולים.
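
Equivalently, the whole generated batch can be decoded in a single call with the tokenizer's standard batch_decode method (a minimal variant of the loop above):

# decode the whole batch at once instead of looping over sentences
print(tokenizer.batch_decode(translated, skip_special_tokens=True))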

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-afa")
print(pipe(">>kab<< Tu seras parmi nous demain."))

# expected output: Azekka ad tiliḍ yid-i

Training

Evaluation

langpair testset chr-F BLEU #sent #words
deu-ara tatoeba-test-v2021-08-07 0.49517 20.2 1209 6324
deu-heb tatoeba-test-v2021-08-07 0.56943 35.8 3090 20341
eng-ara tatoeba-test-v2021-08-07 0.46273 17.3 10305 61356
eng-heb tatoeba-test-v2021-08-07 0.57708 34.9 10519 63628
eng-mlt tatoeba-test-v2021-08-07 0.61044 29.5 203 899
fra-ara tatoeba-test-v2021-08-07 0.42223 10.4 1569 7956
fra-heb tatoeba-test-v2021-08-07 0.58681 37.5 3281 20655
por-heb tatoeba-test-v2021-08-07 0.61593 41.0 719 4423
spa-ara tatoeba-test-v2021-08-07 0.53669 23.9 1511 7547
spa-heb tatoeba-test-v2021-08-07 0.61966 41.2 1849 12112
deu-ara flores101-devtest 0.47927 15.7 1012 21357
eng-hau flores101-devtest 0.47807 19.0 1012 27730
eng-mlt flores101-devtest 0.67196 32.9 1012 22169
fra-mlt flores101-devtest 0.56271 19.9 1012 22169
por-heb flores101-devtest 0.49378 19.6 1012 20749
spa-ara flores101-devtest 0.44988 11.7 1012 21357
deu-ara flores200-devtest 0.661 0.0 1012 5
deu-hau flores200-devtest 0.40471 11.4 1012 27730
deu-heb flores200-devtest 0.48645 18.1 1012 20238
deu-mlt flores200-devtest 0.54079 17.5 1012 22169
eng-ara flores200-devtest 0.627 0.0 1012 5
eng-arz flores200-devtest 0.42804 11.1 1012 21034
eng-hau flores200-devtest 0.49023 20.4 1012 27730
eng-heb flores200-devtest 0.56635 27.1 1012 20238
eng-mlt flores200-devtest 0.68334 34.9 1012 22169
eng-som flores200-devtest 0.42814 9.9 1012 25991
fra-ara flores200-devtest 0.631 0.0 1012 5
fra-hau flores200-devtest 0.42731 13.2 1012 27730
fra-heb flores200-devtest 0.49683 19.1 1012 20238
fra-mlt flores200-devtest 0.56844 20.4 1012 22169
por-ara flores200-devtest 0.622 0.0 1012 5
por-hau flores200-devtest 0.42593 13.6 1012 27730
por-heb flores200-devtest 0.50345 19.7 1012 20238
por-mlt flores200-devtest 0.58913 21.5 1012 22169
spa-ara flores200-devtest 0.587 0.0 1012 5
spa-hau flores200-devtest 0.40309 9.4 1012 27730
spa-heb flores200-devtest 0.45249 13.5 1012 20238
spa-mlt flores200-devtest 0.51077 12.7 1012 22169
eng-hau newstest2021 0.43617 13.1 1000 32966
deu-hau ntrex128 0.41931 12.5 1997 54982
deu-heb ntrex128 0.43961 13.3 1997 39624
deu-mlt ntrex128 0.49871 15.1 1997 43308
eng-hau ntrex128 0.51601 23.2 1997 54982
eng-heb ntrex128 0.50625 20.3 1997 39624
eng-mlt ntrex128 0.62552 29.0 1997 43308
eng-som ntrex128 0.46845 13.5 1997 49351
fra-hau ntrex128 0.43729 14.5 1997 54982
fra-heb ntrex128 0.43855 13.9 1997 39624
fra-mlt ntrex128 0.51640 17.3 1997 43308
fra-som ntrex128 0.41813 9.6 1997 49351
por-hau ntrex128 0.44408 15.1 1997 54982
por-heb ntrex128 0.45739 15.0 1997 39624
por-mlt ntrex128 0.53719 18.2 1997 43308
por-som ntrex128 0.41367 9.3 1997 49351
spa-hau ntrex128 0.44695 14.8 1997 54982
spa-heb ntrex128 0.45509 14.5 1997 39624
spa-mlt ntrex128 0.53631 17.7 1997 43308
spa-som ntrex128 0.41755 9.1 1997 49351
eng-ara tico19-test 0.56288 25.4 2100 51339
eng-hau tico19-test 0.50060 22.2 2100 64509
fra-amh tico19-test 3.575 1.3 2100 44782
fra-hau tico19-test 5.071 1.8 2100 64509
fra-orm tico19-test 4.044 1.8 2100 50032
fra-som tico19-test 2.698 0.9 2100 63654
fra-tir tico19-test 4.151 1.4 2100 46685
por-amh tico19-test 3.799 1.4 2100 44782
por-ara tico19-test 0.44442 16.0 2100 51339
por-hau tico19-test 5.786 2.0 2100 64509
por-orm tico19-test 4.613 2.0 2100 50032
por-som tico19-test 3.413 1.2 2100 63654
por-tir tico19-test 5.092 1.6 2100 46685
spa-amh tico19-test 3.831 1.4 2100 44782
spa-ara tico19-test 0.45429 16.5 2100 51339
spa-hau tico19-test 5.790 1.9 2100 64509
spa-orm tico19-test 4.617 1.9 2100 50032
spa-som tico19-test 3.402 1.2 2100 63654
spa-tir tico19-test 5.033 1.6 2100 46685
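
A minimal sketch, assuming the sacrebleu package (commonly used for these benchmarks) and hypothetical plain-text hypothesis/reference files with one sentence per line, of how comparable chr-F and BLEU scores can be computed; note that sacrebleu reports chrF on a 0-100 scale by default, while most rows above use a 0-1 scale:

import sacrebleu

# hypothetical file names: system output and reference for one language pair
with open("hyp.deu-heb.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.deu-heb.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

# divide sacrebleu's 0-100 chrF score by 100 to match the table above
print(f"BLEU  = {bleu.score:.1f}")
print(f"chr-F = {chrf.score / 100:.5f}")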

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 08:58:38 EEST 2024
  • port machine: LM0-400-22516.local