---
license: cc0-1.0
language:
  - mt
tags:
  - MaltBERTa
  - MaCoCu
---

# Model description

XLMR-MaltBERTa is a large pre-trained language model trained on Maltese texts. It was created by continuing training from the XLM-RoBERTa-large model. It was developed as part of the MaCoCu project. The main developer is Rik van Noord from the University of Groningen.

XLMR-MaltBERTa was trained on 3.2GB of text, equal to 439M tokens. It was trained for 50,000 steps with a batch size of 1,024 and uses the same vocabulary as the original XLMR-large model. It is trained on the same data as MaltBERTa, but whereas that model was trained from scratch with the RoBERTa architecture, XLMR-MaltBERTa continues training from XLM-R-large.

The training and fine-tuning procedures are described in detail in our GitHub repo.
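The exact training scripts live in that repository. Purely as an illustration of the general setup, continued masked-language-model training from XLM-R-large could be run with the `transformers` `Trainer` roughly as sketched below; the corpus path, sequence length and learning rate are placeholders and not the values used for this model, only the 50,000 steps and effective batch size of 1,024 match the description above.

```python
# Rough sketch of continued MLM pre-training from XLM-R-large.
# "maltese_corpus.txt" is a placeholder path, not part of this repo.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

# Plain-text Maltese corpus, one document per line (placeholder file).
dataset = load_dataset("text", data_files={"train": "maltese_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlmr-maltberta-continued",
    max_steps=50_000,                  # number of steps reported above
    per_device_train_batch_size=32,    # illustrative; together with accumulation
    gradient_accumulation_steps=32,    # this gives an effective batch size of 1,024
    learning_rate=1e-4,                # placeholder hyperparameter
    save_steps=5_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```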

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")    # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaltBERTa")  # TensorFlow
```
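As a minimal usage sketch with the PyTorch model, hidden states can be extracted as follows; the example Maltese sentence and the choice of the `<s>` token embedding as a simple sentence representation are purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")  # PyTorch

# Encode an example Maltese sentence and take the <s> (first) token embedding.
inputs = tokenizer("Malta hija gżira fil-Baħar Mediterran.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

sentence_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, 1024)
print(sentence_embedding.shape)
```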

# Data

For training, we used all Maltese data present in the MaCoCu, OSCAR and mC4 corpora. After de-duplicating the data, we were left with a total of 3.2GB of text.
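The actual de-duplication pipeline is documented in our GitHub repo; the snippet below is only a generic illustration of exact line-level de-duplication, with placeholder file names.

```python
import hashlib

def deduplicate_lines(in_path, out_path):
    """Keep only the first occurrence of each line (exact-match de-duplication)."""
    seen = set()
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            digest = hashlib.sha1(line.strip().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                fout.write(line)

# Placeholder file names for illustration only.
deduplicate_lines("maltese_raw.txt", "maltese_dedup.txt")
```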

# Benchmark performance

We tested the performance of XLMR-MaltBERTa on the UPOS and XPOS benchmarks of the Universal Dependencies project. We compare against the strong multilingual models XLMR-base and XLMR-large, though note that Maltese was not one of the training languages for those models. We also compare against the recently introduced Maltese language models BERTu, mBERTu and our own MaltBERTa. For details regarding the fine-tuning procedure, see our GitHub.

Scores are averages of three runs. We use the same hyperparameter settings for all models.

|                | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) |
|----------------|------------|-------------|------------|-------------|
| XLM-R-base     | 93.6       | 93.2        | 93.4       | 93.2        |
| XLM-R-large    | 94.9       | 94.4        | 95.1       | 94.7        |
| BERTu          | 97.5       | 97.6        | 95.7       | 95.8        |
| mBERTu         | 97.7       | 97.8        | 97.9       | 98.1        |
| MaltBERTa      | 95.7       | 95.8        | 96.1       | 96.0        |
| XLMR-MaltBERTa | 97.7       | 98.1        | 98.1       | 98.2        |
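Our exact fine-tuning setup is in the GitHub repo mentioned above. As a rough sketch only, a UPOS tagger could be fine-tuned on the Maltese UD treebank with `AutoModelForTokenClassification` along these lines; the `universal_dependencies` / `mt_mudt` dataset name and all hyperparameters here are assumptions, not necessarily what we used.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# UD_Maltese-MUDT via the `universal_dependencies` dataset (assumed config name).
dataset = load_dataset("universal_dependencies", "mt_mudt")
labels = dataset["train"].features["upos"].feature.names

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-MaltBERTa", num_labels=len(labels)
)

def encode(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, upos in enumerate(batch["upos"]):
        word_ids = enc.word_ids(batch_index=i)
        # Label only the first sub-token of each word; ignore the rest (-100).
        prev, label_row = None, []
        for wid in word_ids:
            if wid is None or wid == prev:
                label_row.append(-100)
            else:
                label_row.append(upos[wid])
            prev = wid
        enc["labels"].append(label_row)
    return enc

encoded = dataset.map(encode, batched=True, remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="xlmr-maltberta-upos",
    num_train_epochs=3,              # placeholder hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```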

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Garc{\'\i}a-Romero, Cristian  and
      Kuzman, Taja  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      van Noord, Rik  and
      Sempere, Leopoldo Pla  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Rupnik, Peter  and
      Suchomel, V{\'\i}t  and
      Toral, Antonio  and
      van der Werff, Tobias  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```