---
license: cc0-1.0
language:
- mt
tags:
- MaltBERTa
- MaCoCu
---

# Model description

**XLMR-MaltBERTa** is a large pre-trained language model trained on Maltese texts. It was created by continuing training from the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model and was developed as part of the [MaCoCu](https://macocu.eu/) project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.

XLMR-MaltBERTa was trained on 3.2GB of text, which amounts to 439M tokens. It was trained for 50,000 steps with a batch size of 1,024 and uses the same vocabulary as the original XLMR-large model. It is trained on the same data as [MaltBERTa](https://huggingface.co/RVN/MaltBERTa); the difference is that MaltBERTa was trained from scratch with the RoBERTa architecture, whereas XLMR-MaltBERTa continues training from XLM-R-large.
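
The card does not spell out the training code, but continued masked-language-model pre-training from XLM-R-large can be sketched with the `transformers` `Trainer` as below. The corpus file name, learning rate and the per-device/accumulation split of the 1,024 batch are illustrative assumptions; the actual procedure is in the GitHub repo linked below.

```python
# Hedged sketch of continued MLM pre-training from XLM-R-large (not the exact
# MaCoCu recipe; the data file and several hyperparameters below are assumptions).
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the public XLM-R-large checkpoint; the vocabulary is left unchanged.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

# Placeholder corpus: one Maltese document per line in a plain-text file.
raw = load_dataset("text", data_files={"train": "maltese_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="xlmr-maltberta-continued",
        max_steps=50_000,                 # number of steps reported above
        per_device_train_batch_size=8,    # assumed; only the total batch of 1,024 is reported
        gradient_accumulation_steps=128,  # 8 * 128 = 1,024 on a single device
        learning_rate=1e-4,               # assumption, not stated in the card
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```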
15
+
16
+ The training and fine-tuning procedures are described in detail on our [Github repo](https://github.com/macocu/LanguageModels).

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")    # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaltBERTa")  # TensorFlow
```
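
As a quick check that the model loads and runs, the snippet below (not part of the original card; the Maltese sentence is only an example) extracts contextual token embeddings with the PyTorch model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")

# Example Maltese sentence (illustrative only).
inputs = tokenizer("Malta hija gżira fil-Baħar Mediterran.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per (sub)token: (batch, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```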
27
+
28
+ # Data
29
+
30
+ For training, we used all Maltese data that was present in the [MaCoCu](https://macocu.eu/), Oscar and mc4 corpora. After de-duplicating the data, we were left with a total of 3.2GB of text.
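
The de-duplication pipeline itself is documented in the GitHub repo; purely as an illustration of the idea, exact line-level de-duplication can be done as follows (file names are placeholders):

```python
# Illustrative exact line-level de-duplication; the actual MaCoCu pipeline
# is documented in the GitHub repo linked above.
seen = set()
with open("maltese_raw.txt", encoding="utf-8") as src, \
     open("maltese_dedup.txt", "w", encoding="utf-8") as dst:
    for line in src:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            dst.write(line)
```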

# Benchmark performance

We tested the performance of XLMR-MaltBERTa on the UPOS and XPOS benchmarks of the [Universal Dependencies](https://universaldependencies.org/) project. We compare its performance to the strong multilingual models XLMR-base and XLMR-large, though note that Maltese was not one of their training languages. We also compare to the recently introduced Maltese language models [BERTu](https://huggingface.co/MLRS/BERTu), [mBERTu](https://huggingface.co/MLRS/mBERTu) and our own [MaltBERTa](https://huggingface.co/RVN/MaltBERTa). For details regarding the fine-tuning procedure, please see our [GitHub repo](https://github.com/macocu/LanguageModels).

Scores are averages of three runs. We use the same hyperparameter settings for all models.

|                    | **UPOS Dev** | **UPOS Test** | **XPOS Dev** | **XPOS Test** |
|--------------------|:------------:|:-------------:|:------------:|:-------------:|
| **XLM-R-base**     | 93.6         | 93.2          | 93.4         | 93.2          |
| **XLM-R-large**    | 94.9         | 94.4          | 95.1         | 94.7          |
| **BERTu**          | 97.5         | 97.6          | 95.7         | 95.8          |
| **mBERTu**         | **97.7**     | 97.8          | 97.9         | 98.1          |
| **MaltBERTa**      | 95.7         | 95.8          | 96.1         | 96.0          |
| **XLMR-MaltBERTa** | **97.7**     | **98.1**      | **98.1**     | **98.2**      |
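
The exact fine-tuning recipe behind these numbers is in the GitHub repo; the sketch below only illustrates how UPOS fine-tuning of XLMR-MaltBERTa could be set up with `transformers` and `datasets`. The treebank config (`mt_mudt`), the sub-token label alignment and the hyperparameters are assumptions, not the settings used for the reported scores.

```python
# Hedged sketch of UPOS fine-tuning (token classification); hyperparameters are
# illustrative and do not reproduce the benchmark numbers above.
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Maltese UD treebank (MUDT); recent `datasets` versions may need trust_remote_code=True.
ds = load_dataset("universal_dependencies", "mt_mudt")
label_names = ds["train"].features["upos"].feature.names

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-MaltBERTa", num_labels=len(label_names)
)

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, upos in enumerate(batch["upos"]):
        previous = None
        labels = []
        for word_id in enc.word_ids(batch_index=i):
            # Score only the first sub-token of each word; mask the rest with -100.
            labels.append(-100 if word_id is None or word_id == previous else upos[word_id])
            previous = word_id
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

encoded = ds.map(tokenize_and_align, batched=True, remove_columns=ds["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="upos-xlmr-maltberta",
        learning_rate=5e-5,               # assumption
        per_device_train_batch_size=16,   # assumption
        num_train_epochs=3,               # assumption
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```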

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta and
      Espl{\`a}-Gomis, Miquel and
      Forcada, Mikel L. and
      Garc{\'\i}a-Romero, Cristian and
      Kuzman, Taja and
      Ljube{\v{s}}i{\'c}, Nikola and
      van Noord, Rik and
      Sempere, Leopoldo Pla and
      Ram{\'\i}rez-S{\'a}nchez, Gema and
      Rupnik, Peter and
      Suchomel, V{\'\i}t and
      Toral, Antonio and
      van der Werff, Tobias and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```