File size: 4,414 Bytes
951dd95 3a01efc 951dd95 3a01efc 5fac5f4 3a01efc f7b2751 3a01efc |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 |
---
license: cc0-1.0
language:
- mt
tags:
- MaltBERTa
- MaCoCu
---
# Model description
**XLMR-MaltBERTa** is a large pre-trained language model trained on Maltese texts. It was created by continuing training from the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model. It was developed as part of the [MaCoCu](https://macocu.eu/) project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.
XLMR-MaltBERTa was trained on 3.2GB of text, which is equal to 439M tokens. It was trained for 50,000 steps with a batch size of 1,024. It uses the same vocabulary as the original XLMR-large model. The model is trained on the same data as [MaltBERTa](https://huggingface.co/RVN/MaltBERTa), but this model was trained from scratch using the RoBERTa architecture.
The training and fine-tuning procedures are described in detail on our [Github repo](https://github.com/macocu/LanguageModels).
# How to use
```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa") # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaltBERTa") # Tensorflow
```
# Data
For training, we used all Maltese data that was present in the [MaCoCu](https://macocu.eu/), Oscar and mc4 corpora. After de-duplicating the data, we were left with a total of 3.2GB of text.
# Benchmark performance
We tested the performance of MaltBERTa on the UPOS and XPOS benchmark of the [Universal Dependencies](https://universaldependencies.org/) project. Moreover, we test on a Google Translated version of the COPA data set (see our [Github repo](https://github.com/RikVN/COPA) for details). We compare performance to the strong multi-lingual models XLMR-base and XLMR-large, though note that Maltese was not one of the training languages for those models. We also compare to the recently introduced Maltese language models [BERTu](https://huggingface.co/MLRS/BERTu), [mBERTu](https://huggingface.co/MLRS/mBERTu) and our own [MaltBERTa](https://huggingface.co/RVN/MaltBERTa). For details regarding the fine-tuning procedure you can checkout our [Github](https://github.com/macocu/LanguageModels).
Scores are averages of three runs for UPOS/XPOS and 10 runs for COPA. We use the same hyperparameter settings for all models for UPOS/XPOS, while for COPA we optimize on the dev set.
| | **UPOS** | **UPOS** | **XPOS** | **XPOS** | **COPA** |
|-----------------|:--------:|:--------:|:--------:|:--------:| :--------:|
| | **Dev** | **Test** | **Dev** | **Test** | **Test** |
| **XLM-R-base** | 93.6 | 93.2 | 93.4 | 93.2 | 52.2 |
| **XLM-R-large** | 94.9 | 94.4 | 95.1 | 94.7 | 54.0 |
| **BERTu** | 97.5 | 97.6 | 95.7 | 95.8 | **55.6** |
| **mBERTu** | **97.7** | 97.8 | 97.9 | 98.1 52.6 |
| **MaltBERTa** | 95.7 | 95.8 | 96.1 | 96.0 | 53.7 |
| **XLMR-MaltBERTa** | **97.7** | **98.1** | **98.1** | **98.2** | 54.4 |
# Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union’s Connecting Europe Facility 2014-
2020 - CEF Telecom, under Grant Agreement No.INEA/CEF/ICT/A2020/2278341 (MaCoCu).
# Citation
If you use this model, please cite the following paper:
```bibtex
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}
``` |