---
license: cc0-1.0
language:
- mt
tags:
- MaltBERTa
- MaCoCu
---

# Model description

**XLMR-MaltBERTa** is a large pre-trained language model trained on Maltese texts. It was created by continuing training from the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model and was developed as part of the [MaCoCu](https://macocu.eu/) project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.

XLMR-MaltBERTa was trained on 3.2GB of text, which amounts to 439M tokens. It was trained for 50,000 steps with a batch size of 1,024 and uses the same vocabulary as the original XLMR-large model. It is trained on the same data as [MaltBERTa](https://huggingface.co/RVN/MaltBERTa); the difference is that MaltBERTa was trained from scratch with the RoBERTa architecture, whereas XLMR-MaltBERTa continues training from XLM-R-large.
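
The card does not spell out the training code, but continued masked-language-model pre-training from XLM-R-large can be sketched with the `transformers` `Trainer` as below. The corpus file name, learning rate and the per-device/accumulation split of the 1,024 batch are illustrative assumptions; the actual procedure is in the GitHub repo linked below.

```python
# Hedged sketch of continued MLM pre-training from XLM-R-large (not the exact
# MaCoCu recipe; the data file and several hyperparameters below are assumptions).
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the public XLM-R-large checkpoint; the vocabulary is left unchanged.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

# Placeholder corpus: one Maltese document per line in a plain-text file.
raw = load_dataset("text", data_files={"train": "maltese_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="xlmr-maltberta-continued",
        max_steps=50_000,                 # number of steps reported above
        per_device_train_batch_size=8,    # assumed; only the total batch of 1,024 is reported
        gradient_accumulation_steps=128,  # 8 * 128 = 1,024 on a single device
        learning_rate=1e-4,               # assumption, not stated in the card
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```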
15
+
16
+ The training and fine-tuning procedures are described in detail on our [Github repo](https://github.com/macocu/LanguageModels).

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")    # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaltBERTa")  # TensorFlow
```
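
As a quick check that the model loads and runs, the snippet below (not part of the original card; the Maltese sentence is only an example) extracts contextual token embeddings with the PyTorch model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")

# Example Maltese sentence (illustrative only).
inputs = tokenizer("Malta hija gżira fil-Baħar Mediterran.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per (sub)token: (batch, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```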
27
+
28
+ # Data
29
+
30
+ For training, we used all Maltese data that was present in the [MaCoCu](https://macocu.eu/), Oscar and mc4 corpora. After de-duplicating the data, we were left with a total of 3.2GB of text.
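
The de-duplication pipeline itself is documented in the GitHub repo; purely as an illustration of the idea, exact line-level de-duplication can be done as follows (file names are placeholders):

```python
# Illustrative exact line-level de-duplication; the actual MaCoCu pipeline
# is documented in the GitHub repo linked above.
seen = set()
with open("maltese_raw.txt", encoding="utf-8") as src, \
     open("maltese_dedup.txt", "w", encoding="utf-8") as dst:
    for line in src:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            dst.write(line)
```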

# Benchmark performance

We tested the performance of XLMR-MaltBERTa on the UPOS and XPOS benchmarks of the [Universal Dependencies](https://universaldependencies.org/) project. We compare its performance to the strong multilingual models XLMR-base and XLMR-large, though note that Maltese was not one of their training languages. We also compare to the recently introduced Maltese language models [BERTu](https://huggingface.co/MLRS/BERTu), [mBERTu](https://huggingface.co/MLRS/mBERTu) and our own [MaltBERTa](https://huggingface.co/RVN/MaltBERTa). For details regarding the fine-tuning procedure, please see our [GitHub repo](https://github.com/macocu/LanguageModels).

Scores are averages of three runs. We use the same hyperparameter settings for all models.

|                    | **UPOS Dev** | **UPOS Test** | **XPOS Dev** | **XPOS Test** |
|--------------------|:------------:|:-------------:|:------------:|:-------------:|
| **XLM-R-base**     | 93.6         | 93.2          | 93.4         | 93.2          |
| **XLM-R-large**    | 94.9         | 94.4          | 95.1         | 94.7          |
| **BERTu**          | 97.5         | 97.6          | 95.7         | 95.8          |
| **mBERTu**         | **97.7**     | 97.8          | 97.9         | 98.1          |
| **MaltBERTa**      | 95.7         | 95.8          | 96.1         | 96.0          |
| **XLMR-MaltBERTa** | **97.7**     | **98.1**      | **98.1**     | **98.2**      |
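
The exact fine-tuning recipe behind these numbers is in the GitHub repo; the sketch below only illustrates how UPOS fine-tuning of XLMR-MaltBERTa could be set up with `transformers` and `datasets`. The treebank config (`mt_mudt`), the sub-token label alignment and the hyperparameters are assumptions, not the settings used for the reported scores.

```python
# Hedged sketch of UPOS fine-tuning (token classification); hyperparameters are
# illustrative and do not reproduce the benchmark numbers above.
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Maltese UD treebank (MUDT); recent `datasets` versions may need trust_remote_code=True.
ds = load_dataset("universal_dependencies", "mt_mudt")
label_names = ds["train"].features["upos"].feature.names

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-MaltBERTa", num_labels=len(label_names)
)

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, upos in enumerate(batch["upos"]):
        previous = None
        labels = []
        for word_id in enc.word_ids(batch_index=i):
            # Score only the first sub-token of each word; mask the rest with -100.
            labels.append(-100 if word_id is None or word_id == previous else upos[word_id])
            previous = word_id
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

encoded = ds.map(tokenize_and_align, batched=True, remove_columns=ds["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="upos-xlmr-maltberta",
        learning_rate=5e-5,               # assumption
        per_device_train_batch_size=16,   # assumption
        num_train_epochs=3,               # assumption
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```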

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta and
      Espl{\`a}-Gomis, Miquel and
      Forcada, Mikel L. and
      Garc{\'\i}a-Romero, Cristian and
      Kuzman, Taja and
      Ljube{\v{s}}i{\'c}, Nikola and
      van Noord, Rik and
      Sempere, Leopoldo Pla and
      Ram{\'\i}rez-S{\'a}nchez, Gema and
      Rupnik, Peter and
      Suchomel, V{\'\i}t and
      Toral, Antonio and
      van der Werff, Tobias and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```