---
license: cc0-1.0
language:
- bg
- mk
tags:
- BERTovski
- MaCoCu
---

# Model description

**XLMR-BERTovski** is a large pre-trained language model trained on Bulgarian and Macedonian texts. It was created by continuing training from the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model and was developed as part of the [MaCoCu](https://macocu.eu/) project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.

XLMR-BERTovski was trained on 74 GB of Bulgarian and Macedonian text, which amounts to just over 7 billion tokens. It was trained for 67,500 steps with a batch size of 1,024, which corresponds to roughly 2.5 epochs. It uses the same vocabulary as the original XLM-R-large model. It was trained on the same data as [BERTovski](https://huggingface.co/RVN/BERTovski), though that model was trained from scratch using the RoBERTa architecture.

The training and fine-tuning procedures are described in detail on our [GitHub repo](https://github.com/macocu/LanguageModels).

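For orientation only, here is a minimal sketch of what continued masked-language-model training from XLM-R-large could look like with the Hugging Face `Trainer`. The corpus path, learning rate, and the batch-size/gradient-accumulation split are placeholders, not the settings used for XLMR-BERTovski; please refer to the repo above for the actual scripts.

```python
# Hedged sketch: continue MLM pretraining of XLM-R-large on Bulgarian/Macedonian text.
# Paths and most hyperparameters are illustrative only; see the MaCoCu repo for the real setup.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

# Plain-text corpus, one document per line (placeholder file name).
raw = load_dataset("text", data_files={"train": "bg_mk_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% masking for masked language modelling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlmr-bertovski-continued",
    per_device_train_batch_size=8,   # with accumulation, an effective batch of 1,024 on one device
    gradient_accumulation_steps=128,
    max_steps=67_500,                # number of steps reported in this card
    learning_rate=1e-4,              # placeholder value
    save_steps=5_000,
    logging_steps=500,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```
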
# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-BERTovski")
model = AutoModel.from_pretrained("RVN/XLMR-BERTovski")    # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-BERTovski")  # TensorFlow
```

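As a quick sanity check (an illustrative example, not part of the original card), the PyTorch model can be used to extract contextual embeddings for a Bulgarian or Macedonian sentence:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-BERTovski")
model = AutoModel.from_pretrained("RVN/XLMR-BERTovski")

# "This is an example sentence." in Bulgarian (illustrative input).
inputs = tokenizer("Това е примерно изречение.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per subword token: (batch_size, sequence_length, hidden_size),
# where the hidden size is 1024 for the XLM-R-large architecture.
print(outputs.last_hidden_state.shape)
```
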
# Data

For training, we used all Bulgarian and Macedonian data that was present in the [MaCoCu](https://macocu.eu/), Oscar, mc4 and Wikipedia corpora. In a manual analysis we found that for Oscar and mc4, if the data did not come from the corresponding domain (.bg or .mk), it was often (badly) machine translated. Therefore, we opted to only use data that originally came from a .bg or .mk domain.

After de-duplicating the data, we were left with a total of 54.5 GB of Bulgarian and 9 GB of Macedonian text. Since there was quite a bit more Bulgarian data, we simply doubled the Macedonian data during training.

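Concretely, "doubling" here means oversampling: the Macedonian text is repeated twice when the training corpus is assembled, so the model sees it roughly twice as often per pass over the Bulgarian data. A toy sketch of that idea (with hypothetical file names, not the project's actual data pipeline):

```python
# Toy sketch of 2x oversampling the Macedonian data; file names are hypothetical and
# the real corpora are far too large to read into memory like this.
from pathlib import Path
import random

bulgarian = Path("bg_dedup.txt").read_text(encoding="utf-8").splitlines()
macedonian = Path("mk_dedup.txt").read_text(encoding="utf-8").splitlines()

# Repeat the smaller Macedonian portion twice, then shuffle the combined corpus.
combined = bulgarian + 2 * macedonian
random.shuffle(combined)

Path("train_corpus.txt").write_text("\n".join(combined) + "\n", encoding="utf-8")
```
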
# Benchmark performance

We tested the performance of XLMR-BERTovski on benchmarks of XPOS, UPOS and NER. For Bulgarian, we used the data from the [Universal Dependencies](http://nl.ijs.si/nikola/macocu/bertovski.tgz) project. For Macedonian, we used the data sets created in the [babushka-bench](https://github.com/clarinsi/babushka-bench/) project. We compare performance to [BERTovski](https://huggingface.co/RVN/BERTovski) and to the strong multilingual models XLM-R-base and XLM-R-large. For details regarding the fine-tuning procedure, you can check out our [GitHub](https://github.com/macocu/LanguageModels).

Scores are averages of three runs. We use the same hyperparameter settings for all models.

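To give an idea of the setup behind these numbers, the sketch below shows a generic token-classification fine-tune (UPOS/XPOS tagging or NER) on top of XLMR-BERTovski with the standard Hugging Face API. The label set, data loading and hyperparameters are placeholders rather than the settings from our repo.

```python
# Hedged sketch: token classification (UPOS/XPOS tagging or NER) on top of XLMR-BERTovski.
# Label set, data loading and hyperparameters are placeholders; see the MaCoCu repo for the real setup.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # example NER tag set

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-BERTovski")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-BERTovski",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

def tokenize_and_align(batch):
    """Tokenize pre-split words; label the first subword of each word, mask the rest with -100."""
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["tags"]):
        word_ids = enc.word_ids(batch_index=i)
        previous, aligned = None, []
        for w in word_ids:
            aligned.append(-100 if w is None or w == previous else tags[w])
            previous = w
        all_labels.append(aligned)
    enc["labels"] = all_labels
    return enc

# train_ds / dev_ds are assumed to be `datasets` splits with "tokens" (words) and "tags" (label ids).
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="xlmr-bertovski-tokcls", learning_rate=2e-5,
#                            per_device_train_batch_size=16, num_train_epochs=10),
#     train_dataset=train_ds.map(tokenize_and_align, batched=True),
#     eval_dataset=dev_ds.map(tokenize_and_align, batched=True),
#     data_collator=DataCollatorForTokenClassification(tokenizer),
# )
# trainer.train()
```
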
## Bulgarian

|                    | **UPOS** | **UPOS** | **XPOS** | **XPOS** | **NER** | **NER**  |
|--------------------|:--------:|:--------:|:--------:|:--------:|:-------:|:--------:|
|                    | **Dev**  | **Test** | **Dev**  | **Test** | **Dev** | **Test** |
| **XLM-R-base**     | 99.2     | 99.4     | 98.0     | 98.3     | 93.2    | 92.9     |
| **XLM-R-large**    | 99.3     | 99.4     | 97.4     | 97.7     | 93.7    | 93.5     |
| **BERTovski**      | 98.8     | 99.1     | 97.6     | 97.8     | 93.5    | 93.3     |
| **XLMR-BERTovski** | 99.3     | 99.5     | 98.5     | 98.8     | 94.4    | 94.3     |

## Macedonian

|                    | **UPOS** | **UPOS** | **XPOS** | **XPOS** | **NER** | **NER**  |
|--------------------|:--------:|:--------:|:--------:|:--------:|:-------:|:--------:|
|                    | **Dev**  | **Test** | **Dev**  | **Test** | **Dev** | **Test** |
| **XLM-R-base**     | 98.3     | 98.6     | 97.3     | 97.1     | 92.8    | 94.8     |
| **XLM-R-large**    | 98.3     | 98.7     | 97.7     | 97.5     | 93.3    | 95.1     |
| **BERTovski**      | 97.8     | 98.1     | 96.4     | 96.0     | 92.8    | 94.6     |
| **XLMR-BERTovski** | 98.6     | 98.8     | 98.0     | 97.7     | 94.4    | 96.3     |

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta and
      Espl{\`a}-Gomis, Miquel and
      Forcada, Mikel L. and
      Garc{\'\i}a-Romero, Cristian and
      Kuzman, Taja and
      Ljube{\v{s}}i{\'c}, Nikola and
      van Noord, Rik and
      Sempere, Leopoldo Pla and
      Ram{\'\i}rez-S{\'a}nchez, Gema and
      Rupnik, Peter and
      Suchomel, V{\'\i}t and
      Toral, Antonio and
      van der Werff, Tobias and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```