---
license: cc0-1.0
language:
- is
tags:
- MaCoCu
---

# Model description

**XLMR-base-MaCoCu-is** is a pre-trained language model trained on **Icelandic** texts. It was created by continuing training from the [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) model and was developed as part of the [MaCoCu](https://macocu.eu/) project, using only data crawled during the project. The main developer is [Jaume Zaragoza-Bernabeu](https://github.com/ZJaume) from Prompsit Language Engineering.

XLMR-base-MaCoCu-is was trained on 4.4 GB of Icelandic text, which equals 688M tokens. It was trained for 40,000 steps with a batch size of 256 and uses the same vocabulary as the original XLMR-base model.
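
As a rough sketch, such continued masked-language-model pretraining from XLM-RoBERTa-base can be set up with the Hugging Face `Trainer`. This is not the exact MaCoCu training script (see the repository linked below); the corpus file name, sequence length, learning rate and per-device batch size are assumptions for illustration, and only the starting checkpoint, the 40,000 steps and the effective batch size of 256 come from the description above.

```python
# Sketch of continued MLM pretraining from XLM-RoBERTa-base.
# Assumed: icelandic.txt (one sentence per line), max_length=512,
# learning_rate=1e-4 and the per-device/accumulation split; only the
# checkpoint, 40,000 steps and effective batch size 256 are from the card.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Load a plain-text Icelandic corpus (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "icelandic.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% masking for masked language modelling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlmr-base-macocu-is",
    max_steps=40_000,               # 40,000 steps, as stated above
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,  # 32 * 8 = effective batch size of 256
    learning_rate=1e-4,
    save_steps=5_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```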

The training and fine-tuning procedures are described in detail in our [GitHub repo](https://github.com/macocu/LanguageModels).

## Warning

This model has not been fully trained, because it was intended as the base for the [Bicleaner AI Icelandic model](https://huggingface.co/bitextor/bicleaner-ai-full-en-is). If you need better performance, please use [XLMR-MaCoCu-is](https://huggingface.co/MaCoCu/XLMR-MaCoCu-is).

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")
model = AutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")    # PyTorch
model = TFAutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")  # TensorFlow
```
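
As a quick usage example, the PyTorch model can be used as a feature extractor; the Icelandic sentence and the mean pooling below are just an illustration, not a prescribed recipe:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")
model = AutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")

# Encode one Icelandic sentence and mean-pool the last hidden states
# into a single sentence vector (the example sentence is illustrative).
inputs = tokenizer("Reykjavík er höfuðborg Íslands.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```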

# Data

For training, we used all Icelandic data present in the monolingual Icelandic [MaCoCu](https://macocu.eu/) corpus. After de-duplicating the data, we were left with a total of 4.4 GB of text, which equals 688M tokens.
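
The de-duplication itself is part of the MaCoCu data pipeline; purely as a hypothetical illustration of exact line-level de-duplication (the file names and hashing choice are assumptions, not the project's actual tooling), it could look like this:

```python
import hashlib

# Hypothetical exact line-level de-duplication sketch; the real MaCoCu
# pipeline and its settings are documented in the project repositories.
seen = set()
with open("is.raw.txt", encoding="utf-8") as src, \
     open("is.dedup.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # Hash the stripped line so the set stores 16-byte digests
        # instead of full sentences.
        key = hashlib.md5(line.strip().encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            dst.write(line)
```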

# Acknowledgements

The authors received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta and
      Espl{\`a}-Gomis, Miquel and
      Forcada, Mikel L. and
      Garc{\'\i}a-Romero, Cristian and
      Kuzman, Taja and
      Ljube{\v{s}}i{\'c}, Nikola and
      van Noord, Rik and
      Sempere, Leopoldo Pla and
      Ram{\'\i}rez-S{\'a}nchez, Gema and
      Rupnik, Peter and
      Suchomel, V{\'\i}t and
      Toral, Antonio and
      van der Werff, Tobias and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```