File size: 2,916 Bytes
4abb43d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
---
license: cc0-1.0
language:
- is
tags:
- MaCoCu
---

# Model description

**XLMR-base-MaCoCu-is** is a large pre-trained language model trained on **Icelandic** texts. It was created by continuing training from the [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) model. It was developed as part of the [MaCoCu](https://macocu.eu/) project and only uses data that was crawled during the project. The main developer is [Jaume Zaragoza-Bernabeu](https://github.com/ZJaume) from Prompsit Language Engineering.

XLMR-base-MaCoCu-is was trained on 4.4GB of Icelandic text, which is equal to 688M tokens. It was trained for 40,000 steps with a batch size of 256. It uses the same vocabulary as the original XLMR-base model.

The training and fine-tuning procedures are described in detail on our [Github repo](https://github.com/macocu/LanguageModels).

## Warning
This model has not been fully trained because it was intended for use as base of [Bicleaner AI Icelandic model](https://huggingface.co/bitextor/bicleaner-ai-full-en-is). If you need better performance, please use [XLMR-MaCoCu-is](https://huggingface.co/MaCoCu/XLMR-MaCoCu-is).

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")
model = AutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is") # PyTorch
model = TFAutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is") # Tensorflow
```

# Data

For training, we used all Icelandic data that was present in the monolingual Icelandic [MaCoCu](https://macocu.eu/) corpus. After de-duplicating the data, we were left with a total of 4.4 GB of text, which equals 688M tokens.

# Acknowledgements

The authors received funding from the European Union’s Connecting Europe Facility 2014-
2020 - CEF Telecom, under Grant Agreement No.INEA/CEF/ICT/A2020/2278341 (MaCoCu).

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Garc{\'\i}a-Romero, Cristian  and
      Kuzman, Taja  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      van Noord, Rik  and
      Sempere, Leopoldo Pla  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Rupnik, Peter  and
      Suchomel, V{\'\i}t  and
      Toral, Antonio  and
      van der Werff, Tobias  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```