Add model card
README.md
ADDED
---
license: cc0-1.0
language:
- is
tags:
- MaCoCu
---

# Model description

**XLMR-base-MaCoCu-is** is a large pre-trained language model trained on **Icelandic** texts. It was created by continuing training from the [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) model. It was developed as part of the [MaCoCu](https://macocu.eu/) project and only uses data that was crawled during the project. The main developer is [Jaume Zaragoza-Bernabeu](https://github.com/ZJaume) from Prompsit Language Engineering.

XLMR-base-MaCoCu-is was trained on 4.4 GB of Icelandic text, which amounts to 688M tokens. It was trained for 40,000 steps with a batch size of 256. It uses the same vocabulary as the original XLM-RoBERTa-base model.
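
The exact training procedure is documented in the GitHub repo linked below; purely as an illustrative sketch, continued masked-language-model pretraining of this kind can be set up with the `transformers` `Trainer` roughly as follows. The data file name, learning rate, per-device batch size, and save interval are placeholders rather than the MaCoCu settings; only the 40,000 steps and the effective batch size of 256 come from the description above.

```python
# Illustrative sketch only: continued MLM pretraining from XLM-RoBERTa-base.
# File name and most hyperparameters are placeholders, not the MaCoCu settings.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the original checkpoint; the vocabulary is kept unchanged.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Hypothetical local file with the de-duplicated Icelandic text, one document per line.
dataset = load_dataset("text", data_files={"train": "icelandic.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking for masked language modelling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlmr-base-macocu-is",
    max_steps=40_000,                    # step count stated above
    per_device_train_batch_size=32,      # 32 x 8 accumulation ~ effective batch size 256
    gradient_accumulation_steps=8,
    learning_rate=1e-4,                  # placeholder value
    save_steps=5_000,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```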

The training and fine-tuning procedures are described in detail on our [GitHub repo](https://github.com/macocu/LanguageModels).

## Warning

This model has not been fully trained, because it was intended to serve as the base of the [Bicleaner AI Icelandic model](https://huggingface.co/bitextor/bicleaner-ai-full-en-is). If you need better performance, please use [XLMR-MaCoCu-is](https://huggingface.co/MaCoCu/XLMR-MaCoCu-is) instead.

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")
model = AutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")  # PyTorch
model = TFAutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")  # TensorFlow
```
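
As a quick sanity check, the loaded model can be used to produce contextual embeddings. The snippet below is only an illustration; the example sentence and the mean pooling are arbitrary choices, not part of the model card.

```python
# Illustration only: encode an Icelandic sentence and mean-pool the last hidden states.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")
model = AutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")

inputs = tokenizer("Halló heimur!", return_tensors="pt")  # "Hello world!"
with torch.no_grad():
    outputs = model(**inputs)

# Averaging over the token dimension gives a simple fixed-size sentence vector.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```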

# Data

For training, we used all Icelandic data that was present in the monolingual Icelandic [MaCoCu](https://macocu.eu/) corpus. After de-duplicating the data, we were left with a total of 4.4 GB of text, which equals 688M tokens.
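
As a loose illustration of the de-duplication step (not the actual MaCoCu pipeline), exact line-level de-duplication of a plain-text corpus could look like the sketch below; the file names are hypothetical.

```python
# Illustration only, not the actual MaCoCu pipeline: exact line-level
# de-duplication of a plain-text corpus. File names are hypothetical.
import hashlib

seen = set()
with open("icelandic_raw.txt", encoding="utf-8") as src, \
     open("icelandic_dedup.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # Hash a normalised form of the line to keep memory use bounded.
        digest = hashlib.sha1(line.strip().lower().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            dst.write(line)
```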

# Acknowledgements

The authors received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta and
      Espl{\`a}-Gomis, Miquel and
      Forcada, Mikel L. and
      Garc{\'\i}a-Romero, Cristian and
      Kuzman, Taja and
      Ljube{\v{s}}i{\'c}, Nikola and
      van Noord, Rik and
      Sempere, Leopoldo Pla and
      Ram{\'\i}rez-S{\'a}nchez, Gema and
      Rupnik, Peter and
      Suchomel, V{\'\i}t and
      Toral, Antonio and
      van der Werff, Tobias and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```