---
license: mit
widget:
- text: "Jens Peter Hansen kommer fra Danmark"
---

# XLM-RoBERTa-based language detection model (modern and medieval)

This model is a fine-tuned version of xlm-roberta-base on the [monasterium.net](https://www.icar-us.eu/en/cooperation/online-portals/monasterium-net/) dataset.

## Model description
On top of this XLM-RoBERTa transformer model is a classification head. For additional information, please refer to the [XLM-RoBERTa (base-sized model)](https://huggingface.co/xlm-roberta-base) card or the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al.

## Intended uses & limitations
You can use this model directly as a language detector, i.e. for sequence classification tasks. It currently supports the following 41 modern and medieval languages:

Modern: Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Irish (ga), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Russian (ru), Turkish (tr), Basque (eu), Catalan (ca), Albanian (sq), Serbian (se), Ukrainian (uk), Norwegian (no), Arabic (ar), Chinese (zh), Hebrew (he)

Medieval: Middle High German (mhd), Latin (la), Middle Low German (gml), Old French (fro), Old Church Slavonic (chu), Early New High German (fnhd)
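As a usage sketch (not an official snippet from the authors), the model can be called through the `pipeline` API; the checkpoint name is taken from the citation URL at the bottom of this card, and downloading it requires network access:

```python
# Sketch: load the checkpoint as a language detector via the
# text-classification pipeline.

# Label codes exactly as listed above: 35 modern + 6 medieval = 41 labels.
MODERN = [
    "bg", "hr", "cs", "da", "nl", "en", "et", "fi", "fr", "de", "el", "hu",
    "ga", "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
    "ru", "tr", "eu", "ca", "sq", "se", "uk", "no", "ar", "zh", "he",
]
MEDIEVAL = ["mhd", "la", "gml", "fro", "chu", "fnhd"]

if __name__ == "__main__":
    from transformers import pipeline

    detector = pipeline("text-classification", model="ERCDiDip/40_langdetect_v01")
    print(detector("Jens Peter Hansen kommer fra Danmark"))
```

For the widget sentence above, the top label should be Danish (`da`).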

## Training and evaluation data
The model was fine-tuned on the Monasterium and Wikipedia datasets, which consist of text sequences in 41 languages. The training set contains 80k samples, while the validation and test sets contain 16k samples each. The average accuracy on the test set is 99.59% (this matches the macro- and weighted-average F1-scores, since the test set is perfectly balanced). A more detailed evaluation is provided in the table below.

## Training procedure
Fine-tuning was performed via the Trainer API, using a custom WeightedLossTrainer.
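The card does not publish the WeightedLossTrainer code; the following is a minimal sketch of one common implementation, subclassing `Trainer` and overriding `compute_loss` with a class-weighted cross-entropy (the `class_weights` argument is an assumption, not a confirmed detail of the authors' script):

```python
# Hypothetical sketch of a weighted-loss Trainer subclass.
import torch
from torch import nn
from transformers import Trainer


class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Optional 1-D tensor with one weight per language label.
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        weight = (
            self.class_weights.to(logits.device)
            if self.class_weights is not None
            else None
        )
        # Cross-entropy weighted per class; with weight=None this reduces
        # to the default Trainer loss for sequence classification.
        loss_fct = nn.CrossEntropyLoss(weight=weight)
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```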

## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 20
- eval_batch_size: 20
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

- mixed_precision_training: Native AMP
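As a configuration sketch (not the authors' actual training script), the hyperparameters above map onto `TrainingArguments` in the Transformers 4.24 API; `output_dir` is a placeholder, and the Adam betas and epsilon listed are the `Trainer` defaults, so they need no explicit arguments:

```python
# Config sketch only; output_dir is a hypothetical placeholder.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="langdetect-finetune",
    learning_rate=2e-5,
    per_device_train_batch_size=20,
    per_device_eval_batch_size=20,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,  # Native AMP mixed precision (requires a CUDA device)
)
```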

## Training results

| Training Loss | Validation Loss | F1       |
| ------------- | --------------- | -------- |
| 0.000300      | 0.048985        | 0.991585 |
| 0.000100      | 0.033340        | 0.994663 |
| 0.000000      | 0.032938        | 0.995979 |

## Framework versions
- Transformers 4.24.0
- PyTorch 1.8.0
- Datasets 2.6.1
- Tokenizers 0.13.3

## Citation
Please cite the following entry when using this model.

```bibtex
@misc{ercdidip2022,
  title     = {40 langdetect v01 (Revision 9fab42a)},
  author    = {Kovács, Tamás and Atzenhofer-Baumgartner, Florian and Aoun, Sandy and Nicolaou, Anguelos and Luger, Daniel and Decker, Franziska and Lamminger, Florian and Vogeler, Georg},
  year      = {2022},
  url       = {https://huggingface.co/ERCDiDip/40_langdetect_v01},
  doi       = {10.57967/hf/0099},
  publisher = {Hugging Face}
}
```