ReaderBench committed on
Commit 537489f
1 Parent(s): cea027b

Update README

Files changed (1):
  1. README.md +96 -10
README.md CHANGED

Removed lines:

- Model card for RoBERT-base
- # RoBERT-base
- ## BERT base model for Romanian
- TBC (placeholders under "How to use", "Training data", "Training procedure", and "Eval results")
- ## Training procedure
- ## Eval results
- booktitle={Proceedings of the 28th International Conference on Computational Linguistics},

Updated README.md:

---
language:
- ro
---

Model card for RoBERT-base

# RoBERT-base

## Pretrained BERT model for Romanian

Model pretrained on Romanian using the masked language modeling (MLM) and next sentence prediction (NSP) objectives.
It was introduced in this [paper](https://www.blank.org/). Three BERT models were released: RoBERT-small, **RoBERT-base** and RoBERT-large, all of them uncased.

| Model          | Weights | Layers (L) | Hidden size (H) | Attention heads (A) | MLM accuracy | NSP accuracy |
|----------------|:-------:|:----------:|:---------------:|:-------------------:|:------------:|:------------:|
| RoBERT-small   | 19M     | 12         | 256             | 8                   | 0.5363       | 0.9687       |
| *RoBERT-base*  | *114M*  | *12*       | *768*           | *12*                | *0.6511*     | *0.9802*     |
| RoBERT-large   | 341M    | 24         | 1024            | 24                  | 0.6929       | 0.9843       |

All models are available:

* [RoBERT-small](https://huggingface.co/readerbench/RoBERT-small)
* [RoBERT-base](https://huggingface.co/readerbench/RoBERT-base)
* [RoBERT-large](https://huggingface.co/readerbench/RoBERT-large)
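
Since the checkpoints were pretrained with an MLM objective, one quick sanity check is to query the language-modeling head through the `fill-mask` pipeline. This is only a sketch, not part of the original card: it assumes the uploaded weights include the pretraining head, and the Romanian prompt and `top_k` value are arbitrary choices.

```python
# Illustrative sketch: probe the pretrained MLM head with the fill-mask pipeline.
# Prompt means "Bucharest is the [MASK] of Romania." and is an arbitrary example.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="readerbench/RoBERT-base")
for prediction in fill_mask("București este [MASK] României.", top_k=5):
    print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")
```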

#### How to use

```python
# TensorFlow
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = TFAutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="tf")  # "example sentence"
outputs = model(inputs)

# PyTorch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")  # "example sentence"
outputs = model(**inputs)
```
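
The snippets above return contextual token embeddings rather than task predictions. As a rough, unofficial illustration of how these outputs are often reduced to a single sentence vector, one can mean-pool the last hidden states over non-padding tokens (PyTorch branch, reusing `model` and `inputs` from the block above):

```python
# Rough sketch (PyTorch): mean-pool token embeddings into one sentence vector,
# ignoring padding positions. RoBERT-base has a hidden size of 768.
import torch

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
sentence_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                         # torch.Size([1, 768])
```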

## Training data

The model was trained on the following compilation of corpora; the statistics below are reported after the cleaning process.

| Corpus    | Words     | Sentences | Size (GB) |
|-----------|:---------:|:---------:|:---------:|
| Oscar     | 1.78B     | 87M       | 10.8      |
| RoTex     | 240M      | 14M       | 1.5       |
| RoWiki    | 50M       | 2M        | 0.3       |
| **Total** | **2.07B** | **103M**  | **12.6**  |

## Downstream performance

### Sentiment analysis

We report the macro-averaged F1 score (in %).

| Model             | Dev       | Test      |
|-------------------|:---------:|:---------:|
| multilingual-BERT | 68.96     | 69.57     |
| XLM-R-base        | 71.26     | 71.71     |
| BERT-base-ro      | 70.49     | 71.02     |
| RoBERT-small      | 66.32     | 66.37     |
| *RoBERT-base*     | *70.89*   | *71.61*   |
| RoBERT-large      | **72.48** | **72.11** |
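
The fine-tuning code and datasets behind these numbers are not included in this card. Purely as a hypothetical sketch of how a sentiment classifier could be wired on top of RoBERT-base with the `Trainer` API, with a toy in-memory dataset, label count, and hyperparameters that are placeholders rather than the setup from the paper:

```python
# Hypothetical sketch: sequence-classification fine-tuning of RoBERT-base.
# The two toy examples, num_labels=2 and all hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "readerbench/RoBERT-base", num_labels=2)

# Toy data: "an excellent movie" (positive) / "a very weak movie" (negative).
data = Dataset.from_dict({
    "text": ["un film excelent", "un film foarte slab"],
    "label": [1, 0],
})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length", max_length=64),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="robert-base-sentiment",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```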

### Moldavian vs. Romanian dialect identification and cross-dialect topic identification

We report results on the [VarDial 2019](https://sites.google.com/view/vardial2019/campaign) Moldavian vs. Romanian dialect and cross-dialect topic identification tasks, as macro-averaged F1 scores (in %).

| Model             | Dialect classification | MD to RO  | RO to MD  |
|-------------------|:----------------------:|:---------:|:---------:|
| 2-CNN + SVM       | 93.40                  | 65.09     | 75.21     |
| Char+Word SVM     | 96.20                  | 69.08     | 81.93     |
| BiGRU             | 93.30                  | **70.10** | 80.30     |
| multilingual-BERT | 95.34                  | 68.76     | 78.24     |
| XLM-R-base        | 96.28                  | 69.93     | 82.28     |
| BERT-base-ro      | 96.20                  | 69.93     | 78.79     |
| RoBERT-small      | 95.67                  | 69.01     | 80.40     |
| *RoBERT-base*     | *97.39*                | *68.30*   | *81.09*   |
| RoBERT-large      | **97.78**              | 69.91     | **83.65** |

### Diacritics restoration

The challenge can be found [here](https://diacritics-challenge.speed.pub.ro/). We report results on the official test set, as accuracy (in %).

| Model                       | Word level | Character level |
|-----------------------------|:----------:|:---------------:|
| BiLSTM                      | 99.42      | -               |
| CharCNN                     | 98.40      | 99.65           |
| CharCNN + multilingual-BERT | 99.72      | 99.94           |
| CharCNN + XLM-R-base        | 99.76      | **99.95**       |
| CharCNN + BERT-base-ro      | **99.79**  | **99.95**       |
| CharCNN + RoBERT-small      | 99.73      | 99.94           |
| *CharCNN + RoBERT-base*     | *99.78*    | **99.95**       |
| CharCNN + RoBERT-large      | 99.76      | **99.95**       |

### BibTeX entry and citation info

```
@inproceedings{RoBERT,
  title={RoBERT – A Romanian BERT Model},
  author={Masala, Mihai and Ruseti, Stefan and Dascalu, Mihai},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics (COLING)},
  year={2020}
}
```