Update README.md
README.md
CHANGED
@@ -4,11 +4,23 @@
|
|
4 |
{}
|
5 |
---
|
6 |
|
7 |
-
#
|
8 |
|
9 |
<!-- Provide a quick summary of what the model is/does. -->
|
10 |
GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM.
|
11 |
-
The models range in size from 122M to 750M parameters.
|
12 |
|
13 |
| Domain | Dataset | Data Size | #Docs | #Tokens |
|
14 |
| -------- | ----------- | --------- | ------ | ------- |
|
@@ -29,7 +41,32 @@ The models range in size from 122M to 750M parameters. The pre-training dataset
|
|
29 |
| - | Total | 167GB | 116,079,769 | 35.8B |
|
30 |
|
31 |
|
32 |
|
|
|
33 |
|
|
|
34 |
|
|
|
35 |
|
|
|
|
4 |
{}
|
5 |
---
|
6 |
|
7 |
+
# GeBERTa
|
8 |
|
9 |
<!-- Provide a quick summary of what the model is/does. -->
|
10 |
GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM.
|
11 |
+
The models range in size from 122M to 750M parameters.
|
12 |
+
|
13 |
+
|
14 |
+
## Model details
|
15 |
+
|
16 |
+
The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and large models use a 50k-token vocabulary,
|
17 |
+
while the xlarge model uses a 128k-token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps
|
18 |
+
and have a maximum sequence length of 512 tokens.
|
19 |
+
|
20 |
+
|
21 |
+
## Dataset
|
22 |
+
|
23 |
+
The pre-training dataset consists of documents from different domains:
|
24 |
|
25 |
| Domain | Dataset | Data Size | #Docs | #Tokens |
|
26 |
| -------- | ----------- | --------- | ------ | ------- |
|
|
|
41 |
| - | Total | 167GB | 116,079,769 | 35.8B |
|
42 |
|
43 |
|
44 |
+
## Benchmark
|
45 |
+
|
46 |
+
In a comprehensive benchmark, we evaluated existing German models and our own. The benchmark included a variety of task types, such as question answering,
|
47 |
+
classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection, using two existing datasets.
|
48 |
+
When the datasets provided training, development, and test sets, we used them accordingly.
|
49 |
+
|
50 |
+
|
51 |
+
|
52 |
+
Where such sets were not available, we randomly split the data into 80% for training, 10% for validation, and 10% for testing.
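A minimal sketch of an 80/10/10 random split, for readers who want to reproduce the setup on a dataset without predefined splits. The function name `split_80_10_10` and the seed value are hypothetical; the README does not state which seed or tooling was actually used.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle a copy of `examples` and cut it into train/validation/test parts.

    The seed is an assumption for reproducibility, not a value from the README.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)

    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_80_10_10(range(1000))
    print(len(train), len(val), len(test))
```

Any remainder from the integer division ends up in the test part, so every example is assigned to exactly one split.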
|
53 |
+
The following table presents the F1 scores:
|
54 |
+
|
55 |
+
|
56 |
+
|
57 |
+
| Model | [GE14](https://huggingface.co/datasets/germeval_14) | [GQuAD](https://huggingface.co/datasets/deepset/germanquad) | [GE18](https://huggingface.co/datasets/philschmid/germeval18) | TS | [GGP](https://github.com/JULIELab/GGPOnc) | GRAS<sup>1</sup> | [JS](https://github.com/JULIELab/jsyncc) | [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) | Avg |
|
58 |
+
|:---------------------:|:--------:|:----------:|:--------:|:--------:|:-------:|:------:|:--------:|:------:|:------:|
|
59 |
+
| gbert-base | 87.10±0.12 | 72.19±0.82 | 51.27±1.4 | 72.34±0.48 | 78.17±0.25 | 62.90±0.01 | 77.18±3.34 | 88.03±0.20 | 73.65±0.50 |
|
60 |
+
| gelectra-base | 86.19±0.5 | 74.09±0.70 | 48.02±1.80 | 70.62±0.44 | 77.53±0.11 | 65.97±0.01 | 71.17±2.94 | 88.06±0.37 | 72.71±0.66 |
|
61 |
+
| gottbert | 87.15±0.19 | 72.76±0.378 | 51.12±1.20 | 74.25±0.80 | **78.18**±0.11 | 65.71±0.01 | 74.60±4.75 | 88.61±0.23 | 74.05±0.51 |
|
62 |
+
| geberta-base | **88.06**±0.22 | **78.54**±0.32 | **53.16**±1.39 | **74.83**±0.36 | 78.13±0.15 | **68.37**±1.11 | **81.85**±5.23 | **89.14**±0.32 | **76.51**±0.32 |
|
63 |
+
|
64 |
+
<sup>1</sup>Not yet published, but described in the [MedBERT.de paper](https://arxiv.org/abs/2303.08179).
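The Avg column is numerically consistent with an unweighted mean of the eight per-task F1 scores. This is an observation drawn from the table values, not something the README states, so the sketch below is only a sanity check. The dictionary names are hypothetical, and standard deviations are left out.

```python
from statistics import mean

# Mean F1 per task, copied from the table above (standard deviations omitted).
# Column order: GE14, GQuAD, GE18, TS, GGP, GRAS, JS, DROC.
task_f1 = {
    "gbert-base":    [87.10, 72.19, 51.27, 72.34, 78.17, 62.90, 77.18, 88.03],
    "gelectra-base": [86.19, 74.09, 48.02, 70.62, 77.53, 65.97, 71.17, 88.06],
    "gottbert":      [87.15, 72.76, 51.12, 74.25, 78.18, 65.71, 74.60, 88.61],
    "geberta-base":  [88.06, 78.54, 53.16, 74.83, 78.13, 68.37, 81.85, 89.14],
}

# Values of the Avg column as reported in the table.
reported_avg = {
    "gbert-base": 73.65,
    "gelectra-base": 72.71,
    "gottbert": 74.05,
    "geberta-base": 76.51,
}


def macro_average(scores):
    """Unweighted mean over tasks (assumed to be how Avg was derived)."""
    return mean(scores)


if __name__ == "__main__":
    for model, scores in task_f1.items():
        computed = macro_average(scores)
        print(f"{model}: computed {computed:.4f}, reported {reported_avg[model]:.2f}")
```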
|
65 |
|
66 |
+
## Publication
|
67 |
|
68 |
+
A publication will follow soon.
|
69 |
|
70 |
+
## Contact
|
71 |
|
72 |