---
{}
---

# GeBERTa

GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM.
The models range in size from 122M to 750M parameters.

## Model details

The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and xlarge models use a 50k-token vocabulary,
while the large model uses a 128k-token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps
and have a maximum sequence length of 512 tokens.

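
The training figures above imply an upper bound on how many tokens a model could have processed. The sketch below works through that arithmetic. It assumes that "2k" means 2048 sequences per step and that every sequence is filled to 512 tokens, neither of which is stated in this card. It also compares against the 35.8B-token corpus total from the Dataset section, which may have been counted with a different tokenization. The function and constant names are illustrative only.

```python
# Rough upper bound on tokens processed during pre-training.
BATCH_SIZE = 2048        # assumption: "2k" read as 2048 sequences per step
MAX_STEPS = 1_000_000    # "a maximum of 1 million steps"
MAX_SEQ_LEN = 512        # maximum sequence length in tokens
CORPUS_TOKENS = 35.8e9   # corpus total from the Dataset section


def max_tokens_processed(batch_size: int, steps: int, seq_len: int) -> int:
    """Tokens seen if every step uses a full batch of full-length sequences."""
    return batch_size * steps * seq_len


upper_bound = max_tokens_processed(BATCH_SIZE, MAX_STEPS, MAX_SEQ_LEN)
max_passes = upper_bound / CORPUS_TOKENS

print(f"upper bound: {upper_bound:.3e} tokens")
print(f"at most about {max_passes:.1f} passes over the corpus")
```

Because "maximum of 1 million steps" is a ceiling and real batches contain shorter sequences, the actual number of tokens processed is likely lower than this bound.
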
## Dataset

The pre-training dataset consists of documents from different domains:

| Domain | Dataset | Data Size | #Docs | #Tokens |
| -------- | ----------- | --------- | ------ | ------- |
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts (translated) | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III (translated) | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS (translated) | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books (translated) | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |

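
To see how the corpus is balanced across domains, the token counts in the table can be summed per domain. This is a small sketch; the token figures are copied from the table with "B" and "M" expanded, and the names `ROWS` and `tokens_by_domain` are illustrative and not part of any released code.

```python
from collections import defaultdict

# (domain, dataset, tokens) copied from the dataset table.
ROWS = [
    ("Formal", "Wikipedia", 1.9e9),
    ("Formal", "News", 6.1e9),
    ("Formal", "GC4", 19.4e9),
    ("Informal", "Reddit 2019-2023 (GER)", 1.3e9),
    ("Informal", "Holiday Reviews", 428e6),
    ("Legal", "OpenLegalData: German cases and laws", 1e9),
    ("Medical", "Smaller public datasets", 50e6),
    ("Medical", "CC medical texts", 682e6),
    ("Medical", "Medicine Dissertations", 295e6),
    ("Medical", "Pubmed abstracts (translated)", 1.7e9),
    ("Medical", "MIMIC III (translated)", 695e6),
    ("Medical", "PMC-Patients-ReCDS (translated)", 414e6),
    ("Literature", "German Fiction", 243e6),
    ("Literature", "English books (translated)", 1.6e9),
]


def tokens_by_domain(rows):
    """Sum token counts per domain."""
    totals = defaultdict(float)
    for domain, _dataset, tokens in rows:
        totals[domain] += tokens
    return dict(totals)


totals = tokens_by_domain(ROWS)
grand_total = sum(totals.values())

# Print domains from largest to smallest share of tokens.
for domain, tokens in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{domain:<10} {tokens / 1e9:6.2f}B  {100 * tokens / grand_total:5.1f}%")
```

Summed this way, the formal sources (Wikipedia, News, GC4) account for roughly three quarters of all tokens, and the per-row figures add up to the reported 35.8B total within rounding.
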
## Benchmark

In a comprehensive benchmark, we evaluated existing German models alongside our own. The benchmark covers several task types, including question answering,
classification, and named entity recognition (NER). In addition, we introduced a new hate speech detection task built from two existing datasets.
Where a dataset provided training, development, and test sets, we used them as given.
Otherwise, we randomly split the data into 80% for training, 10% for validation, and 10% for testing.

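
A random 80/10/10 split of this kind could look like the following minimal sketch. The seed, the absence of stratification, and the function name `split_80_10_10` are assumptions made for illustration; the card does not say how the split was implemented.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle a copy of `examples` and cut it into train/validation/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no example is dropped
    return train, val, test


train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))
```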
The following table presents the F1 scores:

| Model | [GE14](https://huggingface.co/datasets/germeval_14) | [GQuAD](https://huggingface.co/datasets/deepset/germanquad) | [GE18](https://huggingface.co/datasets/philschmid/germeval18) | TS | [GGP](https://github.com/JULIELab/GGPOnc) | GRAS<sup>1</sup> | [JS](https://github.com/JULIELab/jsyncc) | [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) | Avg |
|:---------------------:|:--------:|:----------:|:--------:|:--------:|:-------:|:------:|:--------:|:------:|:------:|
| [GBERT](https://huggingface.co/deepset/gbert-base)<sub>base</sub> | 87.10±0.12 | 72.19±0.82 | 51.27±1.4 | 72.34±0.48 | 78.17±0.25 | 62.90±0.01 | 77.18±3.34 | 88.03±0.20 | 73.65±0.50 |
| [GELECTRA](https://huggingface.co/deepset/gelectra-base)<sub>base</sub> | 86.19±0.5 | 74.09±0.70 | 48.02±1.80 | 70.62±0.44 | 77.53±0.11 | 65.97±0.01 | 71.17±2.94 | 88.06±0.37 | 72.71±0.66 |
| [GottBERT](https://huggingface.co/uklfr/gottbert-base) | 87.15±0.19 | 72.76±0.378 | 51.12±1.20 | 74.25±0.80 | **78.18**±0.11 | 65.71±0.01 | 74.60±4.75 | 88.61±0.23 | 74.05±0.51 |
| GeBERTa<sub>base</sub> | **88.06**±0.22 | **78.54**±0.32 | **53.16**±1.39 | **74.83**±0.36 | 78.13±0.15 | **68.37**±1.11 | **81.85**±5.23 | **89.14**±0.32 | **76.51**±0.32 |

<sup>1</sup>Not yet published, but described in the [MedBERT.de paper](https://arxiv.org/abs/2303.08179).

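
The Avg column appears to be the unweighted mean of the eight task scores. That is an observation from the numbers rather than something the card states, and the sketch below checks it. Scores are copied from the table; the model keys and variable names are illustrative, and the ± values of the Avg column are not reproduced here.

```python
from statistics import mean

# F1 per task in table order: GE14, GQuAD, GE18, TS, GGP, GRAS, JS, DROC.
SCORES = {
    "GBERT-base":    [87.10, 72.19, 51.27, 72.34, 78.17, 62.90, 77.18, 88.03],
    "GELECTRA-base": [86.19, 74.09, 48.02, 70.62, 77.53, 65.97, 71.17, 88.06],
    "GottBERT":      [87.15, 72.76, 51.12, 74.25, 78.18, 65.71, 74.60, 88.61],
    "GeBERTa-base":  [88.06, 78.54, 53.16, 74.83, 78.13, 68.37, 81.85, 89.14],
}

# Values of the Avg column as reported in the table.
REPORTED_AVG = {
    "GBERT-base": 73.65,
    "GELECTRA-base": 72.71,
    "GottBERT": 74.05,
    "GeBERTa-base": 76.51,
}

averages = {model: mean(scores) for model, scores in SCORES.items()}

for model, avg in averages.items():
    diff = abs(avg - REPORTED_AVG[model])
    print(f"{model:<14} mean={avg:.4f} reported={REPORTED_AVG[model]:.2f} diff={diff:.4f}")
```

Each computed mean lies within 0.01 of the reported value, which is consistent with a plain average rounded to two decimals.
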
## Publication

A publication will follow soon.

## Contact

<[email protected]>