---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Model Card for GeBERTa

<!-- Provide a quick summary of what the model is/does. -->
GeBERTa is a set of German DeBERTa models developed jointly by the University of Florida, NVIDIA, and IKIM.
The models range in size from 122M to 750M parameters. The pre-training dataset consists of documents from the following domains:

| Category | Source Data | Data Size | #Docs | #Tokens |
| -------- | ----------- | --------- | ------ | ------- |
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | PubMed abstracts | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |
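
To make the domain mix easier to see, the token counts in the table can be aggregated per category. The following is a minimal sketch, not part of the original training pipeline: the names `ROWS`, `tokens_by_category`, and `category_shares` are hypothetical, and the figures are the rounded values copied from the table, so the resulting shares are approximate.

```python
from collections import defaultdict

# (category, source, tokens in billions), copied from the table above.
# Values are rounded as published, so sums are approximate.
ROWS = [
    ("Formal", "Wikipedia", 1.9),
    ("Formal", "News", 6.1),
    ("Formal", "GC4", 19.4),
    ("Informal", "Reddit 2019-2023 (GER)", 1.3),
    ("Informal", "Holiday Reviews", 0.428),
    ("Legal", "OpenLegalData: German cases and laws", 1.0),
    ("Medical", "Smaller public datasets", 0.050),
    ("Medical", "CC medical texts", 0.682),
    ("Medical", "Medicine Dissertations", 0.295),
    ("Medical", "PubMed abstracts", 1.7),
    ("Medical", "MIMIC III", 0.695),
    ("Medical", "PMC-Patients-ReCDS", 0.414),
    ("Literature", "German Fiction", 0.243),
    ("Literature", "English books", 1.6),
]


def tokens_by_category(rows):
    """Sum the token counts (in billions) for each category."""
    totals = defaultdict(float)
    for category, _source, tokens_b in rows:
        totals[category] += tokens_b
    return dict(totals)


def category_shares(rows):
    """Return each category's fraction of the total token count."""
    totals = tokens_by_category(rows)
    grand_total = sum(totals.values())
    return {category: t / grand_total for category, t in totals.items()}


if __name__ == "__main__":
    shares = category_shares(ROWS)
    # Print categories from largest to smallest share.
    for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{category:<10} {share:6.1%}")
```

With these rounded figures, the Formal category accounts for roughly three quarters of all tokens, and the per-row values sum to about 35.8B, in line with the Total row.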