
# Model Card for GeBERTa

GeBERTa is a family of German DeBERTa models developed jointly by the University of Florida, NVIDIA, and IKIM. The models range in size from 122M to 750M parameters. The pre-training dataset consists of documents from the following domains:

| Category   | Source Data                          | Data Size | #Docs       | #Tokens |
|------------|--------------------------------------|-----------|-------------|---------|
| Formal     | Wikipedia                            | 9GB       | 2,665,357   | 1.9B    |
| Formal     | News                                 | 28GB      | 12,305,326  | 6.1B    |
| Formal     | GC4                                  | 90GB      | 31,669,772  | 19.4B   |
| Informal   | Reddit 2019-2023 (GER)               | 5.8GB     | 15,036,592  | 1.3B    |
| Informal   | Holiday Reviews                      | 2GB       | 4,876,405   | 428M    |
| Legal      | OpenLegalData: German cases and laws | 5.4GB     | 308,228     | 1B      |
| Medical    | Smaller public datasets              | 253MB     | 179,776     | 50M     |
| Medical    | CC medical texts                     | 3.6GB     | 2,000,000   | 682M    |
| Medical    | Medicine Dissertations               | 1.4GB     | 14,496      | 295M    |
| Medical    | Pubmed abstracts                     | 8.5GB     | 21,044,382  | 1.7B    |
| Medical    | MIMIC III                            | 2.6GB     | 24,221,834  | 695M    |
| Medical    | PMC-Patients-ReCDS                   | 2.1GB     | 1,743,344   | 414M    |
| Literature | German Fiction                       | 1.1GB     | 3,219       | 243M    |
| Literature | English books                        | 7.1GB     | 11,038      | 1.6B    |
| -          | **Total**                            | 167GB     | 116,079,769 | 35.8B   |
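
Since GeBERTa follows the standard DeBERTa architecture, the checkpoints can be loaded with the Hugging Face `transformers` library. Below is a minimal loading sketch; the repository id `ikim-uk-essen/geberta-base` is an assumption based on the model name and may need to be adjusted to wherever the checkpoint is actually hosted.

```python
# Minimal loading sketch, not an official usage snippet.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "ikim-uk-essen/geberta-base"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Score the tokens of a German example sentence with the masked-LM head.
inputs = tokenizer("Der Patient wurde in die Klinik eingeliefert.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch size, sequence length, vocabulary size)
```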