Update README.md
README.md (changed)
@@ -85,13 +85,32 @@ We recommend fine-tuning this model to your curated data to maximally avoid unde

 ### Training Data

-
+We collect a diverse set of Dutch natural language.

-
+1. **OSCAR**
+The bulk of our data comes from the Dutch portion of [OSCAR](https://oscar-corpus.com), January 2023 version, based on Common Crawl. This dataset includes **93 GB** of text (~28.6B tokens).
+
+2. **Open Subtitles**
+We collected Dutch text from movie subtitles, focusing on unique movies either in Dutch or with Dutch subtitles. This dataset contains **5 GB** of text (~1.54B tokens) from **214k samples**.
+
+3. **Project Gutenberg**
+We downloaded **970 full Dutch books** from [Project Gutenberg](https://www.gutenberg.org) using a public scraper. The dataset includes **0.3 GB** of text (~92M tokens) and is available on [Hugging Face](https://huggingface.co/datasets/ChocoLlama/gutenberg-dutch).
+
+4. **Wikipedia**
+Using the March 2023 [Wikipedia dump](https://dumps.wikimedia.org), we included **2.5 GB** of text (~769M tokens). Despite some duplication with OSCAR, Wikipedia's high quality justifies its inclusion.
+
+5. **Job Descriptions (TechWolf)**
+A sample of **750k Dutch job descriptions** collected over five years from public websites, provided by TechWolf. This dataset contains **1.5 GB** of text (~462M tokens).
+
+6. **Staatsblad (Bizzy)**
+A sample of **80k legal filings** from [Het Belgisch Staatsblad](https://www.ejustice.just.fgov.be/cgi/welcome.pl). Documents were OCR-processed, and personal data was excluded. This dataset includes **1.4 GB** of text (~431M tokens), collected with help from Bizzy.
+
+7. **Legislation (ML6)**
+**15k documents** from Flemish legislation accessed via the [Open Data API](https://www.vlaanderen.be/vlaams-parlement/de-vlaamse-codex). This dataset contains **0.2 GB** of text (~62M tokens), collected with support from ML6.

 ### Training Procedure

-This model was fine-tuned using low-rank (LoRa) adapatation with trainable embeddings, for a total of
+This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 7.8% trainable parameters.

 #### Training Hyperparameters

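To make the training-procedure line above concrete, here is a minimal sketch of a LoRA setup with fully trainable embedding and output layers, using the Hugging Face `peft` library. The base checkpoint name and all hyperparameter values (rank, alpha, dropout, target modules) are illustrative assumptions rather than ChocoLlama's actual settings (those belong under the Training Hyperparameters section); only the idea of combining low-rank adapters with trainable embeddings is taken from the model card.

```python
# Minimal sketch: LoRA adapters plus fully trainable embeddings, via Hugging Face peft.
# All values below are illustrative placeholders, not the settings used for ChocoLlama.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base checkpoint is an assumption for illustration only.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                      # illustrative rank
    lora_alpha=32,             # illustrative scaling factor
    lora_dropout=0.05,         # illustrative dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumption)
    modules_to_save=["embed_tokens", "lm_head"],  # train the embedding and output layers in full
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```

Training the embedding and output layers in full alongside the adapters is what pushes the trainable fraction well beyond adapter-only LoRA; the model card reports 7.8% trainable parameters overall.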
@@ -142,7 +161,8 @@ On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art

 ### Qualitative evaluation

-
+In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable.
+For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).

 ### Compute Infrastructure

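Two of the resources referenced in this diff are hosted on the Hugging Face Hub: the Dutch Gutenberg books (ChocoLlama/gutenberg-dutch) and the ChocoLlama-Bench benchmark. A minimal sketch of pulling them with the `datasets` library follows; the split names and column layout are assumptions to verify on the dataset cards, as they are not documented here.

```python
# Minimal sketch: loading the Hub-hosted resources mentioned above with the `datasets` library.
# Dataset IDs come from the links in this README; split names are assumptions.
from datasets import load_dataset

gutenberg_dutch = load_dataset("ChocoLlama/gutenberg-dutch", split="train")
chocollama_bench = load_dataset("ChocoLlama/ChocoLlama-Bench", split="train")

print(gutenberg_dutch)
print(chocollama_bench)
```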