matthieumeeus97 committed (verified) · Commit 18968f9 · 1 Parent(s): d49a2c6

Update README.md

Files changed (1)
  1. README.md +24 -4
README.md CHANGED
@@ -85,13 +85,32 @@ We recommend fine-tuning this model to your curated data to maximally avoid unde

  ### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
+ We collect a diverse set of Dutch natural language data.
+
+ 1. **OSCAR**
+ The bulk of our data comes from the Dutch portion of [OSCAR](https://oscar-corpus.com), January 2023 version, based on Common Crawl. This dataset includes **93 GB** of text (~28.6B tokens).
+
+ 2. **Open Subtitles**
+ We collected Dutch text from movie subtitles, focusing on unique movies either in Dutch or with Dutch subtitles. This dataset contains **5 GB** of text (~1.54B tokens) from **214k samples**.
+
+ 3. **Project Gutenberg**
+ We downloaded **970 full Dutch books** from [Project Gutenberg](https://www.gutenberg.org) using a public scraper. The dataset includes **0.3 GB** of text (~92M tokens) and is available on [Hugging Face](https://huggingface.co/datasets/ChocoLlama/gutenberg-dutch).
+
+ 4. **Wikipedia**
+ Using the March 2023 [Wikipedia dump](https://dumps.wikimedia.org), we included **2.5 GB** of text (~769M tokens). Despite some duplication with OSCAR, Wikipedia's high quality justifies its inclusion.
+
+ 5. **Job Descriptions (TechWolf)**
+ A sample of **750k Dutch job descriptions** collected over five years from public websites, provided by TechWolf. This dataset contains **1.5 GB** of text (~462M tokens).
+
+ 6. **Staatsblad (Bizzy)**
+ A sample of **80k legal filings** from [Het Belgisch Staatsblad](https://www.ejustice.just.fgov.be/cgi/welcome.pl). Documents were OCR-processed, and personal data was excluded. This dataset includes **1.4 GB** of text (~431M tokens), collected with help from Bizzy.
+
+ 7. **Legislation (ML6)**
+ **15k documents** of Flemish legislation, accessed via the [Open Data API](https://www.vlaanderen.be/vlaams-parlement/de-vlaamse-codex). This dataset contains **0.2 GB** of text (~62M tokens), collected with support from ML6.

  ### Training Procedure

- This model was fine-tuned using low-rank (LoRa) adapatation with trainable embeddings, for a total of 4% trainable parameters.
+ This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 7.8% trainable parameters.

  #### Training Hyperparameters

@@ -142,7 +161,8 @@ On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art

  ### Qualitative evaluation

-
+ In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable.
+ For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).

  ### Compute Infrastructure
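
Taken together, the seven sources in the new Training Data section come to roughly 104 GB of Dutch text, or about 32B tokens. One of them, the Dutch Project Gutenberg corpus, is linked as a Hugging Face dataset; the minimal sketch below shows how it could be pulled locally with the `datasets` library (the split name and column layout are assumptions, not taken from the dataset card):

```python
# Minimal sketch: download and inspect the Dutch Project Gutenberg corpus
# linked in the Training Data section. The split name ("train") and the text
# column are assumptions; check the dataset card if they differ.
from datasets import load_dataset

gutenberg_nl = load_dataset("ChocoLlama/gutenberg-dutch", split="train")

print(gutenberg_nl)               # number of rows and column names
first_book = gutenberg_nl[0]      # a single example as a Python dict
print(list(first_book.keys()))    # confirm which field holds the raw text
```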
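
The updated Training Procedure line states that the model was fine-tuned with low-rank adaptation (LoRA) and trainable embeddings, for 7.8% trainable parameters. The exact configuration is not given in this diff, so the snippet below is only a hedged sketch of such a setup using the `peft` library; the base checkpoint, rank, alpha, and target modules are illustrative assumptions rather than the authors' values.

```python
# Hedged sketch of LoRA fine-tuning with trainable embeddings, NOT the authors'
# exact configuration: the base checkpoint, rank, alpha, and target modules
# below are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                                     # assumed LoRA rank
    lora_alpha=32,                                            # assumed scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    # Fully train the embedding and output layers alongside the adapters,
    # which is one way the trainable-parameter share can reach a few percent.
    modules_to_save=["embed_tokens", "lm_head"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports the trainable-parameter fraction
```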