Update README.md
README.md (changed)
@@ -85,13 +85,32 @@ We recommend fine-tuning this model to your curated data to maximally avoid unde

 ### Training Data

-
+We collect a diverse set of Dutch natural language.

-
+1. **OSCAR**
+The bulk of our data comes from the Dutch portion of [OSCAR](https://oscar-corpus.com), January 2023 version, based on Common Crawl. This dataset includes **93 GB** of text (~28.6B tokens).
+
+2. **Open Subtitles**
+We collected Dutch text from movie subtitles, focusing on unique movies either in Dutch or with Dutch subtitles. This dataset contains **5 GB** of text (~1.54B tokens) from **214k samples**.
+
+3. **Project Gutenberg**
+We downloaded **970 full Dutch books** from [Project Gutenberg](https://www.gutenberg.org) using a public scraper. The dataset includes **0.3 GB** of text (~92M tokens) and is available on [Hugging Face](https://huggingface.co/datasets/ChocoLlama/gutenberg-dutch).
+
+4. **Wikipedia**
+Using the March 2023 [Wikipedia dump](https://dumps.wikimedia.org), we included **2.5 GB** of text (~769M tokens). Despite some duplication with OSCAR, Wikipedia's high quality justifies its inclusion.
+
+5. **Job Descriptions (TechWolf)**
+A sample of **750k Dutch job descriptions** collected over five years from public websites, provided by TechWolf. This dataset contains **1.5 GB** of text (~462M tokens).
+
+6. **Staatsblad (Bizzy)**
+A sample of **80k legal filings** from [Het Belgisch Staatsblad](https://www.ejustice.just.fgov.be/cgi/welcome.pl). Documents were OCR-processed, and personal data was excluded. This dataset includes **1.4 GB** of text (~431M tokens), collected with help from Bizzy.
+
+7. **Legislation (ML6)**
+**15k documents** from Flemish legislation accessed via the [Open Data API](https://www.vlaanderen.be/vlaams-parlement/de-vlaamse-codex). This dataset contains **0.2 GB** of text (~62M tokens), collected with support from ML6.

 ### Training Procedure

-This model was fine-tuned using low-rank (LoRa) adapatation with trainable embeddings, for a total of
+This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 7.8% trainable parameters.

 #### Training Hyperparameters

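To make the training-procedure line above concrete, here is a minimal sketch of a LoRA setup with fully trainable embedding and output layers, using the Hugging Face `peft` library. The base checkpoint name and all hyperparameter values (rank, alpha, dropout, target modules) are illustrative assumptions rather than ChocoLlama's actual settings (those belong under the Training Hyperparameters section); only the idea of combining low-rank adapters with trainable embeddings is taken from the model card.

```python
# Minimal sketch: LoRA adapters plus fully trainable embeddings, via Hugging Face peft.
# All values below are illustrative placeholders, not the settings used for ChocoLlama.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base checkpoint is an assumption for illustration only.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                      # illustrative rank
    lora_alpha=32,             # illustrative scaling factor
    lora_dropout=0.05,         # illustrative dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumption)
    modules_to_save=["embed_tokens", "lm_head"],  # train the embedding and output layers in full
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```

Training the embedding and output layers in full alongside the adapters is what pushes the trainable fraction well beyond adapter-only LoRA; the model card reports 7.8% trainable parameters overall.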
@@ -142,7 +161,8 @@ On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art

 ### Qualitative evaluation

-
+In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable.
+For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).

 ### Compute Infrastructure

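Two of the resources referenced in this diff are hosted on the Hugging Face Hub: the Dutch Gutenberg books (ChocoLlama/gutenberg-dutch) and the ChocoLlama-Bench benchmark. A minimal sketch of pulling them with the `datasets` library follows; the split names and column layout are assumptions to verify on the dataset cards, as they are not documented here.

```python
# Minimal sketch: loading the Hub-hosted resources mentioned above with the `datasets` library.
# Dataset IDs come from the links in this README; split names are assumptions.
from datasets import load_dataset

gutenberg_dutch = load_dataset("ChocoLlama/gutenberg-dutch", split="train")
chocollama_bench = load_dataset("ChocoLlama/ChocoLlama-Bench", split="train")

print(gutenberg_dutch)
print(chocollama_bench)
```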