@@ -4,13 +4,146 @@ language:
 license: llama2
 ---
-## LLaMA-2-NL: Fine-tuned using LoRa and the original tokenizer
 ```
 from transformers import AutoModelForCausalLM, AutoTokenizer
-# take the original llama 2 tokenizer
-tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
-model = AutoModelForCausalLM.from_pretrained('llama-2-nl/Llama-2-7b-hf-lora-original')
-```

 license: llama2
 ---
+<p align="center" style="margin:0;padding:0">
+<img src="./images/chocollama_logo.png" alt="ChocoLlama logo" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
+</p>
+<div style="margin:auto; text-align:center">
+<h1 style="margin-bottom: 0">ChocoLlama</h1>
+<em>A Llama-2/3-based family of Dutch language models</em>
+</div>
+## Model Details
+ChocoLlama is a family of open LLMs specifically adapted to Dutch, advancing the state of the art among Dutch open LLMs in their weight class.
+We provide six variants (three base and three instruction-tuned models):
+- **ChocoLlama-2-7B-base**: A language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104GB (XXX tokens) using LoRA.
+- **ChocoLlama-2-7B-instruct**: An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
+- **ChocoLlama-2-7B-tokentrans-base**: A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
+- **ChocoLlama-2-7B-tokentrans-instruct**: An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
+- **Llama-3-ChocoLlama-8B-base**: A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
+- **Llama-3-ChocoLlama-8B-instruct**: An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
+As far as we are aware, Llama-3-ChocoLlama-8B-instruct sets a new state-of-the-art for Dutch open models in its weight class.
+### Model Description
+- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
+- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approximately 40K GPU hours (NVIDIA H100-80GB)
+- **Language(s):** Dutch
+- **License:** [Llama-2 Community License](https://ai.meta.com/llama/license/)
+- **Finetuned from model:** [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
+### Model Sources
+- **Repository:** Will be released soon.
+- **Paper:** Will be released soon.
+## Uses
+### Direct Use
+Since this is a base model, we do not recommend using it for your use-cases directly. We instead recommend:
+1. Fine-tuning this model to your specific use-case
+2. Leveraging the instruction-tuned version of this model
+### Downstream Use
+Since this model is a base model, it can easily be adapted to specific use-cases that require Dutch language understanding and generation. We expect this model to be particularly useful for use-cases in the domains that were explicitly covered in our dataset, e.g. the analysis and/or generation of:
+- Dutch job descriptions
+- Dutch corporate filings
+- Dutch legislation
+### Out-of-Scope Use
+- Use-cases requiring a chat-style interface: since this is a base model, it cannot be used reliably for turn-based chat interaction. Please refer to the instruction-tuned version of this model instead.
+- Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in other languages, so we expect significant catastrophic forgetting to have occurred for English, the language on which Llama-2 was primarily trained.
+## Bias, Risks, and Limitations
+We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
+However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.
+### Recommendations
+We recommend fine-tuning this model on your own curated data to reduce the risk of undesirable outputs.
+## How to Get Started with the Model
+Use the code below to get started with the model.
 ```
 from transformers import AutoModelForCausalLM, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
+model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
+```
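Once the model loads, a plain text completion is the natural first test for a base model. The helper below is a minimal sketch rather than an official recipe: the function name `complete_dutch`, the bf16 loading choice, and the greedy decoding settings are assumptions made for illustration.

```python
MODEL_ID = "ChocoLlama/ChocoLlama-2-7B-base"  # repository name from the snippet above


def complete_dutch(prompt: str, max_new_tokens: int = 50) -> str:
    """Continue a Dutch prompt with the base model (plain completion, no chat template)."""
    # Imported lazily so that defining this helper does not load the heavy libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Assumption: bf16 weights, chosen here to keep memory use manageable.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Greedy decoding keeps the output deterministic for a quick sanity check.
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete_dutch("De hoofdstad van België is")` downloads the weights on first use and returns the prompt followed by the model's continuation.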
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 4% trainable parameters.
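The 4% figure can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the public Llama-2-7B dimensions (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32000), the rank of 8 and the module list given under Training Hyperparameters, and that `embed_tokens` and `lm_head` are trained in full rather than through adapters. That last point is an interpretation of "trainable embeddings", not something the card states.

```python
# Llama-2-7B dimensions (assumed from the public model configuration).
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32000
RANK = 8  # LoRA rank listed in the hyperparameters


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """Parameters added by one LoRA adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Per decoder layer: four attention projections and three MLP projections.
attention = 4 * lora_params(HIDDEN, HIDDEN)
mlp = 2 * lora_params(HIDDEN, INTERMEDIATE) + lora_params(INTERMEDIATE, HIDDEN)
adapter_params = LAYERS * (attention + mlp)

# Assumption: embed_tokens and lm_head are trained in full, not through adapters.
embedding_params = 2 * VOCAB * HIDDEN

# All base-model weights: decoder layers (projections plus two norm vectors),
# the final norm, and the input/output embeddings.
layer_params = 4 * HIDDEN * HIDDEN + 3 * HIDDEN * INTERMEDIATE + 2 * HIDDEN
base_params = LAYERS * layer_params + HIDDEN + embedding_params

trainable = adapter_params + embedding_params
total = base_params + adapter_params
trainable_fraction = trainable / total
print(f"{trainable:,} trainable of {total:,} parameters ({trainable_fraction:.2%})")
```

Under these assumptions the trainable share comes out slightly above 4%, and almost all of it is the two embedding matrices; the adapters themselves account for roughly 20 million parameters.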
+#### Training Hyperparameters
+- **Training regime:** bf16 non-mixed precision
+- **Epochs:** 1
+- **LoRA parameters:**
+    - R: 8
+    - Alpha: 32
+    - Trainable modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
+    - LoRA dropout: 0.05
+- **Learning Rate:**
+    - Scheduler: StepLR
+    - Step size: 6212
+    - Learning rate: 0.0003
+    - Gamma: 0.85
+- **Other parameters:**
+    - Minibatch size: 16
+    - Gradient accumulation steps: 8
+    - Parallelization factor: 8
+    - Weight decay: 0
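To make the schedule and batch settings above concrete, the following sketch reproduces the StepLR rule (multiply the learning rate by gamma once every `step_size` scheduler steps) and derives an effective batch size. It assumes that the minibatch size is per device, that the parallelization factor is the number of data-parallel replicas, and that the scheduler is stepped once per optimizer update; none of these is confirmed by the card.

```python
def step_lr(step: int, base_lr: float = 3e-4, step_size: int = 6212, gamma: float = 0.85) -> float:
    """Learning rate after `step` scheduler steps under a StepLR-style decay."""
    return base_lr * gamma ** (step // step_size)


def effective_batch_size(minibatch: int = 16, grad_accum: int = 8, parallel: int = 8) -> int:
    """Sequences contributing to one optimizer update (assumes pure data parallelism)."""
    return minibatch * grad_accum * parallel


if __name__ == "__main__":
    for step in (0, 6211, 6212, 12424, 62120):
        print(f"step {step:>6}: lr = {step_lr(step):.6f}")
    print("effective batch size:", effective_batch_size())
```

With the listed values, the learning rate stays at 0.0003 for the first 6212 steps, drops to 0.000255 afterwards, and each optimizer update would see 16 × 8 × 8 = 1024 sequences under the stated assumptions.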
+## Evaluation
+### Quantitative evaluation
+We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
+| Model                                        | ARC            | HellaSwag      | MMLU           | TruthfulQA     | Avg.           |
+|----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
+| **Llama-3-ChocoLlama-8B-instruct**     | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
+| llama-3-8B-rebatch                           | 0.44           | 0.64           | 0.46           | 0.48           | 0.51           |
+| llama-3-8B-instruct                          | 0.47           | 0.59           | 0.47           | 0.52           | 0.51           |
+| llama-3-8B                                   | 0.44           | 0.64           | 0.47           | 0.45           | 0.50           |
+| Reynaerde-7B-Chat                            | 0.44           | 0.62           | 0.39           | 0.52           | 0.49           |
+| **Llama-3-ChocoLlama-8B-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
+| zephyr-7b-beta                               | 0.43           | 0.58           | 0.43           | 0.53           | 0.49           |
+| geitje-7b-ultra                              | 0.40           | 0.66           | 0.36           | 0.49           | 0.48           |
+| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
+| mistral-7b-v0.1                              | 0.43           | 0.58           | 0.37           | 0.45           | 0.46           |
+| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
+| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
+| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
+| llama-2-7b-chat-hf                           | 0.36           | 0.49           | 0.33           | 0.44           | 0.41           |
+| llama-2-7b-hf                                | 0.36           | 0.51           | 0.32           | 0.41           | 0.40           |
+On average, Llama-3-ChocoLlama-8B-instruct surpasses the previous state of the art on these benchmarks.
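The Avg. column appears to be the unweighted mean of the four benchmark scores. The short check below illustrates this for three rows copied from the table; the averaging rule itself is an assumption, and because the table shows rounded scores, a recomputed mean can differ from the reported one in the last digit for some rows.

```python
# (ARC, HellaSwag, MMLU, TruthfulQA) scores copied from the table above.
scores = {
    "llama-3-8B-instruct": (0.47, 0.59, 0.47, 0.52),
    "ChocoLlama-2-7B-base": (0.35, 0.56, 0.31, 0.43),
    "llama-2-7b-hf": (0.36, 0.51, 0.32, 0.41),
}


def mean_score(values: tuple) -> float:
    """Unweighted mean over the benchmark scores (assumed definition of Avg.)."""
    return sum(values) / len(values)


averages = {name: round(mean_score(vals), 2) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```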
+### Qualitative evaluation
+### Compute Infrastructure
+All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA H100 GPUs with 80 GB of VRAM.