anthonyrathe committed on
Commit
857ad3e
·
verified ·
1 Parent(s): 650a948

Update README.md

  1. README.md +138 -5
README.md CHANGED
@@ -4,13 +4,146 @@ language:
4
  license: llama2
5
  ---
6
 
7
- ## LLaMA-2-NL: Fine-tuned using LoRa and the original tokenizer
8
 
9
  ```
10
  from transformers import AutoModelForCausalLM, AutoTokenizer
11
 
12
- # take the original llama 2 tokenizer
13
- tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
14
 
15
- model = AutoModelForCausalLM.from_pretrained('llama-2-nl/Llama-2-7b-hf-lora-original')
16
- ```
 
4
  license: llama2
5
  ---
6
 
7
+ <p align="center" style="margin:0;padding:0">
8
+ <img src="./images/chocollama_logo.png" alt="ChocoLlama logo" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
9
+ </p>
10
+ <div style="margin:auto; text-align:center">
11
+ <h1 style="margin-bottom: 0">ChocoLlama</h1>
12
+ <em>A Llama-2/3-based family of Dutch language models</em>
13
+ </div>
14
+
15
+ ## Model Details
16
+
17
+ ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state of the art for open Dutch LLMs in their weight class.
18
+
19
+ We provide 6 variants (3 base models and 3 instruction-tuned models):
20
+ - **ChocoLlama-2-7B-base**: A language-adapted version of Meta's Llama-2-7b, fine-tuned on a 104GB Dutch dataset (XXX tokens) using LoRA.
21
+ - **ChocoLlama-2-7B-instruct**: An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
22
+ - **ChocoLlama-2-7B-tokentrans-base**: A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
23
+ - **ChocoLlama-2-7B-tokentrans-instruct**: An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
24
+ - **Llama-3-ChocoLlama-8B-base**: A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
25
+ - **Llama-3-ChocoLlama-8B-instruct**: An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
26
+
27
+
28
+ As far as we are aware, Llama-3-ChocoLlama-8B-instruct sets a new state-of-the-art for Dutch open models in its weight class.
29
+
30
+ ### Model Description
31
+
32
+ - **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
33
+ - **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA H100-80GB)
34
+ - **Language(s):** Dutch
35
+ - **License:** [Llama-2 Community License](https://ai.meta.com/llama/license/)
36
+ - **Finetuned from model:** [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
37
+
38
+ ### Model Sources
39
+
40
+ - **Repository:** Will be released soon.
41
+ - **Paper:** Will be released soon.
42
+
43
+ ## Uses
44
+
45
+ ### Direct Use
46
+
47
+ Since this is a base model, we do not recommend using it for your use-cases directly. We instead recommend:
48
+ 1. Fine-tuning this model to your specific use-case
49
+ 2. Leveraging the instruction-tuned version of this model
50
+
51
+ ### Downstream Use
52
+
53
+ Since this is a base model, it can easily be adapted to specific use-cases that require Dutch language understanding and generation. We expect this model to be particularly useful for use-cases in the domains explicitly covered in our dataset, e.g. the analysis and/or generation of:
54
+ - Dutch job descriptions
55
+ - Dutch corporate filings
56
+ - Dutch legislation
57
+
58
+
59
+ ### Out-of-Scope Use
60
+
61
+ - Use-cases requiring a chat-style interface: since this is a base model, it cannot be used reliably for turn-based chat interaction. Please refer to the instruction-tuned version of this model instead.
62
+ - Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occurred for English, which is the language Llama-2 was primarily trained on.
63
+
64
+ ## Bias, Risks, and Limitations
65
+
66
+ We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
67
+ However, we did not conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.
68
+
69
+ ### Recommendations
70
+
71
+ We recommend fine-tuning this model on your own curated data to minimize undesirable outputs.
72
+
73
+ ## How to Get Started with the Model
74
+
75
+ Use the code below to get started with the model.
76
 
77
  ```
78
  from transformers import AutoModelForCausalLM, AutoTokenizer
79
 
80
+ tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
81
+ model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
82
+ ```
83
+
84
+ ## Training Details
85
+
86
+ ### Training Data
87
+
88
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
89
+
90
+ [More Information Needed]
91
+
92
+ ### Training Procedure
93
+
94
+ This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 4% trainable parameters.
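
The 4% figure can be roughly reproduced with a back-of-the-envelope count. The sketch below assumes the published Llama-2-7B dimensions (hidden size 4096, MLP size 11008, 32 layers, vocabulary of 32,000 tokens), rank-8 adapters on the seven projection matrices listed under Training Hyperparameters, and fully trained `embed_tokens` and `lm_head` matrices. How the embeddings were actually made trainable is not stated in this card, so that last point is an assumption.

```python
# Rough count of trainable parameters for rank-8 LoRA on Llama-2-7B.
# Assumption: embed_tokens and lm_head are trained in full, not via adapters.
HIDDEN, MLP, LAYERS, VOCAB, RANK = 4096, 11008, 32, 32000, 8


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """A rank-r adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return rank * (d_in + d_out)


# (d_in, d_out) of each adapted projection inside one transformer block.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

lora_total = LAYERS * sum(lora_params(*shape) for shape in projections.values())
embedding_total = 2 * VOCAB * HIDDEN  # embed_tokens + lm_head

# Base model size: attention + MLP + two norms per block, final norm, embeddings.
per_layer = 4 * HIDDEN * HIDDEN + 3 * HIDDEN * MLP + 2 * HIDDEN
base_total = LAYERS * per_layer + HIDDEN + 2 * VOCAB * HIDDEN

trainable = lora_total + embedding_total
fraction = trainable / (base_total + lora_total)
print(f"{trainable:,} trainable of {base_total + lora_total:,} ({fraction:.1%})")
```

Under these assumptions the trainable share comes out slightly above 4%, with the two embedding matrices accounting for most of it and the adapters themselves for roughly 20 million parameters.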
95
+
96
+ #### Training Hyperparameters
97
+
98
+ - **Training regime:** bf16 non-mixed precision
99
+ - **Epochs:** 1
100
+ - **LoRA parameters:**
101
+ - R: 8
102
+ - Alpha: 32
103
+ - Trainable modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
104
+ - LoRA dropout: 0.05
105
+ - **Learning Rate:**
106
+ - Scheduler: StepLR
107
+ - Step size: 6212
108
+ - Learning rate: 0.0003
109
+ - Gamma: 0.85
110
+ - **Other parameters:**
111
+ - Minibatch size: 16
112
+ - Gradient accumulation steps: 8
113
+ - Parallelization factor: 8
114
+ - Weight decay: 0
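
To make the schedule concrete, here is a small sketch of what these numbers imply. It assumes the scheduler is stepped once per optimizer step, that the minibatch size is per device, and that the parallelization factor counts data-parallel workers. None of these three points is confirmed by this card.

```python
# Hyperparameters copied from the list above.
BASE_LR, GAMMA, STEP_SIZE = 3e-4, 0.85, 6212
MINIBATCH, GRAD_ACCUM, PARALLEL = 16, 8, 8


def lr_at(step: int) -> float:
    """StepLR rule: multiply the learning rate by GAMMA every STEP_SIZE steps."""
    return BASE_LR * GAMMA ** (step // STEP_SIZE)


# Assumption: per-device minibatch x accumulation steps x data-parallel workers.
effective_batch = MINIBATCH * GRAD_ACCUM * PARALLEL

for step in (0, STEP_SIZE, 2 * STEP_SIZE):
    print(step, f"{lr_at(step):.6f}")
print("sequences per optimizer step:", effective_batch)
```

The learning rate stays at 0.0003 for the first 6212 steps and then drops by 15% at each subsequent boundary.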
115
+
116
+
117
+ ## Evaluation
118
+
119
+ ### Quantitative evaluation
120
+
121
+ We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
122
+
123
+ | Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
124
+ |----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
125
+ | **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
126
+ | llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
127
+ | llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
128
+ | llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
129
+ | Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
130
+ | **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
131
+ | zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
132
+ | geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
133
+ | **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
134
+ | mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
135
+ | **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
136
+ | **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
137
+ | **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
138
+ | llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
139
+ | llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |
140
+
141
+ On average, Llama-3-ChocoLlama-instruct surpasses the previous state of the art on these benchmarks, scoring 0.53 against 0.51 for the next best models in the table.
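
The "Avg." column can be checked as the unweighted mean of the four benchmark scores. The sketch below does this for a handful of rows copied from the table; the choice of rows and the rounding tolerance are illustrative assumptions, not part of the original evaluation.

```python
from statistics import mean

# Scores copied from the table: (ARC, HellaSwag, MMLU, TruthfulQA, reported Avg.)
rows = {
    "Llama-3-ChocoLlama-instruct": (0.48, 0.66, 0.49, 0.49, 0.53),
    "llama-3-8B-instruct": (0.47, 0.59, 0.47, 0.52, 0.51),
    "Reynaerde-7B-Chat": (0.44, 0.62, 0.39, 0.52, 0.49),
    "geitje-7b-ultra": (0.40, 0.66, 0.36, 0.49, 0.48),
    "ChocoLlama-2-7B-base": (0.35, 0.56, 0.31, 0.43, 0.41),
    "llama-2-7b-hf": (0.36, 0.51, 0.32, 0.41, 0.40),
}

# Unweighted mean over the four benchmarks.
averages = {name: mean(scores[:4]) for name, scores in rows.items()}

for name, scores in rows.items():
    # Reported averages are rounded to two decimals, so allow half a unit of slack.
    assert abs(averages[name] - scores[4]) <= 0.005 + 1e-9, name

best = max(averages, key=averages.get)
print(best, f"{averages[best]:.2f}")
```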
142
+
143
+ ### Qualitative evaluation
144
+
145
+
146
+
147
+ ### Compute Infrastructure
148
 
149
+ All ChocoLlama models were trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA H100 GPUs with 80 GB of VRAM each.