---
language:
- nl
license: cc-by-nc-4.0
library_name: peft
tags:
- llama
- alpaca
- Transformers
- text-generation-inference
datasets:
- BramVanroy/alpaca-cleaned-dutch
inference: false
base_model: openlm-research/open_llama_7b
pipeline_tag: text-generation
---

# open_llama_7b_alpaca_clean_dutch_qlora

## Model description

This adapter model is a fine-tuned version of [openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b).
Finetuning was performed on the Dutch [BramVanroy/alpaca-cleaned-dutch](https://www.huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) dataset, which contains 52K records of instruction-following data translated from English to Dutch.

See [openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b) for all information about the base model.

## Model usage

A basic example of how to use the finetuned model.

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "robinsmits/open_llama_7b_alpaca_clean_dutch_qlora"

# The slow tokenizer is used here, with an EOS token appended to the input.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_eos_token=True)

# The adapter config records which base model the adapter was trained on.
config = PeftConfig.from_pretrained(model_name)

# Load the base model in 8-bit, then attach the adapter weights on top of it.
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, model_name)

# Prompt format used during finetuning: instruction header, then answer header.
prompt = "### Instructie:\nWat zijn de drie belangrijkste softwareonderdelen die worden gebruikt bij webontwikkeling?\n\n### Antwoord:\n"

# Tokenize the prompt and move the input ids to the GPU.
inputs = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
sample = model.generate(input_ids=inputs, max_new_tokens=512, num_beams=2, early_stopping=True, eos_token_id=tokenizer.eos_token_id)
output = tokenizer.decode(sample[0], skip_special_tokens=True)

# Print only the text generated after the prompt.
print(output.split(prompt)[1])
```

The prompt and generated output for the example above should look similar to the following.

```
### Instructie:
Wat zijn de drie belangrijkste softwareonderdelen die worden gebruikt bij webontwikkeling?

### Antwoord:
 </br>
De drie belangrijkste softwareonderdelen die worden gebruikt bij webontwikkeling zijn HTML, CSS en JavaScript.
```
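
The prompt layout used in the example can be wrapped in two small helpers. This is only a sketch: the names `PROMPT_TEMPLATE`, `build_prompt` and `extract_answer` are hypothetical and not part of this repository, and it assumes every prompt follows the same `### Instructie:` / `### Antwoord:` layout shown above.

```python
# Hypothetical helpers; the template mirrors the prompt used in the example above.
PROMPT_TEMPLATE = "### Instructie:\n{instruction}\n\n### Antwoord:\n"
ANSWER_MARKER = "### Antwoord:"


def build_prompt(instruction: str) -> str:
    """Wrap a Dutch instruction in the instruction/answer layout."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def extract_answer(decoded: str) -> str:
    """Return the text after the last answer marker, stripped of whitespace.

    If the marker is missing, the whole decoded string is returned instead
    of raising an error.
    """
    _, found, answer = decoded.rpartition(ANSWER_MARKER)
    return answer.strip() if found else decoded.strip()


prompt = build_prompt("Wat is de hoofdstad van Nederland?")
print(extract_answer(prompt + "De hoofdstad van Nederland is Amsterdam."))
```

Splitting on the answer marker rather than on the full prompt avoids an `IndexError` when decoding changes the whitespace of the prompt slightly.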

For more extensive usage and many generated samples (both good and bad), see the [Inference Notebook](https://github.com/RobinSmits/Dutch-LLMs/blob/main/Open_Llama_7B_Alpaca_Clean_Dutch_Inference.ipynb).

## Intended uses & limitations

The open_llama_7b base model was trained primarily on English text. Part of its training data was a Wikipedia dump containing pages in 20 languages, one of which was Dutch.
Given the size of the full dataset and of the Wikipedia portion, Dutch very likely made up less than 0.5% of the total data.

As a result, the output quality of this model in Dutch is very likely not always comparable to that of the various Open-Llama models that have been finetuned on English Alpaca datasets.

The primary purpose of this model is to explore and research the use of the Dutch language with an open LLM.

## Training and evaluation data

This model was trained on the [BramVanroy/alpaca-cleaned-dutch](https://www.huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) dataset.

The dataset license permits non-commercial use only. Commercial use is strictly forbidden.

```
@misc{vanroy2023language,
      title={Language Resources for {Dutch} Large Language Modelling}, 
      author={Bram Vanroy},
      year={2023},
      eprint={2312.12852},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Training procedure

This model was finetuned with a QLoRA setup on a Google Colab A100 GPU in about 6.5 hours.

The notebook used for training can be found here: [Training Notebook](https://github.com/RobinSmits/Dutch-LLMs/blob/main/Open_Llama_7B_Alpaca_Clean_Dutch_Qlora.ipynb)

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 64
- training_steps: 1536
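
The arithmetic behind these values can be checked with a short sketch. It assumes a single training device, a full batch at every step, and the usual linear warmup followed by linear decay to zero. The function name `linear_schedule_lr` is hypothetical, and the 768 steps per epoch are taken from the training results table.

```python
# Values from the hyperparameter list above.
LEARNING_RATE = 2e-4
TRAIN_BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 2
WARMUP_STEPS = 64
TRAINING_STEPS = 1536
STEPS_PER_EPOCH = 768  # from the training results table

# Effective batch size per optimizer step (assuming one device).
effective_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Examples consumed per epoch, assuming every step uses a full batch.
examples_per_epoch = STEPS_PER_EPOCH * effective_batch_size


def linear_schedule_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


print(effective_batch_size, examples_per_epoch)
for step in (0, 32, 64, 800, 1536):
    print(step, linear_schedule_lr(step))
```

The product of the per-device batch size and the accumulation steps reproduces the reported `total_train_batch_size` of 64.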

The following `bitsandbytes` quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: True
- bnb_4bit_compute_dtype: bfloat16
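
As a rough sketch, the settings above map onto keyword arguments of `BitsAndBytesConfig` from `transformers`. The names `quantization_kwargs` and `make_quantization_config` are hypothetical. The training notebook may construct the config differently, so treat this as an illustration rather than the exact code that was used.

```python
# Keyword arguments mirroring the quantization settings listed above.
quantization_kwargs = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    # Passed as a string here; the config resolves it to the torch dtype.
    "bnb_4bit_compute_dtype": "bfloat16",
}


def make_quantization_config():
    """Build the config object; requires transformers, torch and bitsandbytes."""
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**quantization_kwargs)
```

The resulting object would typically be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained` when loading the base model for QLoRA training.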

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.1240        | 1.0   | 768  | 1.1227          |
| 1.0177        | 2.0   | 1536 | 1.0645          |
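
For a more intuitive reading of the table, the validation loss can be converted to perplexity. This sketch assumes the reported loss is the mean token-level cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly above.

```python
import math

# Validation loss per epoch, copied from the table above.
validation_loss = {1: 1.1227, 2: 1.0645}


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


for epoch, loss in validation_loss.items():
    print(f"epoch {epoch}: validation perplexity ~ {perplexity(loss):.2f}")
```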

### Framework versions

- Transformers 4.30.2
- PyTorch 2.0.1+cu118
- Datasets 2.13.1
- Tokenizers 0.13.3
- PEFT 0.4.0.dev0