File size: 11,308 Bytes
876c8f1 f230207 876c8f1 f230207 876c8f1 f230207 710be35 f230207 d5c0069 f230207 0927925 f230207 68ce748 f230207 68ce748 ef255e3 68ce748 f230207 68ce748 f230207 68ce748 f230207 0927925 68ce748 0927925 68ce748 f230207 68ce748 f230207 0927925 4675e4a 0927925 68ce748 f230207 68ce748 f230207 68ce748 f230207 68ce748 beeda4c c33ccdd beeda4c c33ccdd beeda4c 4620a93 beeda4c 4620a93 beeda4c 4620a93 beeda4c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 |
---
language:
- 'no'
- nb
- nn
inference: true
tags:
- mistral
- norwegian
- instruction
- chat
license: apache-2.0
pipeline_tag: text-generation
datasets:
- CohereForAI/aya_dataset
- OpenAssistant/oasst1
- OpenAssistant/oasst2
- laion/OIG
- HuggingFaceH4/no_robots
- databricks/databricks-dolly-15k
- glaiveai/glaive-code-assistant-v2
---
# **Instruction-tuned NorMistral-7b-warm**
<img align="center" src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>
This is a model instruction-tuned on open datasets released under the most permissive apache-2.0 licence (in other words, we don't use any datasets generated by ChatGPT) — thus we can release this model under the same license and make it openly available for commercial applications. The model has been finetuned on 4096 context length, twice as many tokens as the base model.
The released weights are still a work in progress and they might change in the future. This is the first iteration of instruction-tuning our NorMistral models and it currently uses only the SFT phase without any preference optimization. Please let us know your feedback to improve the model in future releases.
## Finetuning corpus
The corpus was compiled by this process:
1. We gathered all openly available datasets: [Aya](https://huggingface.co/datasets/CohereForAI/aya_dataset), [OASST 1](https://huggingface.co/datasets/OpenAssistant/oasst1), [OASST 2](https://huggingface.co/datasets/OpenAssistant/oasst2), [OIG-small-chip2](https://huggingface.co/datasets/laion/OIG), [No Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots), [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and [Glaive code assistant](Glaive-code-assistant-v2).
2. These were first manually inspected and filtered, and then automatically filtered with [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) to remove incorrect, offensive, non-English and American-centric responses.
3. The responses were augmented to be more descriptive by [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
4. Since most of that dataset contains only a single dialogue turn, we generated more turns using [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
5. Finally, we translated the resulting dataset into Bokmål and Nynorsk using [NorMistral-7b-warm](https://huggingface.co/norallm/normistral-7b-warm).
## How to run the model?
### 1. Prompt format
NorMistral uses ChatML-like format for structuring the (multi-turn) conversations. An example of a prompt in this format looks like the following (notice the special `<|im_start|>` and `<|im_end|>` tokens).
```
<|im_start|> user
Hva er hovedstaden i Norge?<|im_end|>
<|im_start|> assistant
Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
<|im_start|> user
Gi meg en liste over de beste stedene å besøke i hovedstaden<|im_end|>
<|im_start|> assistant
```
This prompt-format is available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) in the NorMistral's tokens, so it can be easily applied by calling the
`tokenizer.apply_chat_template()` method:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm-instruct")
messages = [
{"role": "user", "content": "Hva er hovedstaden i Norge?"},
{"role": "assistant", "content": "Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser."},
{"role": "user", "content": "Gi meg en liste over de beste stedene å besøke i hovedstaden"}
]
gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
```
When tokenizing messages for generation, set `add_generation_prompt=True` when calling `apply_chat_template()`. This will append `<|im_start|>assistant\n` to your prompt, to ensure
that the model continues with an assistant response.
### 2. Generation parameters
The model is quite sensitive to generation parameters, it's important to set them correctly. We give an example of a reasonable generation setting below. Note that other libraries have different defaults and that it's important to check them.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-7b-warm-instruct", torch_dtype=torch.bfloat16)
model.generate(
gen_input,
max_new_tokens=1024,
top_k=64, # top-k sampling
top_p=0.9, # nucleus sampling
temperature=0.3, # a low temparature to make the outputs less chaotic
repetition_penalty=1.0, # turn the repetition penalty off, having it on can lead to very bad outputs
do_sample=True, # randomly sample the outputs
use_cache=True # speed-up generation
)
```
## About the base model
NorMistral-7b-warm is a large Norwegian language model initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and
continuously pretrained on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
This model is a part of the NORA.LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
All the models are pre-trained on the same dataset and with the same tokenizer.
NorMistral-7b-warm has over 7 billion parameters and is based on [the Mistral architecture](https://huggingface.co/mistralai/Mistral-7B-v0.1).
The NORA.LLM language model family includes (as of now):
- [**NorMistral-7b-warm**](https://huggingface.co/norallm/normistral-7b-warm) -- an LLM initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and continuously pretrained on Norwegian data;
- [**NorMistral-7b-scratch**](https://huggingface.co/norallm/normistral-7b-scratch) -- a Mistral-based LLM pretrained from scratch on Norwegian data;
- [**NorBLOOM-7b-scratch**](https://huggingface.co/norallm/NorBLOOM-7b-scratch) -- a BLOOM-based LLM pretrained from scratch on Norwegian data.
_____
## Quantization
### Provided files
| Name | Quant method | Bits Per Weight | Size | Max RAM/VRAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| [normistral-7b-warm-instruct.Q3_K_M.gguf](https://huggingface.co/norallm/normistral-7b-warm/blob/main/normistral-7b-warm-instruct.Q3_K_M.gguf) | Q3_K_M | 3.89 | 3.28 GB| 5.37 GB | very small, high loss of quality |
| [normistral-7b-warm-instruct.Q4_K_M.gguf](https://huggingface.co/norallm/normistral-7b-warm/blob/main/normistral-7b-warm-instruct.Q4_K_M.gguf) | Q4_K_M | 4.83 | 4.07 GB| 6.16 GB | medium, balanced quality |
| [normistral-7b-warm-instruct.Q5_K_M.gguf](https://huggingface.co/norallm/normistral-7b-warm/blob/main/normistral-7b-warm-instruct.Q5_K_M.gguf) | Q5_K_M | 5.67 | 4.78 GB| 6.87 GB | large, very low quality loss |
| [normistral-7b-warm-instruct.Q6_K.gguf](https://huggingface.co/norallm/normistral-7b-warm/blob/main/normistral-7b-warm-instruct.Q6_K.gguf) | Q6_K | 6.56 | 5.54 GB| 7.63 GB | very large, extremely low quality loss |
| [normistral-7b-warm-instruct.Q8_0.gguf](https://huggingface.co/norallm/normistral-7b-warm/blob/main/normistral-7b-warm-instruct.Q8_0.gguf) | Q8_0 | 8.50 | 7.17 GB| 9.26 GB | very large, extremely low quality loss |
### How to run from Python code
You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for example.
#### How to load this model in Python code, using llama-cpp-python
For full documentation, please see: [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/).
#### First install the package
Run one of the following commands, according to your system:
```shell
# Base llama-ccp-python with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# In windows, to set the variables CMAKE_ARGS in PowerShell, follow this format; eg for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on"
pip install llama-cpp-python
```
#### Simple llama-cpp-python example code
```python
from llama_cpp import Llama
# Directly from huggingface-hub (requires huggingface-hub to be installed)
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama.from_pretrained(
repo_id="norallm/normistral-7b-warm-instruct", # HuggingFace repository containing the GGUF files.
filename="*Q4_K_M.gguf", # suffix of the filename containing the level of quantization.
n_ctx=32768, # The max sequence length to use - note that longer sequence lengths require much more resources
n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
n_gpu_layers=35 # The number of layers to offload to GPU, if you have GPU acceleration available
)
# Simple inference example
output = llm(
"""<s><|im_start|> user
Hva kan jeg bruke einstape til?<|im_end|>
<|im_start|> assistant
""", # Prompt
max_tokens=512, # Generate up to 512 tokens
stop=["<|im_end|>"], # Example stop token
echo=True, # Whether to echo the prompt
temperature=0.3 # Temperature to set, for Q3_K_M, Q4_K_M, Q5_K_M, and Q6_0 it is recommended to set it relatively low.
)
# Chat Completion API
llm.create_chat_completion(
messages = [
{
"role": "user",
"content": "Hva kan jeg bruke einstape til?"
}
]
)
``` |