---
license: llama3
base_model: openchat/openchat-3.6-8b-20240522
tags:
  - openchat
  - llama3
  - C-RLFT
library_name: transformers
pipeline_tag: text-generation
quantized_by: NeuralNet-Hub
---

<div align="center">
  <a href="http://neuralnet.solutions" target="_blank">
    <img width="450" src="https://raw.githubusercontent.com/NeuralNet-Hub/assets/main/logo/LOGO_png_orig.png">
  </a>
</div>

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.


## 🌟 OpenChat-3.6-8b-20240522 llama.cpp quantization by NeuralNet 🧠🤖

All models were quantized following the instructions in the [`llama.cpp`](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize) README, reproduced here:
```
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert-hf-to-gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```
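
To produce several quantizations from the same FP16 file, the quantize step can be repeated once per type. The sketch below only builds and prints the commands. `quantize_commands` is a hypothetical helper, and the paths reuse the placeholder `./models/mymodel` directory and file naming from the steps above.

```python
import shlex

def quantize_commands(model_dir, quant_types, binary="./llama-quantize"):
    """Return one llama-quantize argv list per requested quant type."""
    source = f"{model_dir}/ggml-model-f16.gguf"  # FP16 file from the convert step
    return [
        [binary, source, f"{model_dir}/ggml-model-{quant}.gguf", quant]
        for quant in quant_types
    ]

# Dry run: print the commands instead of executing them.
for argv in quantize_commands("./models/mymodel", ["Q4_K_M", "Q5_K_M", "Q8_0"]):
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or each argv list passed to `subprocess.run(argv, check=True)` once the dry run looks right.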

Original model: https://huggingface.co/openchat/openchat-3.6-8b-20240522

## Prompt format 📝

### Original Format:
```
<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

{user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>
```
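
When calling the model without a chat-template layer, the prompt has to be assembled by hand. A minimal sketch of filling in the format above follows. `build_prompt` is a hypothetical helper, and the blank line after the final assistant header is an assumption taken from the Ollama template, where the response follows that header after an empty line.

```python
def build_prompt(system: str, user: str) -> str:
    """Fill the single-turn prompt layout with a system and a user message."""
    return (
        "<|begin_of_text|><|start_header_id|>System<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>\n\n"
    )

print(build_prompt("You are a concise assistant.", "What is GGUF?"))
```

The model's reply is expected to end with `<|eot_id|>`, so that token is a natural stop sequence.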

### Ollama Template:
```
{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

{{ .Response }}<|eot_id|>
```

## Summary models 📋

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| [openchat-3.6-8b-20240522-fp16.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-fp16.gguf) | fp16 | 16.06GB | Half precision, no quantization applied |
| [openchat-3.6-8b-20240522-q8_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q8_0.gguf) | q8_0 | 8.54GB | Extremely high quality, generally unneeded but max available quant. |
| [openchat-3.6-8b-20240522-q6_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q6_K.gguf) | q6_K | 6.59GB | Very high quality, near perfect, *recommended*. |
| [openchat-3.6-8b-20240522-q5_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_1.gguf) | q5_1 | 6.06GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_M.gguf) | q5_K_M | 5.73GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_S.gguf) | q5_K_S | 5.59GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_0.gguf) | q5_0 | 5.59GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q4_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_1.gguf) | q4_1 | 4.92GB | Good quality, *recommended*. |
| [openchat-3.6-8b-20240522-q4_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_M.gguf) | q4_K_M | 4.92GB | Good quality, uses about 4.83 bits per weight, *recommended*. |
| [openchat-3.6-8b-20240522-q4_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_S.gguf) | q4_K_S | 4.69GB | Slightly lower quality with more space savings, *recommended*. |
| [openchat-3.6-8b-20240522-q4_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_0.gguf) | q4_0 | 4.66GB | Slightly lower quality with more space savings, *recommended*. |
| [openchat-3.6-8b-20240522-q3_K_L.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_L.gguf) | q3_K_L | 4.32GB | Lower quality but usable, good for low RAM availability. |
| [openchat-3.6-8b-20240522-q3_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_M.gguf) | q3_K_M | 4.01GB | Even lower quality. |
| [openchat-3.6-8b-20240522-q3_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_S.gguf) | q3_K_S | 3.66GB | Low quality, not recommended. |
| [openchat-3.6-8b-20240522-q2_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q2_K.gguf) | q2_K | 3.17GB | Very low quality but surprisingly usable. |

## Usage with Ollama 🦙

### Direct from Ollama
```
ollama run NeuralNet/openchat-3.6-8b-20240522
```

### Create your own template
Create a plain text file named `Modelfile` (no extension needed):
```
FROM NeuralNet/openchat-3.6

# sets the temperature to 0.5 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.5

# sets the context window size to 8192, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 8192

# tokens to generate set to 4096 (max)
PARAMETER num_predict 4096

# set system
SYSTEM "You are an AI assistant created by NeuralNet, your answers are clear and concise"

# template OpenChat3.6
TEMPLATE "{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"
```
Then, with Ollama installed, run:
```
ollama create openchat-3.6-8b-20240522 -f Modelfile
```
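
Once the model exists locally, it can also be queried over Ollama's local REST API. The sketch below assumes the server is running on its default address (`localhost:11434`) and that the model was created under the name used above. `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def build_request(prompt, model="openchat-3.6-8b-20240522"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="openchat-3.6-8b-20240522"):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize GGUF in one sentence.")` returns the reply as a string. The system prompt and template come from the Modelfile, so only the user prompt is sent.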


## Download Models Using huggingface-cli 🤗

### Installation of `huggingface_hub[cli]`
Ensure you have the necessary CLI tool installed by running:
```bash
pip install -U "huggingface_hub[cli]"
```

### Downloading Specific Model Files
To download a specific model file, use the following command:
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-q4_K_M.gguf" --local-dir ./
```
This command downloads the specified model file and places it in the current directory (./).
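
If the CLI is not available, the same file can be fetched from a direct URL. The links in the summary table point at `/blob/main/` pages, and the Hub serves the raw file under `/resolve/main/` instead. The sketch below builds that URL from a quant type. `gguf_filename` and `resolve_url` are hypothetical helpers, and the file naming is taken from the table.

```python
REPO_ID = "NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF"
MODEL_NAME = "openchat-3.6-8b-20240522"

def gguf_filename(quant: str) -> str:
    """File name as listed in the summary table, e.g. quant='q4_K_M'."""
    return f"{MODEL_NAME}-{quant}.gguf"

def resolve_url(quant: str, revision: str = "main") -> str:
    """Direct-download URL for one quantized file."""
    return f"https://huggingface.co/{REPO_ID}/resolve/{revision}/{gguf_filename(quant)}"

print(resolve_url("q4_K_M"))
```

File names are case-sensitive, so the quant type must match the table exactly (`q4_K_M`, not `Q4_K_M`).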

### Downloading Large Models Split into Multiple Files
Models exceeding 50GB are typically split into multiple files for easier download and management. None of the files listed above reach that size, but for split models the command takes this form:
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-Q8_0.gguf/*" --local-dir openchat-3.6-8b-20240522-Q8_0
```
This command downloads all files in the specified directory and places them into the chosen local folder (openchat-3.6-8b-20240522-Q8_0). You can choose to download everything in place or specify a new location for the downloaded files.

## Which File Should I Choose? 📈

A comprehensive analysis with performance charts is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).

### Assessing System Capabilities
1. **Determine Your Model Size**: Start by checking the amount of RAM and VRAM available in your system. This will help you decide the largest possible model you can run.
2. **Optimizing for Speed**:
    - **GPU Utilization**: To run your model as quickly as possible, aim to fit the entire model into your GPU's VRAM. Pick a version that's 1-2GB smaller than the total VRAM.
3. **Maximizing Quality**:
    - **Combined Memory**: For the highest possible quality, sum your system RAM and GPU's VRAM. Then choose a model that's 1-2GB smaller than this combined total.
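
The sizing rule above can be sketched as a small lookup. The sizes are copied from the summary table. `pick_quant`, its parameters, and the 2GB default headroom (the upper end of the 1-2GB margin) are illustrative assumptions.

```python
# (quant type, file size in GB) from the summary table, largest first
QUANTS = [
    ("fp16", 16.06), ("q8_0", 8.54), ("q6_K", 6.59), ("q5_1", 6.06),
    ("q5_K_M", 5.73), ("q5_K_S", 5.59), ("q5_0", 5.59), ("q4_1", 4.92),
    ("q4_K_M", 4.92), ("q4_K_S", 4.69), ("q4_0", 4.66), ("q3_K_L", 4.32),
    ("q3_K_M", 4.01), ("q3_K_S", 3.66), ("q2_K", 3.17),
]

def pick_quant(vram_gb, ram_gb=0.0, headroom_gb=2.0):
    """Return the largest quant that fits in the memory budget, or None.

    Pass only vram_gb to optimize for speed (model fully on the GPU);
    add ram_gb to optimize for quality using combined memory.
    """
    budget = vram_gb + ram_gb - headroom_gb
    for name, size_gb in QUANTS:
        if size_gb <= budget:
            return name
    return None

print(pick_quant(8))             # speed: fit entirely in 8GB of VRAM
print(pick_quant(8, ram_gb=16))  # quality: VRAM plus system RAM
```

The function ranks by file size only. The descriptions in the table still matter when two files are close in size.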

### Deciding Between 'I-Quant' and 'K-Quant'
1. **Simplicity**:
    - **K-Quant**: If you prefer a straightforward approach, select a K-quant model. These are labeled as 'QX_K_X', such as Q5_K_M.
2. **Advanced Configuration**:
    - **Feature Chart**: For a more nuanced choice, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix).
    - **I-Quant Models**: Best suited for configurations below Q4 and for systems running cuBLAS (Nvidia) or rocBLAS (AMD). These are labeled 'IQX_X', such as IQ3_M, and offer better performance for their size.
    - **Compatibility Considerations**:
        - **I-Quant Models**: Usable on CPU and Apple Metal, but slower than their K-quant counterparts, so you trade inference speed for better quality at a given file size.
        - **AMD Cards**: Verify if you are using the rocBLAS build or the Vulkan build. I-quants are not compatible with Vulkan.
        - **Current Support**: At the time of writing, LM Studio offers a preview with ROCm support, and other inference engines provide specific ROCm builds.

By following these guidelines, you can make an informed decision on which file best suits your system and performance needs.

## Contact us 🌐

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.

- Website: https://neuralnet.solutions
- Email: info[at]neuralnet.solutions