---
license: llama3
base_model: openchat/openchat-3.6-8b-20240522
tags:
- openchat
- llama3
- C-RLFT
library_name: transformers
pipeline_tag: text-generation
prompt_template: |
  <|begin_of_text|><|start_header_id|>System<|end_header_id|>

  {system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

  {user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>
quantized_by: NeuralNet-Hub
---

<div align="center">
<a href="http://neuralnet.solutions" target="_blank">
<img width="450" src="https://raw.githubusercontent.com/NeuralNet-Hub/assets/main/logo/LOGO_png_orig.png">
</a>
</div>

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.


## OpenChat-3.6-8b-20240522 llama.cpp quantization by NeuralNet

All the models have been quantized following the instructions provided by [`llama.cpp`](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize), namely:
```
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model

# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json

# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert-hf-to-gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```
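
As a concrete illustration, here is roughly how the steps above would look for this model (a sketch only; it assumes the original [openchat/openchat-3.6-8b-20240522](https://huggingface.co/openchat/openchat-3.6-8b-20240522) weights have already been downloaded into `./models/openchat-3.6-8b-20240522`):
```bash
# convert the original HF checkpoint to GGUF FP16
python3 convert-hf-to-gguf.py models/openchat-3.6-8b-20240522/

# quantize the FP16 GGUF, here to Q4_K_M
./llama-quantize ./models/openchat-3.6-8b-20240522/ggml-model-f16.gguf ./models/openchat-3.6-8b-20240522/openchat-3.6-8b-20240522-q4_K_M.gguf Q4_K_M
```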

Original model: https://huggingface.co/openchat/openchat-3.6-8b-20240522

## Prompt format

### Original Format:
```
<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

{user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>
```

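For illustration, a single-turn exchange rendered with this template might look as follows (the system and user messages are placeholders, not part of the format itself):
```
<|begin_of_text|><|start_header_id|>System<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

What is GGUF quantization?<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>
```
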
### Ollama Template:
```
{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

{{ .Response }}<|eot_id|>
```

## Model summary

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| [openchat-3.6-8b-20240522-fp16.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-fp16.gguf) | fp16 | 16.06GB | Half precision, no quantization applied. |
| [openchat-3.6-8b-20240522-q8_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q8_0.gguf) | q8_0 | 8.54GB | Extremely high quality, generally unneeded but max available quant. |
| [openchat-3.6-8b-20240522-q6_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q6_K.gguf) | q6_K | 6.59GB | Very high quality, near perfect, *recommended*. |
| [openchat-3.6-8b-20240522-q5_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_1.gguf) | q5_1 | 6.06GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_M.gguf) | q5_K_M | 5.73GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_S.gguf) | q5_K_S | 5.59GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_0.gguf) | q5_0 | 5.59GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q4_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_1.gguf) | q4_1 | 4.92GB | Good quality, *recommended*. |
| [openchat-3.6-8b-20240522-q4_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_M.gguf) | q4_K_M | 4.92GB | Good quality, uses about 4.83 bits per weight, *recommended*. |
| [openchat-3.6-8b-20240522-q4_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_S.gguf) | q4_K_S | 4.69GB | Slightly lower quality with more space savings, *recommended*. |
| [openchat-3.6-8b-20240522-q4_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_0.gguf) | q4_0 | 4.66GB | Slightly lower quality with more space savings, *recommended*. |
| [openchat-3.6-8b-20240522-q3_K_L.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_L.gguf) | q3_K_L | 4.32GB | Lower quality but usable, good for low RAM availability. |
| [openchat-3.6-8b-20240522-q3_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_M.gguf) | q3_K_M | 4.01GB | Even lower quality. |
| [openchat-3.6-8b-20240522-q3_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_S.gguf) | q3_K_S | 3.66GB | Low quality, not recommended. |
| [openchat-3.6-8b-20240522-q2_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q2_K.gguf) | q2_K | 3.17GB | Very low quality but surprisingly usable. |

## Usage with Ollama

### Direct from Ollama
```
ollama run NeuralNet/openchat-3.6-8b-20240522
```

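You can also query the pulled model through Ollama's REST API; a minimal sketch, assuming the Ollama server is listening on its default port 11434 (the prompt is just an example):
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "NeuralNet/openchat-3.6-8b-20240522",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
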
### Create your own template
Create a plain text file named `Modelfile` (no extension needed):
```
FROM ./openchat-3.6-8b-20240522-GGUF/openchat-3.6-8b-20240522-q4_K_M.gguf

# sets the temperature to 0.5 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.5

# sets the context window size to 8192; this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 8192

# tokens to generate set to 4096 (max)
PARAMETER num_predict 4096

# set the system message
SYSTEM "You are an AI assistant created by NeuralNet, your answers are clear and concise"

# OpenChat 3.6 template
TEMPLATE "{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"
```
Then, once Ollama is installed, just run:
```
ollama create openchat-3.6-8b-20240522 -f Modelfile
```
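
Once the model has been created, you can chat with it locally (the prompt below is just an example):
```bash
ollama run openchat-3.6-8b-20240522 "Explain the difference between q4_K_M and q8_0 in two sentences."
```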


## Download Models Using huggingface-cli

### Installation of `huggingface_hub[cli]`
Ensure you have the necessary CLI tool installed by running:
```bash
pip install -U "huggingface_hub[cli]"
```

### Downloading Specific Model Files
To download a specific model file, use the following command:
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-q4_K_M.gguf" --local-dir ./
```
This command downloads the specified model file and places it in the current directory (`./`).

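After downloading, the GGUF file can also be run directly with llama.cpp; a minimal sketch, assuming you have already built the project as in the quantization section above (prompt and generation settings are illustrative):
```bash
./llama-cli -m ./openchat-3.6-8b-20240522-q4_K_M.gguf -c 8192 --temp 0.5 -n 256 -p "Write a short introduction to GGUF quantization."
```
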
### Downloading Large Models Split into Multiple Files
For models exceeding 50GB, which are typically split into multiple files for easier download and management:
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-q8_0.gguf/*" --local-dir openchat-3.6-8b-20240522-q8_0
```
This command downloads all files in the specified directory and places them into the chosen local folder (`openchat-3.6-8b-20240522-q8_0`). You can download everything in place or specify a new location for the downloaded files.

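Recent llama.cpp builds can usually load the first split directly; if you prefer a single file, the `llama-gguf-split` tool that ships with llama.cpp can merge the parts (a hedged sketch; the split file names below are illustrative):
```bash
./llama-gguf-split --merge ./openchat-3.6-8b-20240522-q8_0/openchat-3.6-8b-20240522-q8_0-00001-of-00002.gguf ./openchat-3.6-8b-20240522-q8_0.gguf
```
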
## Which File Should I Choose?

A comprehensive analysis with performance charts is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).

### Assessing System Capabilities
1. **Determine Your Model Size**: Start by checking the amount of RAM and VRAM available in your system (a quick way to check both is sketched after this list). This will help you decide the largest possible model you can run.
2. **Optimizing for Speed**:
   - **GPU Utilization**: To run your model as quickly as possible, aim to fit the entire model into your GPU's VRAM. Pick a version that's 1-2GB smaller than the total VRAM.
3. **Maximizing Quality**:
   - **Combined Memory**: For the highest possible quality, sum your system RAM and GPU's VRAM. Then choose a model that's 1-2GB smaller than this combined total.

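A quick way to check the numbers mentioned above (assuming a Linux system with an NVIDIA GPU; adapt for your platform):
```bash
# total and available system RAM
free -h

# GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
```
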
### Deciding Between 'I-Quant' and 'K-Quant'
1. **Simplicity**:
   - **K-Quant**: If you prefer a straightforward approach, select a K-quant model. These are labeled as 'QX_K_X', such as Q5_K_M.
2. **Advanced Configuration**:
   - **Feature Chart**: For a more nuanced choice, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix).
   - **I-Quant Models**: Best suited for quantizations below Q4 and for systems running cuBLAS (Nvidia) or rocBLAS (AMD). These are labeled 'IQX_X', such as IQ3_M, and offer better performance for their size.
3. **Compatibility Considerations**:
   - **I-Quant Models**: While usable on CPU and Apple Metal, they run slower than their K-quant counterparts, so the choice becomes a tradeoff between speed and quality.
   - **AMD Cards**: Verify whether you are using the rocBLAS build or the Vulkan build; I-quants are not compatible with Vulkan.
   - **Current Support**: At the time of writing, LM Studio offers a preview with ROCm support, and other inference engines provide specific ROCm builds.

By following these guidelines, you can make an informed decision on which file best suits your system and performance needs.

## Contact us

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.

Website: https://neuralnet.solutions
Email: info[at]neuralnet.solutions