hdnh2006 committed

Commit e473468 · 1 Parent(s): a1766ca

README added

Files changed (1): README.md (+189 -3)

---
license: llama3
base_model: openchat/openchat-3.6-8b-20240522
tags:
- openchat
- llama3
- C-RLFT
library_name: transformers
pipeline_tag: text-generation
prompt_template: |
  <|begin_of_text|><|start_header_id|>System<|end_header_id|>

  {system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

  {user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>
quantized_by: NeuralNet-Hub
---

<div align="center">
<a href="http://neuralnet.solutions" target="_blank">
<img width="450" src="https://raw.githubusercontent.com/NeuralNet-Hub/assets/main/logo/LOGO_png_orig.png">
</a>
</div>

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.

## 🌟 OpenChat-3.6-8b-20240522 llama.cpp quantization by NeuralNet 🧠🤖

All the models have been quantized following the instructions provided by [`llama.cpp`](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize). The process is:
```bash
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert-hf-to-gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```
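
For this repository specifically, an end-to-end run would look roughly like the sketch below. The checkpoint location, output filenames, and build step are illustrative assumptions, not an exact record of how these files were produced.
```bash
# clone and build llama.cpp (illustrative; any recent release that ships llama-quantize works)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make
python3 -m pip install -r requirements.txt

# assumes the original HF checkpoint was downloaded to ./models/openchat-3.6-8b-20240522
python3 convert-hf-to-gguf.py models/openchat-3.6-8b-20240522/ \
  --outtype f16 --outfile models/openchat-3.6-8b-20240522-fp16.gguf

# produce one of the quantized variants listed below, e.g. q4_K_M
./llama-quantize models/openchat-3.6-8b-20240522-fp16.gguf \
  models/openchat-3.6-8b-20240522-q4_K_M.gguf Q4_K_M
```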

Original model: https://huggingface.co/openchat/openchat-3.6-8b-20240522

## Prompt format 📝

### Original Format:
```
<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

{user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>
```
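
As an illustration of the format in use, a single-turn prompt can be assembled by substituting `{system}` and `{user}` and fed to llama.cpp's CLI as sketched here; the model path, sample texts, and generation settings are placeholders rather than part of the original card.
```bash
# build the prompt from the template above
PROMPT='<|begin_of_text|><|start_header_id|>System<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

What is GGUF quantization?<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

'

# run it against one of the GGUF files from this repository
./llama-cli -m openchat-3.6-8b-20240522-q4_K_M.gguf -p "$PROMPT" -n 256 --temp 0.5
```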

### Ollama Template:
```
{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

{{ .Response }}<|eot_id|>
```

## Summary of models 📋

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| [openchat-3.6-8b-20240522-fp16.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-fp16.gguf) | fp16 | 16.06GB | Half precision, no quantization applied. |
| [openchat-3.6-8b-20240522-q8_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q8_0.gguf) | q8_0 | 8.54GB | Extremely high quality, generally unneeded but max available quant. |
| [openchat-3.6-8b-20240522-q6_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q6_K.gguf) | q6_K | 6.59GB | Very high quality, near perfect, *recommended*. |
| [openchat-3.6-8b-20240522-q5_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_1.gguf) | q5_1 | 6.06GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_M.gguf) | q5_K_M | 5.73GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_S.gguf) | q5_K_S | 5.59GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_0.gguf) | q5_0 | 5.59GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q4_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_1.gguf) | q4_1 | 4.92GB | Good quality, *recommended*. |
| [openchat-3.6-8b-20240522-q4_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_M.gguf) | q4_K_M | 4.92GB | Good quality, uses about 4.83 bits per weight, *recommended*. |
| [openchat-3.6-8b-20240522-q4_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_S.gguf) | q4_K_S | 4.69GB | Slightly lower quality with more space savings, *recommended*. |
| [openchat-3.6-8b-20240522-q4_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_0.gguf) | q4_0 | 4.66GB | Slightly lower quality with more space savings, *recommended*. |
| [openchat-3.6-8b-20240522-q3_K_L.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_L.gguf) | q3_K_L | 4.32GB | Lower quality but usable, good for low RAM availability. |
| [openchat-3.6-8b-20240522-q3_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_M.gguf) | q3_K_M | 4.01GB | Even lower quality. |
| [openchat-3.6-8b-20240522-q3_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_S.gguf) | q3_K_S | 3.66GB | Low quality, not recommended. |
| [openchat-3.6-8b-20240522-q2_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q2_K.gguf) | q2_K | 3.17GB | Very low quality but surprisingly usable. |

## Usage with Ollama 🦙

### Directly from Ollama
```bash
ollama run NeuralNet/openchat-3.6-8b-20240522
```
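
Once the model is pulled, it can also be queried over Ollama's local HTTP API. The sketch below assumes a default Ollama server listening on localhost:11434 and an example prompt of your choosing.
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "NeuralNet/openchat-3.6-8b-20240522",
  "prompt": "Summarize what GGUF quantization is in two sentences.",
  "stream": false
}'
```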

### Create your own template
Create a plain text file named `Modelfile` (no file extension needed):
```
FROM ./openchat-3.6-8b-20240522-GGUF/openchat-3.6-8b-20240522-q4_K_M.gguf

# set the temperature to 0.5 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.5

# set the context window to 8192 tokens; this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 8192

# limit generation to 4096 tokens (max)
PARAMETER num_predict 4096

# set the system prompt
SYSTEM "You are an AI assistant created by NeuralNet, your answers are clear and concise"

# OpenChat 3.6 template
TEMPLATE "{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"
```
Then, with Ollama already installed, create the model from the `Modelfile`:
```bash
ollama create openchat-3.6-8b-20240522 -f Modelfile
```
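
After the create step finishes, the new local model can be invoked like any other Ollama model; the prompt below is only an example.
```bash
ollama run openchat-3.6-8b-20240522 "Explain the difference between the q4_K_M and q8_0 quants in one paragraph."
```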

## Download Models Using huggingface-cli 🤗

### Installation of `huggingface_hub[cli]`
Ensure you have the necessary CLI tool installed by running:
```bash
pip install -U "huggingface_hub[cli]"
```

### Downloading Specific Model Files
To download a specific model file, use the following command:
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-q4_K_M.gguf" --local-dir ./
```
This command downloads the specified model file and places it in the current directory (`./`).
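
If you would rather mirror every quant at once instead of picking a single file, the same CLI can download the whole repository; the target folder name below is just an example.
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --local-dir ./openchat-3.6-8b-20240522-GGUF
```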

### Downloading Large Models Split into Multiple Files
For models exceeding 50GB, which are typically split into multiple files for easier download and management:
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-Q8_0.gguf/*" --local-dir openchat-3.6-8b-20240522-Q8_0
```
This command downloads all files in the specified directory and places them into the chosen local folder (openchat-3.6-8b-20240522-Q8_0). You can choose to download everything in place or specify a new location for the downloaded files.

## Which File Should I Choose? 📈

A comprehensive analysis with performance charts is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).

### Assessing System Capabilities
1. **Determine Your Model Size**: Start by checking the amount of RAM and VRAM available in your system (a quick way to check is sketched after this list). This will help you decide the largest possible model you can run.
2. **Optimizing for Speed**:
   - **GPU Utilization**: To run your model as quickly as possible, aim to fit the entire model into your GPU's VRAM. Pick a version that's 1-2GB smaller than the total VRAM.
3. **Maximizing Quality**:
   - **Combined Memory**: For the highest possible quality, sum your system RAM and your GPU's VRAM, then choose a model that's 1-2GB smaller than this combined total.
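
As a quick, purely illustrative way to read those numbers on a Linux machine with an NVIDIA GPU (adapt the commands to your own platform):
```bash
# total and available system RAM
free -h

# total VRAM per GPU
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
```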

### Deciding Between 'I-Quant' and 'K-Quant'
1. **Simplicity**:
   - **K-Quant**: If you prefer a straightforward approach, select a K-quant model. These are labeled 'QX_K_X', such as Q5_K_M.
2. **Advanced Configuration**:
   - **Feature Chart**: For a more nuanced choice, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix).
   - **I-Quant Models**: Best suited for quantization levels below Q4 and for systems running cuBLAS (Nvidia) or rocBLAS (AMD). These are labeled 'IQX_X', such as IQ3_M, and offer better quality for their size.
   - **Compatibility Considerations**:
     - **I-Quant Models**: While usable on CPU and Apple Metal, they run slower than their K-quant counterparts, so you must weigh the speed cost against the quality gain.
     - **AMD Cards**: Verify whether you are using the rocBLAS build or the Vulkan build; I-quants are not compatible with Vulkan.
     - **Current Support**: At the time of writing, LM Studio offers a preview with ROCm support, and other inference engines provide specific ROCm builds.

By following these guidelines, you can make an informed decision on which file best suits your system and performance needs.

## Contact us 🌐

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.

- Website: https://neuralnet.solutions
- Email: info[at]neuralnet.solutions