hdnh2006 committed

Commit e473468 · 1 Parent(s): a1766ca

README added

Files changed (1): README.md (+189 -3)

---
license: llama3
base_model: openchat/openchat-3.6-8b-20240522
tags:
- openchat
- llama3
- C-RLFT
library_name: transformers
pipeline_tag: text-generation
prompt_template: |
  <|begin_of_text|><|start_header_id|>System<|end_header_id|>

  {system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

  {user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>
quantized_by: NeuralNet-Hub
---

<div align="center">
<a href="http://neuralnet.solutions" target="_blank">
<img width="450" src="https://raw.githubusercontent.com/NeuralNet-Hub/assets/main/logo/LOGO_png_orig.png">
</a>
</div>

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.

## 🌟 OpenChat-3.6-8b-20240522 llama.cpp quantization by NeuralNet 🧠🤖

All the models have been quantized following the instructions provided by [`llama.cpp`](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize). The process is:
```bash
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert-hf-to-gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```
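
For this repository specifically, an end-to-end run would look roughly like the sketch below. The checkpoint location, output filenames, and build step are illustrative assumptions, not an exact record of how these files were produced.
```bash
# clone and build llama.cpp (illustrative; any recent release that ships llama-quantize works)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make
python3 -m pip install -r requirements.txt

# assumes the original HF checkpoint was downloaded to ./models/openchat-3.6-8b-20240522
python3 convert-hf-to-gguf.py models/openchat-3.6-8b-20240522/ \
  --outtype f16 --outfile models/openchat-3.6-8b-20240522-fp16.gguf

# produce one of the quantized variants listed below, e.g. q4_K_M
./llama-quantize models/openchat-3.6-8b-20240522-fp16.gguf \
  models/openchat-3.6-8b-20240522-q4_K_M.gguf Q4_K_M
```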

Original model: https://huggingface.co/openchat/openchat-3.6-8b-20240522

## Prompt format 📝

### Original Format:
```
<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

{user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>
```
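
As an illustration of the format in use, a single-turn prompt can be assembled by substituting `{system}` and `{user}` and fed to llama.cpp's CLI as sketched here; the model path, sample texts, and generation settings are placeholders rather than part of the original card.
```bash
# build the prompt from the template above
PROMPT='<|begin_of_text|><|start_header_id|>System<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>

What is GGUF quantization?<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

'

# run it against one of the GGUF files from this repository
./llama-cli -m openchat-3.6-8b-20240522-q4_K_M.gguf -p "$PROMPT" -n 256 --temp 0.5
```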

### Ollama Template:
```
{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

{{ .Response }}<|eot_id|>
```

## Summary of models 📋

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| [openchat-3.6-8b-20240522-fp16.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-fp16.gguf) | fp16 | 16.06GB | Half precision, no quantization applied. |
| [openchat-3.6-8b-20240522-q8_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q8_0.gguf) | q8_0 | 8.54GB | Extremely high quality, generally unneeded but max available quant. |
| [openchat-3.6-8b-20240522-q6_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q6_K.gguf) | q6_K | 6.59GB | Very high quality, near perfect, *recommended*. |
| [openchat-3.6-8b-20240522-q5_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_1.gguf) | q5_1 | 6.06GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_M.gguf) | q5_K_M | 5.73GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_S.gguf) | q5_K_S | 5.59GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q5_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_0.gguf) | q5_0 | 5.59GB | High quality, *recommended*. |
| [openchat-3.6-8b-20240522-q4_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_1.gguf) | q4_1 | 4.92GB | Good quality, *recommended*. |
| [openchat-3.6-8b-20240522-q4_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_M.gguf) | q4_K_M | 4.92GB | Good quality, uses about 4.83 bits per weight, *recommended*. |
| [openchat-3.6-8b-20240522-q4_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_S.gguf) | q4_K_S | 4.69GB | Slightly lower quality with more space savings, *recommended*. |
| [openchat-3.6-8b-20240522-q4_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_0.gguf) | q4_0 | 4.66GB | Slightly lower quality with more space savings, *recommended*. |
| [openchat-3.6-8b-20240522-q3_K_L.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_L.gguf) | q3_K_L | 4.32GB | Lower quality but usable, good for low RAM availability. |
| [openchat-3.6-8b-20240522-q3_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_M.gguf) | q3_K_M | 4.01GB | Even lower quality. |
| [openchat-3.6-8b-20240522-q3_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_S.gguf) | q3_K_S | 3.66GB | Low quality, not recommended. |
| [openchat-3.6-8b-20240522-q2_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q2_K.gguf) | q2_K | 3.17GB | Very low quality but surprisingly usable. |

## Usage with Ollama 🦙

### Directly from Ollama
```bash
ollama run NeuralNet/openchat-3.6-8b-20240522
```
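
Once the model is pulled, it can also be queried over Ollama's local HTTP API. The sketch below assumes a default Ollama server listening on localhost:11434 and an example prompt of your choosing.
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "NeuralNet/openchat-3.6-8b-20240522",
  "prompt": "Summarize what GGUF quantization is in two sentences.",
  "stream": false
}'
```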

### Create your own template
Create a plain text file named `Modelfile` (no file extension needed):
```
FROM ./openchat-3.6-8b-20240522-GGUF/openchat-3.6-8b-20240522-q4_K_M.gguf

# set the temperature to 0.5 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.5

# set the context window to 8192 tokens; this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 8192

# limit generation to 4096 tokens (max)
PARAMETER num_predict 4096

# set the system prompt
SYSTEM "You are an AI assistant created by NeuralNet, your answers are clear and concise"

# OpenChat 3.6 template
TEMPLATE "{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"
```
Then, with Ollama already installed, create the model from the `Modelfile`:
```bash
ollama create openchat-3.6-8b-20240522 -f Modelfile
```
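
After the create step finishes, the new local model can be invoked like any other Ollama model; the prompt below is only an example.
```bash
ollama run openchat-3.6-8b-20240522 "Explain the difference between the q4_K_M and q8_0 quants in one paragraph."
```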

## Download Models Using huggingface-cli 🤗

### Installation of `huggingface_hub[cli]`
Ensure you have the necessary CLI tool installed by running:
```bash
pip install -U "huggingface_hub[cli]"
```

### Downloading Specific Model Files
To download a specific model file, use the following command:
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-q4_K_M.gguf" --local-dir ./
```
This command downloads the specified model file and places it in the current directory (`./`).
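
If you would rather mirror every quant at once instead of picking a single file, the same CLI can download the whole repository; the target folder name below is just an example.
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --local-dir ./openchat-3.6-8b-20240522-GGUF
```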

### Downloading Large Models Split into Multiple Files
For models exceeding 50GB, which are typically split into multiple files for easier download and management:
```bash
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-Q8_0.gguf/*" --local-dir openchat-3.6-8b-20240522-Q8_0
```
This command downloads all files in the specified directory and places them into the chosen local folder (openchat-3.6-8b-20240522-Q8_0). You can choose to download everything in place or specify a new location for the downloaded files.

## Which File Should I Choose? 📈

A comprehensive analysis with performance charts is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).

### Assessing System Capabilities
1. **Determine Your Model Size**: Start by checking the amount of RAM and VRAM available in your system (a quick way to check is sketched after this list). This will help you decide the largest possible model you can run.
2. **Optimizing for Speed**:
   - **GPU Utilization**: To run your model as quickly as possible, aim to fit the entire model into your GPU's VRAM. Pick a version that's 1-2GB smaller than the total VRAM.
3. **Maximizing Quality**:
   - **Combined Memory**: For the highest possible quality, sum your system RAM and your GPU's VRAM, then choose a model that's 1-2GB smaller than this combined total.
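
As a quick, purely illustrative way to read those numbers on a Linux machine with an NVIDIA GPU (adapt the commands to your own platform):
```bash
# total and available system RAM
free -h

# total VRAM per GPU
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
```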

### Deciding Between 'I-Quant' and 'K-Quant'
1. **Simplicity**:
   - **K-Quant**: If you prefer a straightforward approach, select a K-quant model. These are labeled 'QX_K_X', such as Q5_K_M.
2. **Advanced Configuration**:
   - **Feature Chart**: For a more nuanced choice, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix).
   - **I-Quant Models**: Best suited for quantization levels below Q4 and for systems running cuBLAS (Nvidia) or rocBLAS (AMD). These are labeled 'IQX_X', such as IQ3_M, and offer better quality for their size.
   - **Compatibility Considerations**:
     - **I-Quant Models**: While usable on CPU and Apple Metal, they run slower than their K-quant counterparts, so you must weigh the speed cost against the quality gain.
     - **AMD Cards**: Verify whether you are using the rocBLAS build or the Vulkan build; I-quants are not compatible with Vulkan.
     - **Current Support**: At the time of writing, LM Studio offers a preview with ROCm support, and other inference engines provide specific ROCm builds.

By following these guidelines, you can make an informed decision on which file best suits your system and performance needs.

## Contact us 🌐

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.

- Website: https://neuralnet.solutions
- Email: info[at]neuralnet.solutions