|
--- |
|
license: llama3 |
|
base_model: openchat/openchat-3.6-8b-20240522 |
|
tags: |
|
- openchat |
|
- llama3 |
|
- C-RLFT |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
quantized_by: NeuralNet-Hub |
|
--- |
|
|
|
<div align="center"> |
|
<a href="http://neuralnet.solutions" target="_blank"> |
|
<img width="450" src="https://raw.githubusercontent.com/NeuralNet-Hub/assets/main/logo/LOGO_png_orig.png"> |
|
</a> |
|
</div> |
|
|
|
NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.
|
|
|
|
|
## OpenChat-3.6-8b-20240522 llama.cpp quantization by NeuralNet
|
|
|
All models have been quantized following the instructions provided by [`llama.cpp`](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize), namely:
|
``` |
|
# obtain the official LLaMA model weights and place them in ./models |
|
ls ./models |
|
llama-2-7b tokenizer_checklist.chk tokenizer.model |
|
# [Optional] for models using BPE tokenizers |
|
ls ./models |
|
<folder containing weights and tokenizer json> vocab.json |
|
# [Optional] for PyTorch .bin models like Mistral-7B |
|
ls ./models |
|
<folder containing weights and tokenizer json> |
|
|
|
# install Python dependencies |
|
python3 -m pip install -r requirements.txt |
|
|
|
# convert the model to ggml FP16 format |
|
python3 convert-hf-to-gguf.py models/mymodel/ |
|
|
|
# quantize the model to 4-bits (using Q4_K_M method) |
|
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M |
|
|
|
# update the gguf filetype to current version if older version is now unsupported |
|
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY |
|
``` |
|
|
|
Original model: https://huggingface.co/openchat/openchat-3.6-8b-20240522 |
|
|
|
## Prompt format
|
|
|
### Original Format: |
|
``` |
|
<|begin_of_text|><|start_header_id|>System<|end_header_id|> |
|
|
|
{system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|> |
|
|
|
{user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|> |
|
``` |
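
For illustration only (a hypothetical helper, not part of the released files), the raw prompt above can be assembled in Python like this:

```python
def build_openchat_prompt(system: str, user: str) -> str:
    """Assemble a raw OpenChat-3.6 prompt string following the format above."""
    return (
        "<|begin_of_text|><|start_header_id|>System<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>GPT4 Correct User<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>\n\n"
    )

print(build_openchat_prompt("You are a helpful assistant.", "Hello!"))
```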
|
|
|
### Ollama Template: |
|
``` |
|
{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|> |
|
|
|
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|> |
|
|
|
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|> |
|
|
|
{{ .Response }}<|eot_id|> |
|
``` |
|
|
|
## Model summary
|
|
|
| Filename | Quant type | File Size | Description | |
|
| -------- | ---------- | --------- | ----------- | |
|
| [openchat-3.6-8b-20240522-fp16.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-fp16.gguf) | fp16 | 16.06GB | Half precision, no quantization applied | |
|
| [openchat-3.6-8b-20240522-q8_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q8_0.gguf) | q8_0 | 8.54GB | Extremely high quality, generally unneeded but max available quant. | |
|
| [openchat-3.6-8b-20240522-q6_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q6_K.gguf) | q6_K | 6.59GB | Very high quality, near perfect, *recommended*. | |
|
| [openchat-3.6-8b-20240522-q5_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_1.gguf) | q5_1 | 6.06GB | High quality, *recommended*. | |
|
| [openchat-3.6-8b-20240522-q5_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_M.gguf) | q5_K_M | 5.73GB | High quality, *recommended*. | |
|
| [openchat-3.6-8b-20240522-q5_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_K_S.gguf) | q5_K_S | 5.59GB | High quality, *recommended*. | |
|
| [openchat-3.6-8b-20240522-q5_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q5_0.gguf) | q5_0 | 5.59GB | High quality, *recommended*. |
|
| [openchat-3.6-8b-20240522-q4_1.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_1.gguf) | q4_1 | 4.92GB | Good quality, *recommended*. |
|
| [openchat-3.6-8b-20240522-q4_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_M.gguf) | q4_K_M | 4.92GB | Good quality, uses about 4.83 bits per weight, *recommended*. | |
|
| [openchat-3.6-8b-20240522-q4_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_K_S.gguf) | q4_K_S | 4.69GB | Slightly lower quality with more space savings, *recommended*. | |
|
| [openchat-3.6-8b-20240522-q4_0.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q4_0.gguf) | q4_0 | 4.66GB | Slightly lower quality with more space savings, *recommended*. | |
|
| [openchat-3.6-8b-20240522-q3_K_L.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_L.gguf) | q3_K_L | 4.32GB | Lower quality but usable, good for low RAM availability. | |
|
| [openchat-3.6-8b-20240522-q3_K_M.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_M.gguf) | q3_K_M | 4.01GB | Even lower quality. | |
|
| [openchat-3.6-8b-20240522-q3_K_S.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q3_K_S.gguf) | q3_K_S | 3.66GB | Low quality, not recommended. | |
|
| [openchat-3.6-8b-20240522-q2_K.gguf](https://huggingface.co/NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-q2_K.gguf) | q2_K | 3.17GB | Very low quality but surprisingly usable. | |
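
If you prefer to call one of these GGUF files directly from Python instead of Ollama, a minimal sketch using the `llama-cpp-python` bindings (an assumption; this package is not part of this repository and is installed with `pip install llama-cpp-python`) could look like this:

```python
from llama_cpp import Llama

# Load a downloaded quant; adjust the path, context size and GPU offload to your setup.
llm = Llama(
    model_path="./openchat-3.6-8b-20240522-q4_K_M.gguf",
    n_ctx=8192,       # context window
    n_gpu_layers=-1,  # offload all layers to the GPU when VRAM allows
)

# Raw prompt in the format described above (single user turn, no system prompt).
prompt = (
    "<|begin_of_text|><|start_header_id|>GPT4 Correct User<|end_header_id|>\n\n"
    "Hello!<|eot_id|><|start_header_id|>GPT4 Correct Assistant<|end_header_id|>\n\n"
)
output = llm(prompt, max_tokens=256, stop=["<|eot_id|>"])
print(output["choices"][0]["text"])
```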
|
|
|
## Usage with Ollama
|
|
|
### Direct from Ollama |
|
``` |
|
ollama run NeuralNet/openchat-3.6-8b-20240522 |
|
``` |
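
With the model served by Ollama, you can also query it programmatically through Ollama's local HTTP API (by default on port 11434); a minimal sketch using the `requests` library:

```python
import requests

# Send a single chat turn to the locally running Ollama server.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "NeuralNet/openchat-3.6-8b-20240522",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
    },
)
print(response.json()["message"]["content"])
```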
|
|
|
### Create your own template |
|
Create a plain text file named `Modelfile` (no extension needed):
|
``` |
|
FROM NeuralNet/openchat-3.6 |
|
|
|
# sets the temperature to 0.5 [higher is more creative, lower is more coherent]
|
PARAMETER temperature 0.5 |
|
|
|
# sets the context window size to 8192; this controls how many tokens the LLM can use as context to generate the next token
|
PARAMETER num_ctx 8192 |
|
|
|
# sets the maximum number of tokens to generate to 4096
|
PARAMETER num_predict 4096 |
|
|
|
# set system |
|
SYSTEM "You are an AI assistant created by NeuralNet, your answer are clear and consice" |
|
|
|
# template OpenChat3.6 |
|
TEMPLATE "{{ if .System }}<|begin_of_text|><|start_header_id|>System<|end_header_id|> |
|
|
|
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>GPT4 Correct User<|end_header_id|> |
|
|
|
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>GPT4 Correct Assistant<|end_header_id|> |
|
|
|
{{ .Response }}<|eot_id|>" |
|
``` |
|
Then, with Ollama already installed, run:
|
``` |
|
ollama create openchat-3.6-8b-20240522 -f Modelfile
|
``` |
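
Once created, you can chat with your customized model via `ollama run openchat-3.6-8b-20240522`.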
|
|
|
|
|
## Download Models Using huggingface-cli
|
|
|
### Installation of `huggingface_hub[cli]` |
|
Ensure you have the necessary CLI tool installed by running: |
|
```bash |
|
pip install -U "huggingface_hub[cli]" |
|
``` |
|
|
|
### Downloading Specific Model Files |
|
To download a specific model file, use the following command: |
|
```bash |
|
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-q4_K_M.gguf" --local-dir ./
|
``` |
|
This command downloads the specified model file and places it in the current directory (`./`).
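
The same download can also be done from Python with the `huggingface_hub` library installed above; a minimal sketch:

```python
from huggingface_hub import hf_hub_download

# Download a single quant file into the current directory.
path = hf_hub_download(
    repo_id="NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF",
    filename="openchat-3.6-8b-20240522-q4_K_M.gguf",
    local_dir="./",
)
print(path)
```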
|
|
|
### Downloading Large Models Split into Multiple Files |
|
For models exceeding 50GB, which are typically split into multiple files for easier download and management: |
|
```bash |
|
huggingface-cli download NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF --include "openchat-3.6-8b-20240522-q8_0.gguf/*" --local-dir openchat-3.6-8b-20240522-q8_0
|
``` |
|
This command downloads all files matching the pattern and places them in the chosen local folder (`openchat-3.6-8b-20240522-q8_0`). You can download everything in place or specify a different location for the downloaded files.
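
The Python equivalent for pattern-based downloads is `snapshot_download` from the same `huggingface_hub` library; a minimal sketch:

```python
from huggingface_hub import snapshot_download

# Fetch every file matching the pattern into a local folder.
snapshot_download(
    repo_id="NeuralNet-Hub/openchat-3.6-8b-20240522-GGUF",
    allow_patterns=["openchat-3.6-8b-20240522-q8_0.gguf*"],
    local_dir="openchat-3.6-8b-20240522-q8_0",
)
```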
|
|
|
## Which File Should I Choose?
|
|
|
A comprehensive analysis with performance charts is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9). |
|
|
|
### Assessing System Capabilities |
|
1. **Determine Your Model Size**: Start by checking the amount of RAM and VRAM available in your system. This will help you decide the largest possible model you can run. |
|
2. **Optimizing for Speed**: |
|
- **GPU Utilization**: To run your model as quickly as possible, aim to fit the entire model into your GPU's VRAM. Pick a version that's 1-2GB smaller than the total VRAM.
|
3. **Maximizing Quality**: |
|
- **Combined Memory**: For the highest possible quality, sum your system RAM and GPU's VRAM. Then choose a model that's 1-2GB smaller than this combined total (a quick worked example follows below).
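
As a rough, purely illustrative sketch of that arithmetic (file sizes taken from the summary table above; the 1.5 GB headroom is an assumption within the 1-2GB guideline):

```python
# File sizes in GB, copied from the summary table above.
QUANT_SIZES_GB = {
    "fp16": 16.06, "q8_0": 8.54, "q6_K": 6.59, "q5_1": 6.06,
    "q5_K_M": 5.73, "q5_K_S": 5.59, "q5_0": 5.59, "q4_1": 4.92,
    "q4_K_M": 4.92, "q4_K_S": 4.69, "q4_0": 4.66, "q3_K_L": 4.32,
    "q3_K_M": 4.01, "q3_K_S": 3.66, "q2_K": 3.17,
}

def largest_fitting_quant(memory_gb: float, headroom_gb: float = 1.5) -> str:
    """Return the largest quant that still leaves the requested headroom."""
    fitting = {name: size for name, size in QUANT_SIZES_GB.items()
               if size <= memory_gb - headroom_gb}
    return max(fitting, key=fitting.get) if fitting else "nothing fits"

print(largest_fitting_quant(8.0))   # 8 GB of VRAM  -> q5_1
print(largest_fitting_quant(24.0))  # 24 GB RAM+VRAM -> fp16
```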
|
|
|
### Deciding Between 'I-Quant' and 'K-Quant' |
|
1. **Simplicity**: |
|
- **K-Quant**: If you prefer a straightforward approach, select a K-quant model. These are labeled as 'QX_K_X', such as Q5_K_M. |
|
2. **Advanced Configuration**: |
|
- **Feature Chart**: For a more nuanced choice, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix). |
|
- **I-Quant Models**: Best suited for configurations below Q4 and for systems running cuBLAS (Nvidia) or rocBLAS (AMD). These are labeled 'IQX_X', such as IQ3_M, and offer better performance for their size. |
|
- **Compatibility Considerations**: |
|
- **I-Quant Models**: While usable on CPU and Apple Metal, they run more slowly than their K-quant counterparts, so you are trading speed for quality.
|
- **AMD Cards**: Verify if you are using the rocBLAS build or the Vulkan build. I-quants are not compatible with Vulkan. |
|
- **Current Support**: At the time of writing, LM Studio offers a preview with ROCm support, and other inference engines provide specific ROCm builds. |
|
|
|
By following these guidelines, you can make an informed decision on which file best suits your system and performance needs. |
|
|
|
## Contact us
|
|
|
NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.
|
|
|
Website: https://neuralnet.solutions |
|
Email: info[at]neuralnet.solutions |
|
|
|
|