Commit 4326e39 · 1 Parent(s): 375c316
quantised q5_k_m GGUFv2 file model
README.md CHANGED
@@ -68,11 +68,11 @@ They are also compatible with many third party UIs and libraries - please see th
 GGML_TYPE_Q5_K - "type-1" 5-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 5.5 bpw.
 
 ## Models
-
-
-
-
-
+| Name | Quant method | Bits | Size | Max RAM required | Use case |
+| ---- | ---- | ---- | ---- | ---- | ----- |
+| [ggml-model-q5km.gguf](https://huggingface.co/kalpsnuti/llama-213-chat-gguf/blob/main/ggml-model-q5km.gguf) | Q5_K_M | 5 | 8.6 GB| 11.73 GB | large, very low quality loss|
+
+**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
 
 ## Downloading the GGUF file(s)
 ### using manual `download`
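As a quick sanity check on the size column added above, the 5.5 bits-per-weight figure quoted for GGML_TYPE_Q5_K can be turned into a rough estimate. The sketch below assumes a ~13B-parameter Llama 2 model; the real mix of tensor types in a Q5_K_M file shifts the exact number:

```shell
# back-of-the-envelope estimate only: ~13e9 weights at ~5.5 bits each
awk 'BEGIN { printf "%.1f GiB\n", 13e9 * 5.5 / 8 / (1024 ^ 3) }'   # ~8.3 GiB, same ballpark as the 8.6 GB listed
```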
@@ -99,7 +99,7 @@ huggingface-cli download kalpsnuti/llama-213-chat-gguf ggml-model-q5km.gguf --lo
 ```
 [*huggingface.co/docs => Hub Python Library => HOW-TO GUIDES => Download files*](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli) has full documentation on downloading with `huggingface-cli`.
 ```shell
-#
+#downloads on fast connections (1Gbit/s or higher)
 pip3 install hf_transfer
 ```
 ##### ...first set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:
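As a side note, the accelerated download this hunk documents can be run as one short sequence; a minimal sketch, assuming the repo id and filename shown in the hunk header, with `--local-dir .` as a purely illustrative destination:

```shell
# optional accelerator for fast (~1 Gbit/s or higher) connections
pip3 install hf_transfer

# enable the accelerator, then fetch only the quantised GGUF file
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download kalpsnuti/llama-213-chat-gguf ggml-model-q5km.gguf --local-dir .
```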
@@ -117,12 +117,12 @@ Clone and cd to the [llama.cpp](https://github.com/ggerganov/llama.cpp/commit/24
 ```
 ##### first run screenshot...
 
-
-
-
-
-
-
+**Options - set as appropriate**
+`-ngl 32` indicates `32` layers to offload to GPU. Remove if GPU acceleration is not available.
+`-c 4096` indicates `4k` context length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically.
+`-p <PROMPT>` indicates the *conversation style*, change to `-i` *or* `--interactive` to interact by giving `<PROMPT>` in chat style.
+
+*The [llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md) has detailed information on the ***above & other*** model running parameters.*
 
 ## Thanks
 Thanks **TheBlokeAI** team for inspirations!
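To tie the flag descriptions in this hunk together, here is a minimal sketch of a run; it assumes llama.cpp's example `main` binary has already been built and the GGUF file sits in the working directory, and the prompt is purely illustrative:

```shell
# -ngl 32 : offload 32 layers to the GPU (omit without GPU acceleration)
# -c 4096 : 4k context window
# -p ...  : one-shot prompt; swap for -i / --interactive for chat-style use
./main -m ./ggml-model-q5km.gguf -ngl 32 -c 4096 \
  -p "Building a website can be done in 10 simple steps:"
```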