shabdsnuti committed on
Commit 4326e39 · 1 Parent(s): 375c316

quantised q5_k_m GGUFv2 file model

Files changed (1)
  1. README.md +12 -12
README.md CHANGED
@@ -68,11 +68,11 @@ They are also compatible with many third party UIs and libraries - please see th
  GGML_TYPE_Q5_K - "type-1" 5-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 5.5 bpw.
 
  ## Models
- >| Name | Quant method | Bits | Size | Max RAM required | Use case |
- >| ---- | ---- | ---- | ---- | ---- | ----- |
- >| [ggml-model-q5km.gguf](https://huggingface.co/kalpsnuti/llama-213-chat-gguf/blob/main/ggml-model-q5km.gguf) | Q5_K_M | 5 | 8.6 GB| 11.73 GB | large, very low quality loss|
- >
- >**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
+ | Name | Quant method | Bits | Size | Max RAM required | Use case |
+ | ---- | ---- | ---- | ---- | ---- | ----- |
+ | [ggml-model-q5km.gguf](https://huggingface.co/kalpsnuti/llama-213-chat-gguf/blob/main/ggml-model-q5km.gguf) | Q5_K_M | 5 | 8.6 GB| 11.73 GB | large, very low quality loss|
+
+ **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
 
  ## Downloading the GGUF file(s)
  ### using manual `download`
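
As a rough sanity check of the size column in the hunk above (an estimate only, assuming Llama 2 13B's roughly 13 × 10⁹ weights at the nominal 5.5 bpw across all tensors, which is a simplification since Q5_K_M keeps some tensors at higher precision):

`13e9 weights × 5.5 bits/weight ÷ 8 bits/byte ≈ 8.9e9 bytes ≈ 8.3 GiB`

This lands in the same range as the listed 8.6 GB; the exact figure depends on the per-tensor type mix and on whether the size is reported in GB or GiB.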
@@ -99,7 +99,7 @@ huggingface-cli download kalpsnuti/llama-213-chat-gguf ggml-model-q5km.gguf --lo
  ```
  [*huggingface.co/docs => Hub Python Library => HOW-TO GUIDES => Download files*](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli) has full documentation on downloading with `huggingface-cli`.
  ```shell
- # downloads on fast connections (1Gbit/s or higher)
+ #downloads on fast connections (1Gbit/s or higher)
  pip3 install hf_transfer
  ```
  ##### ...first set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:
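
To tie the pieces in this hunk together, here is a minimal sketch of the accelerated download; it assumes a recent `huggingface_hub` CLI (with the `--local-dir` flag) and that `hf_transfer` has been installed as shown above:

```shell
# Turn on the hf_transfer backend for this shell session
export HF_HUB_ENABLE_HF_TRANSFER=1
# Fetch only the Q5_K_M GGUF file into the current directory
huggingface-cli download kalpsnuti/llama-213-chat-gguf ggml-model-q5km.gguf --local-dir .
```

On slower connections the same `huggingface-cli download` call works without the environment variable, just without the speed-up.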
@@ -117,12 +117,12 @@ Clone and cd to the [llama.cpp](https://github.com/ggerganov/llama.cpp/commit/24
  ```
  ##### first run screenshot...
  ![How are you today?](first_run.png "Ragini first words")
- > **Options - set as appropriate**
- > `-ngl 32` indicates `32` layers to offload to GPU. Remove if GPU acceleration is not available.
- > `-c 4096` indicates `4k` context length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically.
- > `-p <PROMPT>` indicates the *conversation style*, change to `-i` *or* `--interactive` to interact by giving `<PROMPT>` in chat style.
- >
- > *The [llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md) has detailed information on the ***above & other*** model running parameters.*
+ **Options - set as appropriate**
+ `-ngl 32` indicates `32` layers to offload to GPU. Remove if GPU acceleration is not available.
+ `-c 4096` indicates `4k` context length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically.
+ `-p <PROMPT>` indicates the *conversation style*, change to `-i` *or* `--interactive` to interact by giving `<PROMPT>` in chat style.
+
+ *The [llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md) has detailed information on the ***above & other*** model running parameters.*
 
  ## Thanks
  Thanks **TheBlokeAI** team for inspirations!
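
For reference, a hedged example of how the options discussed in the hunk above combine on the command line; the binary name `./main` and the model path are assumptions, since the actual run command is outside this diff:

```shell
# One-shot prompt, 32 layers offloaded to the GPU, 4k context
./main -m ggml-model-q5km.gguf -ngl 32 -c 4096 -p "How are you today?"

# Chat-style session instead: drop -p <PROMPT> and use interactive mode
./main -m ggml-model-q5km.gguf -ngl 32 -c 4096 -i
```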
 