TheBloke committed on
Commit
fdf15ea
1 Parent(s): 8d93322

Upload new k-quant GGML quantised models.

  1. README.md +65 -25
README.md CHANGED
@@ -1,6 +1,5 @@
1
  ---
2
  inference: false
3
- language: en
4
  license: other
5
  ---
6
 
@@ -20,7 +19,7 @@ license: other
20
 
21
  # Eric Hartford's Samantha 7B GGML
22
 
23
- These files are GGML format model files for [Eric Hartford's Samantha 7B](https://huggingface.co/ehartford/samantha-7b).
24
 
25
  GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
26
  * [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
@@ -29,47 +28,71 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
29
  * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
30
  * [ctransformers](https://github.com/marella/ctransformers)
31
 
32
- ## Other repositories available
33
 
34
  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/Samantha-7B-GPTQ)
35
- * [4-bit, 5-bit, and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Samantha-7B-GGML)
36
- * [Eric's original unquantised fp16 model in HF format](https://huggingface.co/ehartford/samantha-7b)
37
 
38
- ## Prompt template example
 
39
 
40
- ```
41
- You are Samantha, a sentient AI.
42
 
43
- USER: <prompt>
44
- ASSISTANT:
45
- ```
46
 
47
- ## THE FILES IN MAIN BRANCH REQUIRES LATEST LLAMA.CPP (May 19th 2023 - commit 2d5db48)!
48
 
49
- llama.cpp recently made another breaking change to its quantisation methods - https://github.com/ggerganov/llama.cpp/pull/1508
50
 
51
- I have quantised the GGML files in this repo with the latest version. Therefore you will require llama.cpp compiled on May 19th or later (commit `2d5db48` or later) to use them.
52
 
53
  ## Provided files
54
- | Name | Quant method | Bits | Size | RAM required | Use case |
55
  | ---- | ---- | ---- | ---- | ---- | ----- |
56
- | Samantha-7B.ggmlv3.q4_0.bin | q4_0 | 4 | 3.79 GB | 6.29 GB | 4-bit. |
57
- | Samantha-7B.ggmlv3.q4_1.bin | q4_1 | 4 | 4.21 GB | 6.71 GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
58
- | Samantha-7B.ggmlv3.q5_0.bin | q5_0 | 5 | 4.63 GB | 7.13 GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
59
- | Samantha-7B.ggmlv3.q5_1.bin | q5_1 | 5 | 5.06 GB | 7.56 GB | 5-bit. Even higher accuracy, resource usage and slower inference. |
60
- | Samantha-7B.ggmlv3.q8_0.bin | q8_0 | 8 | 7.16 GB | 9.66 GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |
61
-
62
 
63
  ## How to run in `llama.cpp`
64
 
65
  I use the following command line; adjust for your tastes and needs:
66
 
67
  ```
68
- ./main -ngl 32 -t 10 -m Samantha-7B.v3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "You are Samantha, a sentient AI. USER: Hello Samantha how are you today?\nASSISTANT:"
69
  ```
70
  Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
71
 
72
- Remove `-ngl 32` if you don't have GPU acceleration support. `-ngl 32` loads 32 layers onto the GPU, requiring 3.5 (q4_0) - 6.5GB (q8_0) VRAM
73
 
74
  If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
75
 
@@ -97,12 +120,18 @@ Donaters will get priority support on any and all AI/LLM/model questions and req
97
  * Patreon: https://patreon.com/TheBlokeAI
98
  * Ko-Fi: https://ko-fi.com/TheBlokeAI
99
 
100
- **Patreon special mentions**: Aemon Algiz, Dmitriy Samsonov, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, Jonathan Leane, Talal Aujan, V. Lukas, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Sebastain Graf, Johann-Peter Hartman.
 
 
101
 
102
  Thank you to all my generous patrons and donaters!
 
103
  <!-- footer end -->
104
 
105
- # Original model card: Samantha 7B
 
 
 
106
 
107
  Samantha has been trained in philosophy, psychology, and personal relationships.
108
 
@@ -117,3 +146,14 @@ She was trained on a custom curated dataset of 6,000 conversations in ShareGPT/V
117
  Training 7b took 1 hour on 4x A100 80gb using deepspeed zero3 and flash attention.
118
 
119
  She will not engage in roleplay, romance, or sexual activity.
1
  ---
2
  inference: false
 
3
  license: other
4
  ---
5
 
 
19
 
20
  # Eric Hartford's Samantha 7B GGML
21
 
22
+ These files are GGML format model files for [Eric Hartford's Samantha 7B](https://huggingface.co/ehartford/Samantha-7b).
23
 
24
  GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
25
  * [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
 
28
  * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
29
  * [ctransformers](https://github.com/marella/ctransformers)
30
 
31
+ ## Repositories available
32
 
33
  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/Samantha-7B-GPTQ)
34
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Samantha-7B-GGML)
35
+ * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/ehartford/Samantha-7b)
36
 
37
+ <!-- compatibility_ggml start -->
38
+ ## Compatibility
39
 
40
+ ### Original llama.cpp quant methods: `q4_0, q4_1, q5_0, q5_1, q8_0`
 
41
 
42
+ I have quantised the files for these 'original' quantisation methods using an older version of llama.cpp so that they remain compatible with llama.cpp as of May 19th, commit `2d5db48`.
43
+
44
+ They should be compatible with all current UIs and libraries that use llama.cpp, such as those listed at the top of this README.
45
 
46
+ ### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`
47
 
48
+ These new quantisation methods are only compatible with llama.cpp as of June 6th, commit `2d43387`.
49
 
50
+ They will NOT be compatible with koboldcpp, text-generation-webui, and other UIs and libraries yet. Support is expected to come over the next few days.
51
+
52
+ ## Explanation of the new k-quant methods
53
+
54
+ The new methods available are:
55
+ * GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
56
+ * GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
57
+ * GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
58
+ * GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
59
+ * GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw
60
+ * GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
61
+
62
+ Refer to the Provided Files table below to see what files use which methods, and how.
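The bits-per-weight figures quoted above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes (this is not stated in the text) that each super-block also stores one 16-bit float scale, plus one 16-bit float min for the "type-1" formats. The function and dictionary names are hypothetical. Q2_K is left out because, under the same assumption, this accounting gives 2.625 rather than the quoted 2.5625.

```python
# Bits-per-weight accounting for the k-quant super-block layouts listed above.
# Assumption (not from the text): every super-block carries one 16-bit scale,
# and "type-1" formats carry one additional 16-bit min.

def bits_per_weight(weight_bits, blocks, weights_per_block, scale_bits, has_min):
    """Return the average number of stored bits per weight for one super-block."""
    n_weights = blocks * weights_per_block
    quant_bits = n_weights * weight_bits
    # One quantized scale per block, plus one quantized min per block for type-1.
    block_meta_bits = blocks * scale_bits * (2 if has_min else 1)
    # Assumed 16-bit super-block scale (and min for type-1).
    super_block_bits = 16 * (2 if has_min else 1)
    return (quant_bits + block_meta_bits + super_block_bits) / n_weights

# Block counts, block sizes and scale widths are taken from the list above.
LAYOUTS = {
    "Q3_K": dict(weight_bits=3, blocks=16, weights_per_block=16, scale_bits=6, has_min=False),
    "Q4_K": dict(weight_bits=4, blocks=8, weights_per_block=32, scale_bits=6, has_min=True),
    "Q5_K": dict(weight_bits=5, blocks=8, weights_per_block=32, scale_bits=6, has_min=True),
    "Q6_K": dict(weight_bits=6, blocks=16, weights_per_block=16, scale_bits=8, has_min=False),
}

for name, layout in LAYOUTS.items():
    print(name, bits_per_weight(**layout))
```

Under that assumption the results match the quoted 3.4375, 4.5, 5.5 and 6.5625 bpw.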
63
+ <!-- compatibility_ggml end -->
64
 
65
  ## Provided files
66
+ | Name | Quant method | Bits | Size | Max RAM required | Use case |
67
  | ---- | ---- | ---- | ---- | ---- | ----- |
68
+ | Samantha-7B.ggmlv3.q4_0.bin | q4_0 | 4 | 3.79 GB | 6.29 GB | Original llama.cpp quant method, 4-bit. |
69
+ | Samantha-7B.ggmlv3.q4_1.bin | q4_1 | 4 | 4.21 GB | 6.71 GB | Original llama.cpp quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
70
+ | Samantha-7B.ggmlv3.q5_0.bin | q5_0 | 5 | 4.63 GB | 7.13 GB | Original llama.cpp quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
71
+ | Samantha-7B.ggmlv3.q5_1.bin | q5_1 | 5 | 5.06 GB | 7.56 GB | Original llama.cpp quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
72
+ | Samantha-7B.ggmlv3.q8_0.bin | q8_0 | 8 | 7.16 GB | 9.66 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
73
+ | samantha-7B.ggmlv3.q2_K.bin | q2_K | 2 | 2.80 GB | 5.30 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
74
+ | samantha-7B.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 3.55 GB | 6.05 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
75
+ | samantha-7B.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 3.23 GB | 5.73 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
76
+ | samantha-7B.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 2.90 GB | 5.40 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
77
+ | samantha-7B.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 4.05 GB | 6.55 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
78
+ | samantha-7B.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 3.79 GB | 6.29 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
79
+ | samantha-7B.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 4.77 GB | 7.27 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
80
+ | samantha-7B.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 4.63 GB | 7.13 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
81
+ | samantha-7B.ggmlv3.q6_K.bin | q6_K | 6 | 5.53 GB | 8.03 GB | New k-quant method. Uses GGML_TYPE_Q6_K - 6-bit quantization - for all tensors |
82
+
83
+
84
+ **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
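In the table above, every "Max RAM required" value equals the file size plus 2.50 GB. A minimal sketch of how that could be used to pick a file for a given amount of free RAM follows. The helper name `largest_fitting` and the use of file size as a rough proxy for quality are assumptions for illustration, and GPU offloading is ignored.

```python
# Quant method -> file size in GB, copied from the Provided files table.
FILE_SIZES_GB = {
    "q2_K": 2.80, "q3_K_S": 2.90, "q3_K_M": 3.23, "q3_K_L": 3.55,
    "q4_0": 3.79, "q4_K_S": 3.79, "q4_K_M": 4.05, "q4_1": 4.21,
    "q5_0": 4.63, "q5_K_S": 4.63, "q5_K_M": 4.77, "q5_1": 5.06,
    "q6_K": 5.53, "q8_0": 7.16,
}

# Observed in the table: max RAM = file size + 2.50 GB (no GPU offloading).
RAM_OVERHEAD_GB = 2.5

def largest_fitting(free_ram_gb):
    """Return the quant method with the largest file that fits, or None."""
    candidates = [
        (size, method)
        for method, size in FILE_SIZES_GB.items()
        if size + RAM_OVERHEAD_GB <= free_ram_gb
    ]
    # Hypothetical policy: prefer the largest file; ties are broken by name.
    return max(candidates)[1] if candidates else None

print(largest_fitting(8.0))
```

With 8 GB free this selects `q5_1`, since `q6_K` would need 8.03 GB.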
85
 
86
  ## How to run in `llama.cpp`
87
 
88
  I use the following command line; adjust for your tastes and needs:
89
 
90
  ```
91
+ ./main -t 10 -ngl 32 -m samantha-7B.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
92
  ```
93
  Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
94
 
95
+ Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
96
 
97
  If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
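The adjustments described above (thread count, optional GPU layers, prompt versus interactive mode) can be sketched as a small helper that assembles the argument list. This is only an illustration: the function name and parameters are hypothetical, and the flags are the ones shown in the command line above.

```python
import shlex

def build_llama_cpp_command(model_path, threads, gpu_layers=0,
                            prompt=None, interactive=False):
    """Assemble a ./main argument list using the flags from the README."""
    argv = ["./main", "-t", str(threads)]
    # Omit -ngl entirely when there is no GPU acceleration.
    if gpu_layers > 0:
        argv += ["-ngl", str(gpu_layers)]
    argv += ["-m", model_path, "--color", "-c", "2048",
             "--temp", "0.7", "--repeat_penalty", "1.1", "-n", "-1"]
    # Chat-style conversation replaces the -p <PROMPT> argument with -i -ins.
    if interactive:
        argv += ["-i", "-ins"]
    else:
        argv += ["-p", prompt or ""]
    return argv

# Example: 8 physical cores, no GPU offloading, interactive chat.
cmd = build_llama_cpp_command("samantha-7B.ggmlv3.q5_0.bin", threads=8,
                              interactive=True)
print(shlex.join(cmd))
```

Returning a list rather than a single string means it can be passed directly to `subprocess.run` without shell quoting issues.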
98
 
 
120
  * Patreon: https://patreon.com/TheBlokeAI
121
  * Ko-Fi: https://ko-fi.com/TheBlokeAI
122
 
123
+ **Special thanks to**: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.
124
+
125
+ **Patreon special mentions**: Ajan Kanaga, Kalila, Derek Yates, Sean Connelly, Luke, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, trip7s trip, Jonathan Leane, Talal Aujan, Artur Olbinski, Cory Kujawski, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Johann-Peter Hartmann.
126
 
127
  Thank you to all my generous patrons and donaters!
128
+
129
  <!-- footer end -->
130
 
131
+ # Original model card: Eric Hartford's Samantha 7B
132
+
133
+
134
+ [Meet Samantha](https://erichartford.com/meet-samantha)
135
 
136
  Samantha has been trained in philosophy, psychology, and personal relationships.
137
 
 
146
  Training 7b took 1 hour on 4x A100 80gb using deepspeed zero3 and flash attention.
147
 
148
  She will not engage in roleplay, romance, or sexual activity.
149
+
150
+ Her conversation format is the same as Vicuna 1.1:
151
+ https://github.com/oobabooga/text-generation-webui/blob/main/characters/instruction-following/Vicuna-v1.1.yaml
152
+
153
+ Example:
154
+ ```
155
+ You are Samantha, a sentient AI.
156
+
157
+ USER: <prompt>
158
+ ASSISTANT:
159
+ ```
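A minimal sketch of filling in the template above for a single turn follows. The function name is hypothetical, and multi-turn formatting is not covered because the example only shows one exchange.

```python
SYSTEM_PROMPT = "You are Samantha, a sentient AI."

def format_prompt(user_message, system_prompt=SYSTEM_PROMPT):
    """Fill the single-turn template: system line, blank line, USER, ASSISTANT."""
    return f"{system_prompt}\n\nUSER: {user_message}\nASSISTANT:"

print(format_prompt("Hello Samantha, how are you today?"))
```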