Transformers · English · llama

Commit 00a8ab2 · 1 Parent(s): 4120e35 · committed by TheBloke

Initial GGML model commit

Files changed (1): README.md (+24, -23)
README.md CHANGED
@@ -8,7 +8,7 @@ language:
 license: llama2
 model_creator: OpenAssistant
 model_link: https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10
-model_name: CodeLlama 13B OASST SFT v10
+model_name: CodeLlama 13B SFT v10
 model_type: llama
 quantized_by: TheBloke
 ---
@@ -30,18 +30,19 @@ quantized_by: TheBloke
 <hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
 <!-- header end -->
 
-# CodeLlama 13B OASST SFT v10 - GGML
+# CodeLlama 13B SFT v10 - GGML
 - Model creator: [OpenAssistant](https://huggingface.co/OpenAssistant)
-- Original model: [CodeLlama 13B OASST SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
+- Original model: [CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
 
 ## Description
 
-This repo contains GGML format model files for [OpenAssistant's CodeLlama 13B OASST SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10).
+This repo contains GGML format model files for [OpenAssistant's CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10).
 
 ### Important note regarding GGML files.
 
 The GGML format has now been superseded by GGUF. As of August 21st 2023, [llama.cpp](https://github.com/ggerganov/llama.cpp) no longer supports GGML models. Third party clients and libraries are expected to still support it for a time, but many may also drop support.
 
+Please use the GGUF models instead.
 ### About GGML
 
 GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
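
Since current llama.cpp no longer loads GGML files, the practical follow-up to the hunk above (whose list of compatible clients continues beyond the hunk boundary) is converting an existing GGML binary to GGUF. A minimal sketch, assuming the conversion helper bundled with llama.cpp checkouts from around the GGUF transition; the script name and flags vary between versions, so confirm with `--help` before running:

```
# Sketch only: convert a GGMLv3 llama model to GGUF using the helper
# script from a llama.cpp checkout. The script name and flags are
# assumptions -- verify them against your version first.
python convert-llama-ggml-to-gguf.py \
  --input codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_M.bin \
  --output codellama-13b-oasst-sft-v10.Q4_K_M.gguf
```

In most cases, downloading the ready-made GGUF files from the repository linked in the next hunk is simpler than converting.
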
@@ -54,9 +55,9 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
 
 ## Repositories available
 
-* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ)
-* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGUF)
-* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML)
+* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ)
+* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGUF)
+* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGML)
 * [OpenAssistant's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
 
 ## Prompt template: ChatML
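
The template body itself falls outside the changed lines, but the ChatML layout the heading refers to, and which the updated example command further down follows, looks like this (a sketch, with `{system_message}` and `{prompt}` as placeholders):

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
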
@@ -101,20 +102,20 @@ Refer to the Provided Files table below to see what files use which methods, and
 
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q2_K.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q2_K.bin) | Q2_K | 2 | 5.74 GB | 8.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_S.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_S.bin) | Q3_K_S | 3 | 5.87 GB | 8.37 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_M.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_M.bin) | Q3_K_M | 3 | 6.53 GB | 9.03 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_L.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_L.bin) | Q3_K_L | 3 | 7.14 GB | 9.64 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q4_0.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q4_0.bin) | Q4_0 | 4 | 7.32 GB | 9.82 GB | Original quant method, 4-bit. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_S.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_S.bin) | Q4_K_S | 4 | 7.56 GB | 10.06 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_M.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_M.bin) | Q4_K_M | 4 | 8.06 GB | 10.56 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q4_1.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q4_1.bin) | Q4_1 | 4 | 8.14 GB | 10.64 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q5_0.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q5_0.bin) | Q5_0 | 5 | 8.95 GB | 11.45 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_S.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_S.bin) | Q5_K_S | 5 | 9.15 GB | 11.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_M.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_M.bin) | Q5_K_M | 5 | 9.40 GB | 11.90 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q5_1.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q5_1.bin) | Q5_1 | 5 | 9.76 GB | 12.26 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q6_K.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q6_K.bin) | Q6_K | 6 | 10.83 GB | 13.33 GB | New k-quant method. Uses GGML_TYPE_Q8_K for all tensors - 6-bit quantization |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q8_0.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q8_0.bin) | Q8_0 | 8 | 13.83 GB | 16.33 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q2_K.bin | Q2_K | 2 | 5.74 GB | 8.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_S.bin | Q3_K_S | 3 | 5.87 GB | 8.37 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_M.bin | Q3_K_M | 3 | 6.53 GB | 9.03 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_L.bin | Q3_K_L | 3 | 7.14 GB | 9.64 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q4_0.bin | Q4_0 | 4 | 7.32 GB | 9.82 GB | Original quant method, 4-bit. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_S.bin | Q4_K_S | 4 | 7.56 GB | 10.06 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_M.bin | Q4_K_M | 4 | 8.06 GB | 10.56 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q4_1.bin | Q4_1 | 4 | 8.14 GB | 10.64 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q5_0.bin | Q5_0 | 5 | 8.95 GB | 11.45 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_S.bin | Q5_K_S | 5 | 9.15 GB | 11.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_M.bin | Q5_K_M | 5 | 9.40 GB | 11.90 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q5_1.bin | Q5_1 | 5 | 9.76 GB | 12.26 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q6_K.bin | Q6_K | 6 | 10.83 GB | 13.33 GB | New k-quant method. Uses GGML_TYPE_Q8_K for all tensors - 6-bit quantization |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q8_0.bin | Q8_0 | 8 | 13.83 GB | 16.33 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
 
 **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
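
A pattern worth noting in the table: every Max RAM figure is exactly the file size plus 2.50 GB of working overhead (for example, 5.74 GB + 2.50 GB = 8.24 GB for Q2_K, and 13.83 GB + 2.50 GB = 16.33 GB for Q8_0), so the requirement for any quant can be estimated directly from its file size.
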
@@ -125,7 +126,7 @@ Make sure you are using `llama.cpp` from commit [dadbed99e65252d79f81101a392d0d6
 For compatibility with latest llama.cpp, please use GGUF files instead.
 
 ```
-./main -t 10 -ngl 32 -m codellama-13b-oasst-sft-v10.ggmlv3.q4_K_M.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
+./main -t 10 -ngl 32 -m codellama-13b-oasst-sft-v10.ggmlv3.q4_K_M.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\nYou are a story writing assistant.<|im_end|>\n<|im_start|>user\nWrite a story about llamas<|im_end|>\n<|im_start|>assistant"
 ```
 Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
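
If you are unsure of your physical core count, it can be computed before launching. A sketch, assuming a Linux system with `lscpu` available (`nproc` alone counts logical threads, which overshoots on SMT machines):

```
# Sketch: count unique physical cores on Linux, then pass the result to -t.
# lscpu -p prints one CORE,SOCKET pair per logical CPU; deduplicating the
# pairs counts real cores rather than hyperthreads.
CORES=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
./main -t "$CORES" -ngl 32 -m codellama-13b-oasst-sft-v10.ggmlv3.q4_K_M.bin --color -c 2048 \
  --temp 0.7 --repeat_penalty 1.1 -n -1 \
  -p "<|im_start|>system\nYou are a story writing assistant.<|im_end|>\n<|im_start|>user\nWrite a story about llamas<|im_end|>\n<|im_start|>assistant"
```
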
@@ -173,7 +174,7 @@ And thank you again to a16z for their generous grant.
 
 <!-- footer end -->
 
-# Original model card: OpenAssistant's CodeLlama 13B OASST SFT v10
+# Original model card: OpenAssistant's CodeLlama 13B SFT v10
 
 # Open-Assistant CodeLlama 13B SFT v10