Transformers · English · llama

Commit 00a8ab2 · 1 Parent(s): 4120e35 · committed by TheBloke

Initial GGML model commit

Files changed (1): README.md (+24, -23)
README.md CHANGED
@@ -8,7 +8,7 @@ language:
 license: llama2
 model_creator: OpenAssistant
 model_link: https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10
-model_name: CodeLlama 13B OASST SFT v10
+model_name: CodeLlama 13B SFT v10
 model_type: llama
 quantized_by: TheBloke
 ---
@@ -30,18 +30,19 @@ quantized_by: TheBloke
 <hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
 <!-- header end -->
 
-# CodeLlama 13B OASST SFT v10 - GGML
+# CodeLlama 13B SFT v10 - GGML
 - Model creator: [OpenAssistant](https://huggingface.co/OpenAssistant)
-- Original model: [CodeLlama 13B OASST SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
+- Original model: [CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
 
 ## Description
 
-This repo contains GGML format model files for [OpenAssistant's CodeLlama 13B OASST SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10).
+This repo contains GGML format model files for [OpenAssistant's CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10).
 
 ### Important note regarding GGML files.
 
 The GGML format has now been superseded by GGUF. As of August 21st 2023, [llama.cpp](https://github.com/ggerganov/llama.cpp) no longer supports GGML models. Third party clients and libraries are expected to still support it for a time, but many may also drop support.
 
+Please use the GGUF models instead.
 ### About GGML
 
 GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
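
Since current llama.cpp no longer loads GGML files, the practical follow-up to the hunk above (whose list of compatible clients continues beyond the hunk boundary) is converting an existing GGML binary to GGUF. A minimal sketch, assuming the conversion helper bundled with llama.cpp checkouts from around the GGUF transition; the script name and flags vary between versions, so confirm with `--help` before running:

```
# Sketch only: convert a GGMLv3 llama model to GGUF using the helper
# script from a llama.cpp checkout. The script name and flags are
# assumptions -- verify them against your version first.
python convert-llama-ggml-to-gguf.py \
  --input codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_M.bin \
  --output codellama-13b-oasst-sft-v10.Q4_K_M.gguf
```

In most cases, downloading the ready-made GGUF files from the repository linked in the next hunk is simpler than converting.
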
@@ -54,9 +55,9 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
 
 ## Repositories available
 
-* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ)
-* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGUF)
-* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML)
+* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ)
+* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGUF)
+* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGML)
 * [OpenAssistant's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
 
 ## Prompt template: ChatML
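
The template body itself falls outside the changed lines, but the ChatML layout the heading refers to, and which the updated example command further down follows, looks like this (a sketch, with `{system_message}` and `{prompt}` as placeholders):

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
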
@@ -101,20 +102,20 @@ Refer to the Provided Files table below to see what files use which methods, and
 
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q2_K.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q2_K.bin) | Q2_K | 2 | 5.74 GB | 8.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_S.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_S.bin) | Q3_K_S | 3 | 5.87 GB | 8.37 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_M.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_M.bin) | Q3_K_M | 3 | 6.53 GB | 9.03 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_L.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_L.bin) | Q3_K_L | 3 | 7.14 GB | 9.64 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q4_0.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q4_0.bin) | Q4_0 | 4 | 7.32 GB | 9.82 GB | Original quant method, 4-bit. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_S.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_S.bin) | Q4_K_S | 4 | 7.56 GB | 10.06 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_M.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_M.bin) | Q4_K_M | 4 | 8.06 GB | 10.56 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q4_1.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q4_1.bin) | Q4_1 | 4 | 8.14 GB | 10.64 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q5_0.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q5_0.bin) | Q5_0 | 5 | 8.95 GB | 11.45 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_S.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_S.bin) | Q5_K_S | 5 | 9.15 GB | 11.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_M.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_M.bin) | Q5_K_M | 5 | 9.40 GB | 11.90 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q5_1.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q5_1.bin) | Q5_1 | 5 | 9.76 GB | 12.26 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q6_K.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q6_K.bin) | Q6_K | 6 | 10.83 GB | 13.33 GB | New k-quant method. Uses GGML_TYPE_Q8_K for all tensors - 6-bit quantization |
-| [codellama-13b-oasst-sft-v10.ggmlv3.Q8_0.bin](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML/blob/main/codellama-13b-oasst-sft-v10.ggmlv3.Q8_0.bin) | Q8_0 | 8 | 13.83 GB | 16.33 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q2_K.bin | Q2_K | 2 | 5.74 GB | 8.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_S.bin | Q3_K_S | 3 | 5.87 GB | 8.37 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_M.bin | Q3_K_M | 3 | 6.53 GB | 9.03 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q3_K_L.bin | Q3_K_L | 3 | 7.14 GB | 9.64 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q4_0.bin | Q4_0 | 4 | 7.32 GB | 9.82 GB | Original quant method, 4-bit. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_S.bin | Q4_K_S | 4 | 7.56 GB | 10.06 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q4_K_M.bin | Q4_K_M | 4 | 8.06 GB | 10.56 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q4_1.bin | Q4_1 | 4 | 8.14 GB | 10.64 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q5_0.bin | Q5_0 | 5 | 8.95 GB | 11.45 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_S.bin | Q5_K_S | 5 | 9.15 GB | 11.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q5_K_M.bin | Q5_K_M | 5 | 9.40 GB | 11.90 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q5_1.bin | Q5_1 | 5 | 9.76 GB | 12.26 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q6_K.bin | Q6_K | 6 | 10.83 GB | 13.33 GB | New k-quant method. Uses GGML_TYPE_Q8_K for all tensors - 6-bit quantization |
+| codellama-13b-oasst-sft-v10.ggmlv3.Q8_0.bin | Q8_0 | 8 | 13.83 GB | 16.33 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
 
 **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
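
A pattern worth noting in the table: every Max RAM figure is exactly the file size plus 2.50 GB of working overhead (for example, 5.74 GB + 2.50 GB = 8.24 GB for Q2_K, and 13.83 GB + 2.50 GB = 16.33 GB for Q8_0), so the requirement for any quant can be estimated directly from its file size.
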
@@ -125,7 +126,7 @@ Make sure you are using `llama.cpp` from commit [dadbed99e65252d79f81101a392d0d6
 For compatibility with latest llama.cpp, please use GGUF files instead.
 
 ```
-./main -t 10 -ngl 32 -m codellama-13b-oasst-sft-v10.ggmlv3.q4_K_M.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
+./main -t 10 -ngl 32 -m codellama-13b-oasst-sft-v10.ggmlv3.q4_K_M.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\nYou are a story writing assistant.<|im_end|>\n<|im_start|>user\nWrite a story about llamas<|im_end|>\n<|im_start|>assistant"
 ```
 Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
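
If you are unsure of your physical core count, it can be computed before launching. A sketch, assuming a Linux system with `lscpu` available (`nproc` alone counts logical threads, which overshoots on SMT machines):

```
# Sketch: count unique physical cores on Linux, then pass the result to -t.
# lscpu -p prints one CORE,SOCKET pair per logical CPU; deduplicating the
# pairs counts real cores rather than hyperthreads.
CORES=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
./main -t "$CORES" -ngl 32 -m codellama-13b-oasst-sft-v10.ggmlv3.q4_K_M.bin --color -c 2048 \
  --temp 0.7 --repeat_penalty 1.1 -n -1 \
  -p "<|im_start|>system\nYou are a story writing assistant.<|im_end|>\n<|im_start|>user\nWrite a story about llamas<|im_end|>\n<|im_start|>assistant"
```
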
@@ -173,7 +174,7 @@ And thank you again to a16z for their generous grant.
 
 <!-- footer end -->
 
-# Original model card: OpenAssistant's CodeLlama 13B OASST SFT v10
+# Original model card: OpenAssistant's CodeLlama 13B SFT v10
 
 # Open-Assistant CodeLlama 13B SFT v10