danielhanchen committed (verified)
Commit 3083b08 · 1 Parent(s): 90bbbcf

Update README.md

Files changed (1): README.md (+32 -10)
README.md CHANGED
@@ -17,12 +17,34 @@ tags:
  ## Instructions to run this model in llama.cpp:
  Or you can view more detailed instructions here: [unsloth.ai/blog/deepseekr1-dynamic](https://unsloth.ai/blog/deepseekr1-dynamic)
  1. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter
- 2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp
- 3. It's best to use `--min-p 0.05 or 0.1` to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.
- 4. Example with Q4_0 K quantized cache **Notice -no-cnv disables auto conversation mode**
  ```bash
  ./llama.cpp/llama-cli \
- --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --threads 12 -no-cnv --prio 2 \
  --temp 0.6 \
@@ -44,7 +66,7 @@ Or you can view more detailed instructions here: [unsloth.ai/blog/deepseekr1-dyn
  4. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
  ```bash
  ./llama.cpp/llama-cli \
- --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --threads 12 -no-cnv --prio 2 \
  --n-gpu-layers 7 \
@@ -56,16 +78,16 @@ Or you can view more detailed instructions here: [unsloth.ai/blog/deepseekr1-dyn
  5. If you want to merge the weights together, use this script:
  ```
  ./llama.cpp/llama-gguf-split --merge \
- DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  merged_file.gguf
  ```

  | MoE Bits | Type | Disk Size | Accuracy | Link | Details |
  | -------- | -------- | ------------ | ------------ | ---------------------| ---------- |
- | 1.58bit | IQ1_S | **131GB** | Fair | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S) | MoE all 1.56bit. `down_proj` in MoE mixture of 2.06/1.56bit |
- | 1.73bit | IQ1_M | **158GB** | Good | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_M) | MoE all 1.56bit. `down_proj` in MoE left at 2.06bit |
- | 2.22bit | IQ2_XXS | **183GB** | Better | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ2_XXS) | MoE all 2.06bit. `down_proj` in MoE mixture of 2.5/2.06bit |
- | 2.51bit | Q2_K_XL | **212GB** | Best | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-Q2_K_XL) | MoE all 2.5bit. `down_proj` in MoE mixture of 3.5/2.5bit |

  # Finetune LLMs 2-5x faster with 70% less memory via Unsloth!
  We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb
 
  ## Instructions to run this model in llama.cpp:
  Or you can view more detailed instructions here: [unsloth.ai/blog/deepseekr1-dynamic](https://unsloth.ai/blog/deepseekr1-dynamic)
  1. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter
+ 2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
+ ```bash
+ apt-get update
+ apt-get install build-essential cmake curl libcurl4-openssl-dev -y
+ git clone https://github.com/ggerganov/llama.cpp
+ cmake llama.cpp -B llama.cpp/build \
+ -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
+ cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli
+ cp llama.cpp/build/bin/llama-* llama.cpp
+ ```
+ 3. It's best to use `--min-p 0.05` to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.
+ 4. Download the model via:
+ ```python
+ # pip install huggingface_hub hf_transfer
+ # import os # Optional for faster downloading
+ # os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
+
+ from huggingface_hub import snapshot_download
+ snapshot_download(
+ repo_id = "unsloth/DeepSeek-R1-GGUF",
+ local_dir = "DeepSeek-R1-GGUF",
+ allow_patterns = ["*UD-IQ1_S*"], # Select quant type UD-IQ1_S for 1.58bit
+ )
+ ```
+ 6. Example with Q4_0 K quantized cache **Notice -no-cnv disables auto conversation mode**
  ```bash
  ./llama.cpp/llama-cli \
+ --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --threads 12 -no-cnv --prio 2 \
  --temp 0.6 \
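
# --- Hedged sketch (not part of the commit): the hunk above cuts the example
# --- command off at --temp 0.6. Once step 2 (build) and step 4 (download) have
# --- finished, a complete invocation could look like the lines below.
# --- --min-p 0.05 comes from step 3 and the <|User|>/<|Assistant|> tokens from
# --- step 1; the --ctx-size value and the prompt text are illustrative assumptions.
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --min-p 0.05 \
    --temp 0.6 \
    --ctx-size 8192 \
    --prompt "<|User|>Why is the sky blue?<|Assistant|>"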
 
  4. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
  ```bash
  ./llama.cpp/llama-cli \
+ --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --threads 12 -no-cnv --prio 2 \
  --n-gpu-layers 7 \
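
# --- Hedged sketch (not part of the commit): how many layers you can offload
# --- depends on free VRAM; the example above uses --n-gpu-layers 7 for a single
# --- 24GB card. Assuming an NVIDIA GPU with nvidia-smi installed, a quick check
# --- before picking a value:
nvidia-smi --query-gpu=name,memory.free --format=csv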
 
  5. If you want to merge the weights together, use this script:
  ```
  ./llama.cpp/llama-gguf-split --merge \
+ DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  merged_file.gguf
  ```
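
Merging is optional: reasonably recent llama.cpp builds can usually load a split GGUF when given just the first shard, so this step mainly helps if you want a single file to move around. A small follow-up sketch, assuming the `merged_file.gguf` output name from the snippet above and the sampling flags used earlier:

```bash
# Check the merged output, then point llama-cli at the single file instead of
# the first shard used in the commands above.
ls -lh merged_file.gguf
./llama.cpp/llama-cli --model merged_file.gguf \
    --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 \
    --temp 0.6 --min-p 0.05 \
    --prompt "<|User|>Why is the sky blue?<|Assistant|>"
```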

  | MoE Bits | Type | Disk Size | Accuracy | Link | Details |
  | -------- | -------- | ------------ | ------------ | ---------------------| ---------- |
+ | 1.58bit | UD-IQ1_S | **131GB** | Fair | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S) | MoE all 1.56bit. `down_proj` in MoE mixture of 2.06/1.56bit |
+ | 1.73bit | UD-IQ1_M | **158GB** | Good | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_M) | MoE all 1.56bit. `down_proj` in MoE left at 2.06bit |
+ | 2.22bit | UD-IQ2_XXS | **183GB** | Better | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ2_XXS) | MoE all 2.06bit. `down_proj` in MoE mixture of 2.5/2.06bit |
+ | 2.51bit | UD-Q2_K_XL | **212GB** | Best | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-Q2_K_XL) | MoE all 2.5bit. `down_proj` in MoE mixture of 3.5/2.5bit |
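
If disk space or accuracy needs point you at a different row of the table, only the download pattern changes. A hedged sketch using the Hugging Face CLI instead of the Python snippet from step 4 (assumes `huggingface_hub` is installed and that the folder names match the Link column):

```bash
# Fetch the 2.22bit UD-IQ2_XXS variant instead of the 1.58bit one; swap the
# pattern for *UD-IQ1_M* or *UD-Q2_K_XL* to pick another row of the table.
pip install huggingface_hub
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "*UD-IQ2_XXS*" \
    --local-dir DeepSeek-R1-GGUF
```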

  # Finetune LLMs 2-5x faster with 70% less memory via Unsloth!
  We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb