danielhanchen
committed on
Update README.md
README.md CHANGED
## Instructions to run this model in llama.cpp:

Or you can view more detailed instructions here: [unsloth.ai/blog/deepseekr1-dynamic](https://unsloth.ai/blog/deepseekr1-dynamic)

1. Do not forget about the `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter (a short prompt sketch follows the build commands below).
2. Obtain the latest `llama.cpp` from https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli
cp llama.cpp/build/bin/llama-* llama.cpp
```
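For step 1, here is a minimal sketch of how a manually formatted prompt could look. Only the `<|User|>` and `<|Assistant|>` tokens come from the note above - verify the exact layout against the model's own chat template before relying on it.
```python
# Minimal prompt-formatting sketch (check the model's actual chat template).
def format_prompt(user_message: str) -> str:
    # <|User|> opens the user turn; <|Assistant|> cues the model to answer.
    return f"<|User|>{user_message}<|Assistant|>"

print(format_prompt("What is 1+1?"))
# -> <|User|>What is 1+1?<|Assistant|>
```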
3. It's best to use `--min-p 0.05` to counteract very rare token predictions - I found this to work well especially for the 1.58bit model (a tiny illustration of what min-p does follows the run example below).
4. Download the model via:
```python
# pip install huggingface_hub hf_transfer
# import os # Optional for faster downloading
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/DeepSeek-R1-GGUF",
    local_dir = "DeepSeek-R1-GGUF",
    allow_patterns = ["*UD-IQ1_S*"], # Select quant type UD-IQ1_S for 1.58bit
)
```
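Once the download finishes, the shards should sit under `DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/`. Here is a small sketch (assuming the `local_dir` and quant chosen above) to confirm they are there and to print the first shard's path, which is what `--model` expects below:
```python
from pathlib import Path

# Assumes the snapshot_download call above with local_dir = "DeepSeek-R1-GGUF"
# and the UD-IQ1_S quant; adjust the glob if you picked a different quant.
shards = sorted(Path("DeepSeek-R1-GGUF").glob("DeepSeek-R1-UD-IQ1_S/*.gguf"))
for shard in shards:
    print(shard, f"{shard.stat().st_size / 1e9:.1f} GB")

# llama-cli only needs the first shard via --model; it picks up the rest itself.
print("Pass to --model:", shards[0])
```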
5. Example with a Q4_0 K quantized cache. **Notice: `-no-cnv` disables auto conversation mode.**
```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --temp 0.6 \
    ...
```
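On step 3's `--min-p 0.05`: as a rough illustration of the idea (not llama.cpp's actual code path), min-p keeps only tokens whose probability is at least `min_p` times that of the most likely token, which prunes the very rare tokens the low-bit quants can otherwise surface.
```python
# Toy illustration of min-p filtering: keep tokens whose probability is at
# least min_p times the top token's probability, then renormalize.
def min_p_filter(probs: dict[str, float], min_p: float = 0.05) -> dict[str, float]:
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

print(min_p_filter({"the": 0.60, "a": 0.30, "zxqv": 0.001}))
# "zxqv" is dropped: 0.001 < 0.05 * 0.60 = 0.03
```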
6. If you have a GPU (an RTX 4090, for example) with 24GB of VRAM, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --n-gpu-layers 7 \
    ...
```
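A rough way to pick `--n-gpu-layers` is to divide the model's disk size by its layer count and see how many layers fit in VRAM after leaving headroom for the KV cache and CUDA buffers. The sketch below assumes the 131GB 1.58bit quant, 61 transformer layers for DeepSeek-R1, and 8GB of headroom - all numbers you should adjust for your setup:
```python
# Back-of-the-envelope layer-offload estimate. Assumptions to verify:
# 131 GB on disk (1.58bit quant), 61 transformer layers, 8 GB of headroom
# for the KV cache and CUDA buffers on a 24 GB card.
disk_size_gb = 131
num_layers = 61
vram_gb = 24
headroom_gb = 8

gb_per_layer = disk_size_gb / num_layers
layers_on_gpu = int((vram_gb - headroom_gb) // gb_per_layer)
print(f"~{gb_per_layer:.2f} GB per layer -> roughly {layers_on_gpu} layers on the GPU")
# With these assumptions the estimate lands near the --n-gpu-layers 7 used above.
```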
7. If you want to merge the weights together, use this script:
```bash
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf
```
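Merging rewrites every shard into one file, so you need roughly the model's full size again in free disk space (about 131GB for the 1.58bit quant). A quick pre-flight check:
```python
import shutil

# The merged 1.58bit file is ~131 GB, so make sure the target disk has at
# least that much free space, plus a little slack.
required_gb = 131
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free: {free_gb:.0f} GB, needed: ~{required_gb} GB")
if free_gb < required_gb * 1.05:
    print("Not enough space here - merge onto a larger disk instead.")
```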
| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
| -------- | ---------- | --------- | -------- | ---- | ------- |
| 1.58bit | UD-IQ1_S | **131GB** | Fair | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S) | MoE all 1.56bit. `down_proj` in MoE mixture of 2.06/1.56bit |
| 1.73bit | UD-IQ1_M | **158GB** | Good | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_M) | MoE all 1.56bit. `down_proj` in MoE left at 2.06bit |
| 2.22bit | UD-IQ2_XXS | **183GB** | Better | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ2_XXS) | MoE all 2.06bit. `down_proj` in MoE mixture of 2.5/2.06bit |
| 2.51bit | UD-Q2_K_XL | **212GB** | Best | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-Q2_K_XL) | MoE all 2.5bit. `down_proj` in MoE mixture of 3.5/2.5bit |

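As a sanity check on the table, the disk sizes line up with rough bits-per-weight arithmetic if you assume roughly 671B total parameters for DeepSeek-R1 (an assumption worth verifying against the model card):
```python
# Rough bits-per-weight implied by each quant's disk size, assuming ~671B
# total parameters for DeepSeek-R1.
total_params = 671e9
for name, disk_gb in [("UD-IQ1_S", 131), ("UD-IQ1_M", 158),
                      ("UD-IQ2_XXS", 183), ("UD-Q2_K_XL", 212)]:
    bits_per_weight = disk_gb * 1e9 * 8 / total_params
    print(f"{name}: ~{bits_per_weight:.2f} bits/weight")
# UD-IQ1_S comes out to ~1.6 bits/weight, consistent with the "1.58bit" label.
```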
# Finetune LLMs 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb