---
base_model: perplexity-ai/r1-1776
license: mit
language:
- en
tags:
- deepseek
- deepseek_v3
- transformer
- GGUF
---

# huihui-ai/r1-1776-GGUF

This model was converted from r1-1776 to GGUF format. **Even GPUs with as little as 8 GB of memory** can try it.

**GGUF: Q2_K, Q3_K_M, Q4_K_M, and Q8_0 are all supported.**

## BF16 to f16.gguf

1. Download the [perplexity-ai/r1-1776](https://huggingface.co/perplexity-ai/r1-1776) model; it requires approximately 1.21 TB of space.

```
cd /home/admin/models
huggingface-cli download perplexity-ai/r1-1776 --local-dir ./perplexity-ai/r1-1776
```

2. Use the [llama.cpp](https://github.com/ggerganov/llama.cpp) conversion script to convert r1-1776 to GGUF format; this requires approximately another 1.22 TB of space.

```
python convert_hf_to_gguf.py /home/admin/models/perplexity-ai/r1-1776 --outfile /home/admin/models/perplexity-ai/r1-1776/ggml-model-f16.gguf --outtype f16
```

3. Use the [llama.cpp](https://github.com/ggerganov/llama.cpp) quantization tool to quantize the model (`llama-quantize` needs to be compiled first); see the other [quant options](https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp). Convert to Q2_K first; this requires approximately another 227 GB of space.

```
llama-quantize /home/admin/models/perplexity-ai/r1-1776/ggml-model-f16.gguf /home/admin/models/perplexity-ai/r1-1776/ggml-model-Q2_K.gguf Q2_K
```

4. Use llama-cli to test the quantized model.

```
llama-cli -m /home/admin/models/perplexity-ai/r1-1776/ggml-model-Q2_K.gguf -n 2048
```

## Use with ollama

You can use [huihui_ai/perplexity-ai-r1](https://ollama.com/huihui_ai/perplexity-ai-r1) directly:

```
ollama run huihui_ai/perplexity-ai-r1:671b-q2_K
```

or [huihui_ai/perplexity-ai-r1:671b-q3_K_M](https://ollama.com/huihui_ai/perplexity-ai-r1:671b-q3_K_M):

```
ollama run huihui_ai/perplexity-ai-r1:671b-q3_K_M
```

### Modelfile

The Modelfile is based on ggml-model-Q2_K.gguf. A single GPU with 24 GB of memory can hold 4 layers, so `num_gpu` would be set to 4; with 8 GPUs of 24 GB each, `num_gpu` can be 32. This parameter can also be set in ollama at runtime, so the Modelfile below sets `num_gpu` to a minimum value of 1 and you can raise it later with `/set parameter`. Adjust `num_gpu` according to your own tests, based on the number of GPUs and the amount of GPU memory available.

1. Modify the Modelfile.

```
FROM perplexity-ai/r1-1776/ggml-model-Q2_K.gguf
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }}
{{- end }}"""
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
PARAMETER num_gpu 1
```

2. Use `ollama create` to create the quantized model.

```
ollama create -f Modelfile huihui_ai/perplexity-ai-r1:671b-q2_K
```

![image/png](png/00.png)

3. Run the model.

```
ollama run huihui_ai/perplexity-ai-r1:671b-q2_K
```

![image/png](png/01.png)

4. Set the parameters before asking a question. Each `/set parameter` command can cause ollama to reload the model. `num_thread` is the number of CPU cores to use; it's recommended to use half the cores in your machine, otherwise the CPU will run at 100%.
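For example (a minimal sketch, assuming a Linux host with coreutils), you can check how many cores are available with `nproc` and pass roughly half of that value to `num_thread`:

```
# Print the number of available CPU cores (Linux)
nproc
# Roughly half of that value is a reasonable num_thread setting
echo $(( $(nproc) / 2 ))
```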
"num_ctx" for ollama refers to the number of context slots or the number of contexts the model can maintain during inference. ``` /set parameter num_thread 32 ``` ``` /set parameter num_ctx 2048 ``` ![image/png](png/02-1.png) If it's an 8-card (GPU, 24GB)configuration, you also need to set the num_gpu parameter. ``` /set parameter num_gpu 32 ``` ![image/png](png/02.png) The above three parameters should be set one at a time, and do not send them all to Ollama at once. 6. [Q2.K.gguf](https://huggingface.co/huihui-ai/r1-1776-GGUF/tree/main/Q2_K-GGUF) is now available for download. If you want to merge the weights together, use this script: ``` llama-gguf-split --merge Q2_K-GGUF/r1-1776-Q2_K-00001-of-00005.gguf r1-1776-q2_K.gguf ``` 7. Q3_K_M, Q4_K_M, Q8_0 also supports it, and it will likely need at least 12GB of memory. We will upload q3_K_M, q4_K_M shortly 8. Q8_0 also supports it, and it will likely need at least 24GB of memory. We will upload Q8_0 shortly ### Donation If you like it, please click 'like' and follow us for more updates. ##### Your donation helps us continue our further development and improvement, a cup of coffee can do it. - bitcoin: ``` bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge ```