---
base_model: perplexity-ai/r1-1776
license: mit
language:
- en
tags:
- deepseek
- deepseek_v3
- transformer
- GGUF
---

# huihui-ai/r1-1776-GGUF

This model was converted from r1-1776 to GGUF format. **Even GPUs with as little as 8 GB of memory** can try it.

**GGUF: Q2_K, Q3_K_M, Q4_K_M, and Q8_0 are all supported.**

## BF16 to f16.gguf

1. Download the [perplexity-ai/r1-1776](https://huggingface.co/perplexity-ai/r1-1776) model; it requires approximately 1.21 TB of space.

```
cd /home/admin/models
huggingface-cli download perplexity-ai/r1-1776 --local-dir ./perplexity-ai/r1-1776
```

2. Use the [llama.cpp](https://github.com/ggerganov/llama.cpp) conversion script to convert r1-1776 to GGUF format; this requires approximately another 1.22 TB of space.

```
python convert_hf_to_gguf.py /home/admin/models/perplexity-ai/r1-1776 --outfile /home/admin/models/perplexity-ai/r1-1776/ggml-model-f16.gguf --outtype f16
```

3. Use the [llama.cpp](https://github.com/ggerganov/llama.cpp) quantization tool to quantize the model (`llama-quantize` needs to be compiled first); see the other [quant options](https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp). Convert to Q2_K first; this requires approximately another 227 GB of space.

```
llama-quantize /home/admin/models/perplexity-ai/r1-1776/ggml-model-f16.gguf /home/admin/models/perplexity-ai/r1-1776/ggml-model-Q2_K.gguf Q2_K
```

4. Use llama-cli to test the quantized model.

```
llama-cli -m /home/admin/models/perplexity-ai/r1-1776/ggml-model-Q2_K.gguf -n 2048
```

## Use with ollama

You can use [huihui_ai/perplexity-ai-r1](https://ollama.com/huihui_ai/perplexity-ai-r1) directly:

```
ollama run huihui_ai/perplexity-ai-r1:671b-q2_K
```

or [huihui_ai/perplexity-ai-r1:671b-q3_K_M](https://ollama.com/huihui_ai/perplexity-ai-r1:671b-q3_K_M):

```
ollama run huihui_ai/perplexity-ai-r1:671b-q3_K_M
```

### Modelfile

The Modelfile is based on ggml-model-Q2_K.gguf. A single GPU with 24 GB of memory can hold 4 layers, so `num_gpu` would be set to 4; with 8 GPUs of 24 GB each, `num_gpu` can be 32. This parameter can also be set in ollama at runtime, so the Modelfile below sets `num_gpu` to a minimum value of 1 and you can raise it later with `/set parameter`. Adjust `num_gpu` according to your own tests, based on the number of GPUs and the amount of GPU memory available.

1. Modify the Modelfile.

```
FROM perplexity-ai/r1-1776/ggml-model-Q2_K.gguf
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }}
{{- end }}"""
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
PARAMETER num_gpu 1
```

2. Use `ollama create` to create the quantized model.

```
ollama create -f Modelfile huihui_ai/perplexity-ai-r1:671b-q2_K
```

![image/png](png/00.png)

3. Run the model.

```
ollama run huihui_ai/perplexity-ai-r1:671b-q2_K
```

![image/png](png/01.png)

4. Set the parameters before asking a question. Each `/set parameter` command can cause ollama to reload the model. `num_thread` is the number of CPU cores to use; it's recommended to use half the cores in your machine, otherwise the CPU will run at 100%.
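For example (a minimal sketch, assuming a Linux host with coreutils), you can check how many cores are available with `nproc` and pass roughly half of that value to `num_thread`:

```
# Print the number of available CPU cores (Linux)
nproc
# Roughly half of that value is a reasonable num_thread setting
echo $(( $(nproc) / 2 ))
```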
"num_ctx" for ollama refers to the number of context slots or the number of contexts the model can maintain during inference. ``` /set parameter num_thread 32 ``` ``` /set parameter num_ctx 2048 ``` ![image/png](png/02-1.png) If it's an 8-card (GPU, 24GB)configuration, you also need to set the num_gpu parameter. ``` /set parameter num_gpu 32 ``` ![image/png](png/02.png) The above three parameters should be set one at a time, and do not send them all to Ollama at once. 6. [Q2.K.gguf](https://huggingface.co/huihui-ai/r1-1776-GGUF/tree/main/Q2_K-GGUF) is now available for download. If you want to merge the weights together, use this script: ``` llama-gguf-split --merge Q2_K-GGUF/r1-1776-Q2_K-00001-of-00005.gguf r1-1776-q2_K.gguf ``` 7. Q3_K_M, Q4_K_M, Q8_0 also supports it, and it will likely need at least 12GB of memory. We will upload q3_K_M, q4_K_M shortly 8. Q8_0 also supports it, and it will likely need at least 24GB of memory. We will upload Q8_0 shortly ### Donation If you like it, please click 'like' and follow us for more updates. ##### Your donation helps us continue our further development and improvement, a cup of coffee can do it. - bitcoin: ``` bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge ```