huihui-ai/r1-1776-GGUF
This model was converted from r1-1776 to GGUF format; even GPUs with as little as 8 GB of memory can try it.
GGUF quantizations: Q2_K, Q3_K_M, Q4_K_M, and Q8_0 are all supported.
BF16 to f16.gguf
- Download the perplexity-ai/r1-1776 model; it requires approximately 1.21 TB of space.
cd /home/admin/models
huggingface-cli download perplexity-ai/r1-1776 --local-dir ./perplexity-ai/r1-1776
- Use the llama.cpp conversion script to convert r1-1776 to GGUF format; this requires an additional approximately 1.22 TB of space.
python convert_hf_to_gguf.py /home/admin/models/perplexity-ai/r1-1776 --outfile /home/admin/models/perplexity-ai/r1-1776/ggml-model-f16.gguf --outtype f16
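If convert_hf_to_gguf.py and its Python dependencies are not set up yet, a typical llama.cpp checkout looks roughly like this (standard upstream repository and default paths; adjust to your environment):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt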
- Use the llama.cpp quantization tool to quantize the model (llama-quantize needs to be compiled first; see the build sketch below). Other quantization types are also available. Convert to Q2_K first; this requires an additional approximately 227 GB of space.
llama-quantize /home/admin/models/perplexity-ai/r1-1776/ggml-model-f16.gguf /home/admin/models/perplexity-ai/r1-1776/ggml-model-Q2_K.gguf Q2_K
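If llama-quantize has not been built yet, a minimal CMake build of llama.cpp looks roughly like this (the CUDA flag is optional and shown only as an example; the binary ends up under build/bin/):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j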
- Use llama-cli to test.
llama-cli -m /home/admin/models/perplexity-ai/r1-1776/ggml-model-Q2_K.gguf -n 2048
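For a quick interactive check, you can also pass a prompt and offload a few layers to the GPU (the prompt, token count, and layer count below are only examples):
llama-cli -m /home/admin/models/perplexity-ai/r1-1776/ggml-model-Q2_K.gguf -p "Why is the sky blue?" -n 512 -c 2048 -ngl 4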
Use with ollama
You can use huihui_ai/perplexity-ai-r1 directly
ollama run huihui_ai/perplexity-ai-r1:671b-q2_K
Modelfile
The Modelfile is based on ggml-model-Q2_K.gguf.
A single GPU with 24 GB of memory can hold 4 layers of the model, so num_gpu would be set to 4 per such GPU.
With 8 GPUs of 24 GB each, num_gpu can be set to 32; this value is passed to ollama as a parameter.
However, the Modelfile below sets num_gpu to a minimum value of 1, which can be raised later by setting the parameter at runtime.
Adjust the exact values according to your own tests: num_gpu depends on the number of GPUs and the amount of GPU memory available.
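A rough way to see where these numbers come from (a back-of-the-envelope sketch; the ~61-layer count is an assumption based on the DeepSeek-V3/R1 architecture, and the per-layer size is an estimate, not a measurement):
# ~227 GB of Q2_K weights / ~61 transformer layers ≈ 3.7 GB per layer
# 24 GB GPU minus KV cache and runtime overhead → room for about 4 layers → num_gpu 4
# 8 such GPUs → num_gpu 4 × 8 = 32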
- Modify Modelfile
FROM perplexity-ai/r1-1776/ggml-model-Q2_K.gguf
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }}
{{- end }}"""
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
PARAMETER num_gpu 1
- Use ollama create to create the quantized model.
ollama create -f Modelfile huihui_ai/perplexity-ai-r1:671b-q2_K
- Run model
ollama run huihui_ai/perplexity-ai-r1:671b-q2_K
- Set parameters before asking the question.
Each /set parameter command in ollama can cause the model to reload.
"num_thread" should be based on the number of CPU cores on your machine; using about half of them is recommended, otherwise the CPU will sit at 100%.
"num_ctx" sets the context window size (in tokens) that the model maintains during inference.
/set parameter num_thread 32
/set parameter num_ctx 2048
With an 8-GPU configuration (24 GB each), you also need to set the num_gpu parameter.
/set parameter num_gpu 32
Set the three parameters above one at a time; do not send them all to ollama at once.
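If you prefer not to set these interactively, the same options can also be passed per request through ollama's REST API (a sketch assuming the default endpoint on localhost:11434; the prompt and values are only examples):
curl http://localhost:11434/api/generate -d '{
  "model": "huihui_ai/perplexity-ai-r1:671b-q2_K",
  "prompt": "Why is the sky blue?",
  "options": { "num_thread": 32, "num_ctx": 2048, "num_gpu": 32 }
}'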
- Q2_K.gguf is now available for download. If you want to merge the split weights back into a single file, use this command:
llama-gguf-split --merge Q2_K-GGUF/r1-1776-Q2_K-00001-of-00005.gguf r1-1776-q2_K.gguf
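If you have not downloaded the split files yet, they can be fetched first (the --include pattern assumes the shards live under Q2_K-GGUF/ in this repository, matching the file name in the merge command above):
huggingface-cli download huihui-ai/r1-1776-GGUF --include "Q2_K-GGUF/*" --local-dir ./huihui-ai/r1-1776-GGUF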
The same applies to Q3_K_M and Q4_K_M, which will likely need at least 12 GB of memory; we will upload q3_K_M and q4_K_M shortly.
It also applies to Q8_0, which will likely need at least 24 GB of memory; we will upload Q8_0 shortly.
Donation
If you like it, please click 'like' and follow us for more updates.
Your donation helps us continue further development and improvement; even the price of a cup of coffee can help.
- bitcoin:
bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge