Running this model using vLLM Docker
The vLLM instructions under "Use this model" (in the corner of the model page) say to run this:
docker run --runtime nvidia --gpus all \
    --name my_vllm_container \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model unsloth/DeepSeek-R1-GGUF
How do I choose which quantization to run?
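A first step toward picking a quantization is seeing which ones a repo actually ships. The sketch below groups GGUF filenames by the quantization tag in the name. It assumes the tag sits just before ".gguf" (optionally followed by a shard suffix like "-00001-of-00003"); the helper name group_gguf_by_quant and every filename except the Q4_K_M one quoted in this thread are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the quantization tag is the last name component before ".gguf",
# optionally followed by a shard suffix such as "-00001-of-00003".
_QUANT_RE = re.compile(
    r"[-.]((?:IQ|Q)\d[A-Z0-9_]*|BF16|F16|F32)(?:-\d{5}-of-\d{5})?\.gguf$",
    re.IGNORECASE,
)


def group_gguf_by_quant(filenames):
    """Map each quantization tag (e.g. "Q4_K_M") to the GGUF files carrying it."""
    groups = defaultdict(list)
    for name in filenames:
        match = _QUANT_RE.search(name)
        if match:  # skip non-GGUF files and names without a recognizable tag
            groups[match.group(1).upper()].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative listing only, not the real contents of any repo.
    example_files = [
        "DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",
        "example-model-Q8_0.gguf",
        "example-model-IQ1_S-00001-of-00003.gguf",
        "example-model-IQ1_S-00002-of-00003.gguf",
        "README.md",
    ]
    for quant, files in sorted(group_gguf_by_quant(example_files).items()):
        print(quant, len(files), "file(s)")
```

The filename list could come from huggingface_hub.list_repo_files("unsloth/DeepSeek-R1-GGUF"), which needs network access. Note that the vLLM GGUF page linked further down in this thread shows the model argument pointing at one local .gguf file rather than at a whole repo, so a quantization spread over several shards may need merging first.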
I posted my steps to (in theory) get this working here. However, it appears that @shimmyshimmer has removed the mention of using vLLM from the original blog post due to the current lack of practical support for DeepSeek GGUF files in vLLM.
Fwiw, I also didn't have luck running this model in oobabooga/text-generation-webui. It looks like that tool uses an older version of llama.cpp via llama-cpp-python, which hasn't had a release since last year, so it's not up to date with the new llama.cpp changes that support DeepSeek models.
You can run it with GPUStack (https://github.com/gpustack/gpustack). It contains llama-box, which is based on llama.cpp and has the up-to-date changes.
Code from https://docs.vllm.ai/en/latest/features/quantization/gguf.html:
from vllm import LLM, SamplingParams

# In this script, we demonstrate how to pass input to the chat method:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF",
          tokenizer="Qwen/Qwen2.5-32B-Instruct")

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
After I modified the model and tokenizer, it raises an error:
OSERROR: It looks like the config file at 'xxxx/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf' is not a valid JSON file.
Question
Is this GGUF not supported by vLLM?
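One way to narrow down an error like the one above is to rule out a bad path or a corrupted download before blaming the loader. The sketch below checks that a file starts with the 4-byte GGUF magic and reads the format version that follows it. It assumes a little-endian file, and the helper name read_gguf_version is hypothetical. It says nothing about whether vLLM supports the model architecture inside the file.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if `path` is not a GGUF file.

    Assumption: the version is a little-endian uint32 right after the magic.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return None  # missing path, or a directory / repo id instead of a file
    with file_path.open("rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None  # e.g. a JSON config, or a truncated download
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    import sys

    # Usage: python check_gguf.py /path/to/model.gguf
    for candidate in sys.argv[1:]:
        version = read_gguf_version(candidate)
        if version is None:
            print(f"{candidate}: not a readable GGUF file")
        else:
            print(f"{candidate}: GGUF version {version}")
```

If the file passes this check and the OSError still appears, the problem is more likely on the loading side than in the download.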
Follow the GitHub issue here. It should be supported soon once the change is pushed: https://github.com/vllm-project/vllm/issues/12573