---
license: mit
tags:
- deepseek
- int4
- vllm
- llmcompressor
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
library_name: transformers
---

# DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16

## Model Overview
- **Model Architecture:** Qwen2ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
- **Release Date:** 2/4/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).

### Model Optimizations

This model was obtained by quantizing the weights of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%.

Only the weights of the linear operators within transformer blocks are quantized. Weights are quantized using a symmetric per-group scheme, with group size 128. The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16"

tokenizer = AutoTokenizer.from_pretrained(model_name)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
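For example, after launching a server with `vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16`, it can be queried with the OpenAI client. The sketch below is illustrative; the base URL and placeholder API key assume vLLM's default local settings.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on localhost:8000 by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16",
    messages=[{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
    temperature=0.6,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```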
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model_name = model_stub.split("/")[-1]

num_samples = 2048
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Render each calibration example with the model's chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme (GPTQ, weight-only INT4)
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

## Evaluation

The model was evaluated on the OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) benchmarks, using the following commands:

OpenLLM Leaderboard V1:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

OpenLLM Leaderboard V2:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```
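In the accuracy results below, "Recovery" is the quantized model's score expressed as a percentage of the unquantized baseline's score. A minimal illustration using two values from the table:

```python
def recovery(quantized: float, baseline: float) -> str:
    """Quantized score as a percentage of the baseline score."""
    return f"{100 * quantized / baseline:.1f}%"

print(recovery(58.28, 58.79))  # ARC-Challenge -> 99.1%
print(recovery(70.85, 70.99))  # OpenLLM V1 average -> 99.8%
```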
### Accuracy

| Category | Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | Recovery |
|---|---|---|---|---|
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 58.79 | 58.28 | 99.1% |
| | GSM8K (Strict-Match, 5-shot) | 87.04 | 87.34 | 100.4% |
| | HellaSwag (Acc-Norm, 10-shot) | 81.51 | 80.42 | 98.7% |
| | MMLU (Acc, 5-shot) | 74.46 | 73.32 | 98.5% |
| | TruthfulQA (MC2, 0-shot) | 54.77 | 55.29 | 101.0% |
| | Winogrande (Acc, 5-shot) | 69.38 | 70.48 | 101.6% |
| | Average Score | 70.99 | 70.85 | 99.8% |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 43.05 | 34.90 | 81.1% |
| | BBH (Acc-Norm, 3-shot) | 47.16 | 45.36 | 96.2% |
| | Math-Hard (Exact-Match, 4-shot) | 0.00 | 0.00 | --- |
| | GPQA (Acc-Norm, 0-shot) | 35.07 | 34.90 | 99.5% |
| | MUSR (Acc-Norm, 0-shot) | 45.14 | 44.20 | 97.9% |
| | MMLU-Pro (Acc, 5-shot) | 34.86 | 35.09 | 100.7% |
| | Average Score | 34.21 | 32.41 | 94.7% |
| Coding | HumanEval (pass@1) | 78.90 | 79.00 | 100.1% |
| | HumanEval (pass@10) | 89.80 | 89.70 | 99.9% |
| | HumanEval+ (pass@1) | 72.60 | 72.80 | 100.3% |
| | HumanEval+ (pass@10) | 84.90 | 84.00 | 98.8% |
## Inference Performance

Use case profiles are given as prompt tokens / generation tokens. QPD denotes queries per dollar.

### Single-stream latency

| | | | Instruction Following<br>256 / 128 | | Multi-turn Chat<br>512 / 256 | | Docstring Generation<br>768 / 128 | | RAG<br>1024 / 128 | | Code Completion<br>256 / 1024 | | Code Fixing<br>1024 / 1024 | | Large Summarization<br>4096 / 512 | | Large RAG<br>10240 / 1536 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hardware | Model | Average cost reduction | Latency (s) | QPD | Latency (s) | QPD | Latency (s) | QPD | Latency (s) | QPD | Latency (s) | QPD | Latency (s) | QPD | Latency (s) | QPD | Latency (s) | QPD |
| A6000x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 5.4 | 837 | 10.7 | 419 | 5.5 | 813 | 5.6 | 805 | 42.2 | 107 | 42.8 | 105 | 22.9 | 197 | 71.7 | 63 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | 1.59 | 3.3 | 1345 | 6.7 | 673 | 3.4 | 1315 | 3.5 | 1296 | 26.5 | 170 | 26.8 | 168 | 14.5 | 310 | 48.3 | 93 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 2.51 | 2.0 | 2275 | 4.0 | 1127 | 2.2 | 2072 | 2.3 | 1945 | 15.3 | 294 | 15.9 | 283 | 9.9 | 456 | 36.6 | 123 |
| A100x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 2.6 | 765 | 5.2 | 383 | 2.7 | 746 | 2.7 | 732 | 20.8 | 97 | 21.2 | 95 | 11.3 | 179 | 36.7 | 55 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | 1.34 | 1.9 | 1072 | 3.8 | 533 | 1.9 | 1045 | 1.9 | 1032 | 14.8 | 136 | 15.2 | 132 | 8.1 | 248 | 39.6 | 51 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 1.93 | 1.2 | 1627 | 2.5 | 810 | 1.3 | 1530 | 1.4 | 1474 | 9.7 | 208 | 10.2 | 197 | 5.8 | 348 | 37.6 | 53 |
| H100x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 1.6 | 672 | 3.3 | 334 | 1.7 | 662 | 1.7 | 652 | 12.8 | 85 | 13.0 | 84 | 7.0 | 155 | 25.2 | 43 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic | 1.33 | 1.2 | 925 | 2.3 | 467 | 1.2 | 908 | 1.2 | 896 | 9.3 | 118 | 9.5 | 115 | 5.2 | 210 | 23.9 | 46 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 1.37 | 1.2 | 944 | 2.3 | 474 | 1.2 | 931 | 1.2 | 907 | 9.1 | 121 | 9.2 | 119 | 5.1 | 214 | 22.5 | 49 |
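The benchmarking harness behind these numbers is not included in this card. As an illustration only, single-stream latency for a profile such as Instruction Following (256 prompt tokens / 128 generation tokens) can be approximated with a timing loop around vLLM; the dummy prompt tokens and `ignore_eos` setting below are assumptions of this sketch, not the measured setup.

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16")

# Instruction Following profile: 256 prompt tokens / 128 generation tokens
prompt_token_ids = [[0] * 256]  # dummy prompt of the target length
sampling_params = SamplingParams(max_tokens=128, ignore_eos=True)  # force the full generation length

# Warm up once, then time a single query end to end
llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
start = time.perf_counter()
llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
print(f"single-stream latency: {time.perf_counter() - start:.1f} s")
```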
### Maximum throughput

| | | | Instruction Following<br>256 / 128 | | Multi-turn Chat<br>512 / 256 | | Docstring Generation<br>768 / 128 | | RAG<br>1024 / 128 | | Code Completion<br>256 / 1024 | | Code Fixing<br>1024 / 1024 | | Large Summarization<br>4096 / 512 | | Large RAG<br>10240 / 1536 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hardware | Model | Average cost reduction | Maximum throughput (QPS) | QPD | Maximum throughput (QPS) | QPD | Maximum throughput (QPS) | QPD | Maximum throughput (QPS) | QPD | Maximum throughput (QPS) | QPD | Maximum throughput (QPS) | QPD | Maximum throughput (QPS) | QPD | Maximum throughput (QPS) | QPD |
| A6000x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 13.7 | 30785 | 5.5 | 12327 | 6.5 | 14517 | 5.1 | 11439 | 2.0 | 4434 | 1.3 | 2982 | 0.6 | 1462 | 0.2 | 371 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | 1.44 | 21.4 | 48181 | 8.2 | 18421 | 9.8 | 22051 | 7.8 | 17462 | 2.8 | 6281 | 1.7 | 3758 | 1.0 | 2335 | 0.2 | 419 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 0.98 | 12.7 | 28540 | 5.7 | 12796 | 5.4 | 12218 | 3.7 | 8401 | 2.5 | 5583 | 1.3 | 2987 | 0.7 | 1489 | 0.2 | 368 |
| A100x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 15.6 | 31306 | 7.1 | 14192 | 7.7 | 15435 | 6.0 | 11971 | 2.4 | 4878 | 1.6 | 3298 | 0.9 | 1862 | 0.2 | 355 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8 | 1.31 | 20.8 | 41907 | 9.3 | 18724 | 10.5 | 21043 | 8.4 | 16886 | 3.0 | 5975 | 1.9 | 3917 | 1.2 | 2481 | 0.2 | 464 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 0.94 | 14.0 | 28146 | 6.5 | 13042 | 6.5 | 12987 | 5.1 | 10194 | 2.6 | 5269 | 1.5 | 2925 | 0.9 | 1849 | 0.2 | 382 |
| H100x1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | --- | 31.4 | 34404 | 14.1 | 15482 | 16.6 | 18149 | 13.3 | 14572 | 4.7 | 5099 | 2.6 | 2849 | 1.9 | 2060 | 0.3 | 347 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic | 1.31 | 40.9 | 44729 | 18.5 | 20260 | 22.1 | 24165 | 18.1 | 19779 | 5.7 | 6246 | 3.4 | 3681 | 2.5 | 2746 | 0.4 | 474 |
| | neuralmagic/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 | 1.12 | 33.3 | 36387 | 15.0 | 16453 | 17.6 | 19241 | 14.2 | 15576 | 4.6 | 5034 | 3.0 | 3292 | 2.2 | 2412 | 0.4 | 481 |
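In both tables, the average cost reduction is consistent with averaging, over the eight use case profiles, the ratio of the quantized model's QPD to the unquantized baseline's QPD on the same hardware. For example, the A6000 single-stream figure for the W4A16 model can be reproduced from the table values:

```python
# QPD values for the A6000 rows of the single-stream latency table above
baseline_qpd = [837, 419, 813, 805, 107, 105, 197, 63]
w4a16_qpd = [2275, 1127, 2072, 1945, 294, 283, 456, 123]

ratios = [q / b for q, b in zip(w4a16_qpd, baseline_qpd)]
print(f"average cost reduction: {sum(ratios) / len(ratios):.2f}")  # -> 2.51
```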