Commit 9ffbf9c (1 parent: dd31960), committed by msr2000 and simon-mo

Update README.md with vLLM Support (#8)

- Update README.md with vLLM Support (87663416be779d980e4aa102edc328b85d7802f6)


Co-authored-by: Simon Mo <[email protected]>

Files changed (1)
  1. README.md +9 -4
README.md CHANGED
@@ -230,8 +230,9 @@ DeepSeek-V3 can be deployed locally using the following hardware and open-source
  2. **SGLang**: Fully support the DeepSeek-V3 model in both BF16 and FP8 inference modes.
  3. **LMDeploy**: Enables efficient FP8 and BF16 inference for local and cloud deployment.
  4. **TensorRT-LLM**: Currently supports BF16 inference and INT4/8 quantization, with FP8 support coming soon.
- 5. **AMD GPU**: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes.
- 6. **Huawei Ascend NPU**: Supports running DeepSeek-V3 on Huawei Ascend devices.
+ 5. **vLLM**: Support DeepSeek-V3 model with FP8 and BF16 modes for tensor parallelism and pipeline parallelism.
+ 6. **AMD GPU**: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes.
+ 7. **Huawei Ascend NPU**: Supports running DeepSeek-V3 on Huawei Ascend devices.

  Since FP8 training is natively adopted in our framework, we only provide FP8 weights. If you require BF16 weights for experimentation, you can use the provided conversion script to perform the transformation.

@@ -303,11 +304,15 @@ For comprehensive step-by-step instructions on running DeepSeek-V3 with LMDeploy

  [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) now supports the DeepSeek-V3 model, offering precision options such as BF16 and INT4/INT8 weight-only. Support for FP8 is currently in progress and will be released soon. You can access the custom branch of TRTLLM specifically for DeepSeek-V3 support through the following link to experience the new features directly: https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3.

- ### 6.5 Recommended Inference Functionality with AMD GPUs
+ ### 6.5 Inference with vLLM (recommended)
+
+ [vLLM](https://github.com/vllm-project/vllm) v0.6.6 supports DeepSeek-V3 inference for FP8 and BF16 modes on both NVIDIA and AMD GPUs. Aside from standard techniques, vLLM offers _pipeline parallelism_ allowing you to run this model on multiple machines connected by networks. For detailed guidance, please refer to the [vLLM instructions](https://docs.vllm.ai/en/latest/serving/distributed_serving.html). Please feel free to follow [the enhancement plan](https://github.com/vllm-project/vllm/issues/11539) as well.
+
+ ### 6.6 Recommended Inference Functionality with AMD GPUs

  In collaboration with the AMD team, we have achieved Day-One support for AMD GPUs using SGLang, with full compatibility for both FP8 and BF16 precision. For detailed guidance, please refer to the [SGLang instructions](#63-inference-with-lmdeploy-recommended).

- ### 6.6 Recommended Inference Functionality with Huawei Ascend NPUs
+ ### 6.7 Recommended Inference Functionality with Huawei Ascend NPUs
  The [MindIE](https://www.hiascend.com/en/software/mindie) framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3. For step-by-step guidance on Ascend NPUs, please follow the [instructions here](https://modelers.cn/models/MindIE/deepseekv3).


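Reviewer note: the vLLM section added here mentions combining tensor parallelism with pipeline parallelism across machines but gives no launch example. The sketch below only assembles a `vllm serve` command line; it does not start a server. The helper name `build_vllm_serve_command`, the model id `deepseek-ai/DeepSeek-V3`, the 2-node by 8-GPU layout, and the use of `--trust-remote-code` are assumptions for illustration and are not taken from this commit. A multi-node run also needs the cluster setup described in the linked vLLM distributed serving instructions.

```python
import shlex
from typing import List


def build_vllm_serve_command(
    model: str,
    tensor_parallel_size: int,
    pipeline_parallel_size: int = 1,
) -> List[str]:
    """Build an argv list for `vllm serve`.

    Hypothetical helper for illustration; it is not part of vLLM.
    """
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return [
        "vllm", "serve", model,
        # Shard each layer across the GPUs inside one node.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Split the layer stack into stages, e.g. one stage per node.
        "--pipeline-parallel-size", str(pipeline_parallel_size),
        # Assumption: the checkpoint ships custom model code.
        "--trust-remote-code",
    ]


def total_gpus(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """GPUs required by the layout: one GPU per (stage, shard) pair."""
    return tensor_parallel_size * pipeline_parallel_size


# Assumed layout: 2 nodes with 8 GPUs each.
command = build_vllm_serve_command("deepseek-ai/DeepSeek-V3", 8, 2)
print(shlex.join(command))
print("GPUs required:", total_gpus(8, 2))
```

The product of the two sizes is the number of GPUs the layout needs, so the assumed 8 x 2 layout occupies 16 GPUs in total.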