How to achieve 2500 TPS throughput?

#8
by muziyongshixin - opened

I deploy the model with the commands below, but I can only reach about 900 TPS on 16 A100 GPUs, which is much lower than the reported performance. Is there anything wrong with my settings? Does the reported throughput include prefill tokens?
I also found that the throughput reported by llmuses differs from the number in the SGLang log. Do you know where the gap comes from?
I checked the InfiniBand speed of the instances: it is 100 Gb/s. Could this be a bottleneck?

# master (node-rank 0)
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 \
    --dist-init-addr HEAD_IP:5000 --nnodes 2 --node-rank 0 \
    --trust-remote --enable-torch-compile --torch-compile-max-bs 8 \
    --quantization w8a8_int8
# cluster (node-rank 1)
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 \
    --dist-init-addr HEAD_IP:5000 --nnodes 2 --node-rank 1 \
    --trust-remote --enable-torch-compile --torch-compile-max-bs 8 \
    --quantization w8a8_int8

Benchmark command:

llmuses perf \
--url 'http://localhost:30000/v1/chat/completions' \
--parallel 1024 \
--model 'deepseek-r1' \
--log-every-n-query 10 \
--read-timeout=1000 \
--dataset-path '/data/liyongzhi/vllm/open_qa.jsonl' \
-n 1024 \
--max-prompt-length 10000 \
--api openai \
--temperature 0 \
--dataset openqa

The final llmuses report is below:

Benchmarking summary: 
 Time taken for tests: 1000.524 seconds
 Expected number of requests: 1024
 Number of concurrency: 1024
 Total requests: 587
 Succeed requests: 587
 Failed requests: 0
 Average QPS: 0.587
 Average latency: 613.003
 Throughput(average output tokens per second): 537.683
 Average time to first token: 613.003
 Average input tokens per request: 24.491
 Average output tokens per request: 916.465
 Average time per output token: 0.00186
 Average package per request: 1.000
 Average package latency: 613.003
 Percentile of time to first token: 
     p50: 595.3598
     p66: 694.9941
     p75: 792.9841
     p80: 876.5501
     p90: 944.2382
     p95: 975.8679
     p98: 992.5245
     p99: 994.9762
 Percentile of request latency: 
     p50: 595.3598
     p66: 694.9941
     p75: 792.9841
     p80: 876.5501
     p90: 944.2382
     p95: 975.8679
     p98: 992.5245
     p99: 994.9762
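
As a sanity check, the headline figures in this summary can be reproduced from each other. The sketch below uses only numbers copied from the report. The formulas are an assumption about how llmuses derives its metrics (its source was not checked), but they match the reported values to the printed precision.

```python
# Figures copied from the llmuses summary above.
elapsed_s = 1000.524          # "Time taken for tests"
succeeded = 587               # "Succeed requests"
avg_output_tokens = 916.465   # "Average output tokens per request"

# Assumed formulas (not taken from the llmuses source code).
total_output_tokens = succeeded * avg_output_tokens
throughput = total_output_tokens / elapsed_s   # output tokens per second
qps = succeeded / elapsed_s                    # completed requests per second
time_per_output_token = 1 / throughput         # seconds per output token

print(f"throughput: {throughput:.3f} tok/s")
print(f"QPS: {qps:.3f}")
print(f"time per output token: {time_per_output_token:.5f} s")
```

If these formulas are right, the llmuses throughput is an average over the whole run that counts only output tokens of the 587 completed requests, so it excludes prefill tokens and any tokens generated for requests that had not finished. The run length of about 1000 s also coincides with `--read-timeout=1000`, which may be why only 587 of the 1024 expected requests were counted. An average of this kind will generally sit below the instantaneous decode rate printed by the server.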

The peak throughput in the SGLang log is shown here:

[2025-03-10 22:53:09] INFO:     127.0.0.1:60906 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:09 TP0] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 3, token usage: 0.46, #running-req: 398, #queue-req: 222,                                                                                    
[2025-03-10 22:53:10] INFO:     127.0.0.1:60412 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:10 TP0] Prefill batch. #new-seq: 2, #new-token: 28, #cached-token: 5, token usage: 0.46, #running-req: 398, #queue-req: 220,                                                                                    
[2025-03-10 22:53:10] INFO:     127.0.0.1:60904 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:11 TP0] Prefill batch. #new-seq: 2, #new-token: 44, #cached-token: 5, token usage: 0.45, #running-req: 399, #queue-req: 218,                                                                                    
[2025-03-10 22:53:13] INFO:     127.0.0.1:60794 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:14 TP0] Prefill batch. #new-seq: 1, #new-token: 12, #cached-token: 2, token usage: 0.46, #running-req: 400, #queue-req: 217,                                                                                    
[2025-03-10 22:53:18] INFO:     127.0.0.1:60930 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:18 TP0] Prefill batch. #new-seq: 1, #new-token: 37, #cached-token: 2, token usage: 0.47, #running-req: 400, #queue-req: 216,                                                                                    
[2025-03-10 22:53:20] INFO:     127.0.0.1:60862 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:22] INFO:     127.0.0.1:60942 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:26 TP0] Decode batch. #running-req: 399, #token: 149592, token usage: 0.48, gen throughput (token/s): 834.06, largest-len: 0, #queue-req: 216,                                                                  
[2025-03-10 22:53:28] INFO:     127.0.0.1:60788 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:28] INFO:     127.0.0.1:60652 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:29] INFO:     127.0.0.1:60776 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:36] INFO:     127.0.0.1:60654 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:39] INFO:     127.0.0.1:60780 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:40] INFO:     127.0.0.1:60586 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:42] INFO:     127.0.0.1:60726 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:43 TP0] Decode batch. #running-req: 392, #token: 156301, token usage: 0.50, gen throughput (token/s): 908.57, largest-len: 0, #queue-req: 216,                                                                  
[2025-03-10 22:53:44] INFO:     127.0.0.1:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:46] INFO:     127.0.0.1:60524 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:48] INFO:     127.0.0.1:60798 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:52] INFO:     127.0.0.1:60894 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:58] INFO:     127.0.0.1:60142 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:53:58] INFO:     127.0.0.1:32844 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:00] INFO:     127.0.0.1:60864 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:00 TP0] Decode batch. #running-req: 385, #token: 161554, token usage: 0.52, gen throughput (token/s): 911.98, largest-len: 0, #queue-req: 216,                                                                  
[2025-03-10 22:54:00] INFO:     127.0.0.1:60834 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:01] INFO:     127.0.0.1:60546 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:03] INFO:     127.0.0.1:60576 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:04] INFO:     127.0.0.1:60918 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:08] INFO:     127.0.0.1:60664 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:10] INFO:     127.0.0.1:32890 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:13] INFO:     127.0.0.1:60722 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:13] INFO:     127.0.0.1:60626 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:14] INFO:     127.0.0.1:32984 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                                                     
[2025-03-10 22:54:17 TP0] Decode batch. #running-req: 376, #token: 167461, token usage: 0.54, gen throughput (token/s): 906.96, largest-len: 0, #queue-req: 216,   
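
To compare the server-side numbers with the llmuses report, the `Decode batch` lines can be parsed directly. This is a minimal sketch that assumes the log format shown in the excerpt above; `parse_decode_lines` is a hypothetical helper written for this post, not part of SGLang.

```python
import re

# Matches the "Decode batch" lines in the excerpt above (format assumed from that excerpt).
DECODE_RE = re.compile(
    r"Decode batch\. #running-req: (\d+), #token: (\d+), .*?"
    r"gen throughput \(token/s\): ([\d.]+)"
)

def parse_decode_lines(log_text):
    """Return (running_requests, tokens, gen_throughput) for each decode line."""
    return [
        (int(m.group(1)), int(m.group(2)), float(m.group(3)))
        for m in DECODE_RE.finditer(log_text)
    ]

# The four decode lines from the log above, with trailing whitespace removed.
sample = """\
[2025-03-10 22:53:26 TP0] Decode batch. #running-req: 399, #token: 149592, token usage: 0.48, gen throughput (token/s): 834.06, largest-len: 0, #queue-req: 216,
[2025-03-10 22:53:43 TP0] Decode batch. #running-req: 392, #token: 156301, token usage: 0.50, gen throughput (token/s): 908.57, largest-len: 0, #queue-req: 216,
[2025-03-10 22:54:00 TP0] Decode batch. #running-req: 385, #token: 161554, token usage: 0.52, gen throughput (token/s): 911.98, largest-len: 0, #queue-req: 216,
[2025-03-10 22:54:17 TP0] Decode batch. #running-req: 376, #token: 167461, token usage: 0.54, gen throughput (token/s): 906.96, largest-len: 0, #queue-req: 216,
"""

rows = parse_decode_lines(sample)
mean_tput = sum(tput for _, _, tput in rows) / len(rows)
per_request = [tput / running for running, _, tput in rows]

print(f"mean gen throughput: {mean_tput:.2f} tok/s")
print("per-request decode rate (tok/s):", [round(x, 2) for x in per_request])
```

On these four lines the per-request decode rate works out to a little over 2 tokens per second, so a response of about 916 tokens would need several minutes of decoding on its own, which is in line with the request latencies in the llmuses report.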
