How to achieve 2500 TPS throughput?
#8 opened by muziyongshixin
I used the following commands to deploy the model, but I can only achieve about 900 TPS on 16 A100s, which is much slower than the reported performance. Is there anything wrong with my settings? Does the reported throughput include the prefill tokens?
I also found that the llmuses throughput number differs from the number in the sglang log; do you know where the gap comes from?
I checked the instances' InfiniBand speed, which is 100 Gb/s. Could this be a bottleneck? (A back-of-envelope check follows the launch commands below.)
# master (node rank 0)
python3 -m sglang.launch_server \
  --model meituan/DeepSeek-R1-Block-INT8 \
  --tp 16 \
  --dist-init-addr HEAD_IP:5000 \
  --nnodes 2 --node-rank 0 \
  --trust-remote-code \
  --enable-torch-compile --torch-compile-max-bs 8 \
  --quantization w8a8_int8

# worker (node rank 1)
python3 -m sglang.launch_server \
  --model meituan/DeepSeek-R1-Block-INT8 \
  --tp 16 \
  --dist-init-addr HEAD_IP:5000 \
  --nnodes 2 --node-rank 1 \
  --trust-remote-code \
  --enable-torch-compile --torch-compile-max-bs 8 \
  --quantization w8a8_int8
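On the InfiniBand question, here is the back-of-envelope check mentioned above. It is only a sketch: hidden_size 7168, 61 layers, BF16 activations, and two all-reduces per layer are assumptions taken from DeepSeek-R1's public config and standard Megatron-style TP accounting, not measurements.

# Back-of-envelope: cross-node TP all-reduce traffic during decode.
# Assumptions (not measured): hidden_size=7168, 61 layers, BF16 (2-byte)
# activations, 2 all-reduces per transformer layer per forward pass.
HIDDEN = 7168
LAYERS = 61
ACT_BYTES = 2
ALLREDUCES_PER_LAYER = 2

payload_per_token = HIDDEN * ACT_BYTES * ALLREDUCES_PER_LAYER * LAYERS
ring_factor = 2.0  # ring all-reduce moves ~2*(N-1)/N of the payload per rank
decode_tps = 900   # observed decode rate from the sglang log below

link_gbps = payload_per_token * ring_factor * decode_tps * 8 / 1e9
print(f"~{payload_per_token / 1e6:.2f} MB of all-reduce payload per token")
print(f"~{link_gbps:.0f} Gb/s per link at {decode_tps} tok/s")  # ~25 Gb/s

By this estimate the 100 Gb/s links still have bandwidth headroom at ~900 tok/s; the bigger cross-node cost for TP=16 is more likely the latency of the ~122 small all-reduces on every decode step, which is one reason two-node TP tends to lag a single-node deployment.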
Benchmark command:
llmuses perf \
--url 'http://localhost:30000/v1/chat/completions' \
--parallel 1024 \
--model 'deepseek-r1' \
--log-every-n-query 10 \
--read-timeout=1000 \
--dataset-path '/data/liyongzhi/vllm/open_qa.jsonl' \
-n 1024 \
--max-prompt-length 10000 \
--api openai \
--temperature 0 \
--dataset openqa
The llmuses final report is below:
Benchmarking summary:
Time taken for tests: 1000.524 seconds
Expected number of requests: 1024
Number of concurrency: 1024
Total requests: 587
Succeed requests: 587
Failed requests: 0
Average QPS: 0.587
Average latency: 613.003
Throughput(average output tokens per second): 537.683
Average time to first token: 613.003
Average input tokens per request: 24.491
Average output tokens per request: 916.465
Average time per output token: 0.00186
Average package per request: 1.000
Average package latency: 613.003
Percentile of time to first token:
p50: 595.3598
p66: 694.9941
p75: 792.9841
p80: 876.5501
p90: 944.2382
p95: 975.8679
p98: 992.5245
p99: 994.9762
Percentile of request latency:
p50: 595.3598
p66: 694.9941
p75: 792.9841
p80: 876.5501
p90: 944.2382
p95: 975.8679
p98: 992.5245
p99: 994.9762
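For what it's worth, the throughput line can be reproduced from the summary itself, which also bears on my prefill question: llmuses counts output tokens only, and with these short prompts prefill would add almost nothing. A quick check, with all values copied from the report above:

# Reproduce the llmuses summary numbers (values from the report above).
wall_time = 1000.524   # seconds ("Time taken for tests")
completed = 587        # succeeded requests
out_tokens = 916.465   # average output tokens per request
in_tokens = 24.491     # average input tokens per request

output_tps = completed * out_tokens / wall_time
total_tps = completed * (out_tokens + in_tokens) / wall_time
print(f"output tok/s:  {output_tps:.3f}")  # ~537.68, matches the report
print(f"incl. prefill: {total_tps:.3f}")   # ~552.1, prefill adds ~14 tok/s

Note also that "Average package per request: 1.000" means the responses arrived non-streamed as a single package, so time to first token equals full request latency, which is why the two percentile tables are identical.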
The peak throughput window from the sglang log is here:
[2025-03-10 22:53:09] INFO: 127.0.0.1:60906 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:09 TP0] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 3, token usage: 0.46, #running-req: 398, #queue-req: 222,
[2025-03-10 22:53:10] INFO: 127.0.0.1:60412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:10 TP0] Prefill batch. #new-seq: 2, #new-token: 28, #cached-token: 5, token usage: 0.46, #running-req: 398, #queue-req: 220,
[2025-03-10 22:53:10] INFO: 127.0.0.1:60904 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:11 TP0] Prefill batch. #new-seq: 2, #new-token: 44, #cached-token: 5, token usage: 0.45, #running-req: 399, #queue-req: 218,
[2025-03-10 22:53:13] INFO: 127.0.0.1:60794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:14 TP0] Prefill batch. #new-seq: 1, #new-token: 12, #cached-token: 2, token usage: 0.46, #running-req: 400, #queue-req: 217,
[2025-03-10 22:53:18] INFO: 127.0.0.1:60930 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:18 TP0] Prefill batch. #new-seq: 1, #new-token: 37, #cached-token: 2, token usage: 0.47, #running-req: 400, #queue-req: 216,
[2025-03-10 22:53:20] INFO: 127.0.0.1:60862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:22] INFO: 127.0.0.1:60942 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:26 TP0] Decode batch. #running-req: 399, #token: 149592, token usage: 0.48, gen throughput (token/s): 834.06, largest-len: 0, #queue-req: 216,
[2025-03-10 22:53:28] INFO: 127.0.0.1:60788 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:28] INFO: 127.0.0.1:60652 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:29] INFO: 127.0.0.1:60776 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:36] INFO: 127.0.0.1:60654 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:39] INFO: 127.0.0.1:60780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:40] INFO: 127.0.0.1:60586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:42] INFO: 127.0.0.1:60726 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:43 TP0] Decode batch. #running-req: 392, #token: 156301, token usage: 0.50, gen throughput (token/s): 908.57, largest-len: 0, #queue-req: 216,
[2025-03-10 22:53:44] INFO: 127.0.0.1:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:46] INFO: 127.0.0.1:60524 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:48] INFO: 127.0.0.1:60798 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:52] INFO: 127.0.0.1:60894 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:58] INFO: 127.0.0.1:60142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:53:58] INFO: 127.0.0.1:32844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:00] INFO: 127.0.0.1:60864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:00 TP0] Decode batch. #running-req: 385, #token: 161554, token usage: 0.52, gen throughput (token/s): 911.98, largest-len: 0, #queue-req: 216,
[2025-03-10 22:54:00] INFO: 127.0.0.1:60834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:01] INFO: 127.0.0.1:60546 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:03] INFO: 127.0.0.1:60576 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:04] INFO: 127.0.0.1:60918 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:08] INFO: 127.0.0.1:60664 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:10] INFO: 127.0.0.1:32890 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:13] INFO: 127.0.0.1:60722 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:13] INFO: 127.0.0.1:60626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:14] INFO: 127.0.0.1:32984 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-03-10 22:54:17 TP0] Decode batch. #running-req: 376, #token: 167461, token usage: 0.54, gen throughput (token/s): 906.96, largest-len: 0, #queue-req: 216,
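As for the gap between the two numbers: sglang's "gen throughput (token/s)" is an instantaneous decode rate over one log window, while llmuses averages over the full 1000-second wall clock and only credits requests that completed before --read-timeout. Tokens decoded for the hundreds of requests still running or queued (#running-req around 400, #queue-req 216 above) never enter its count. A rough reconciliation under that reading, sketch only:

# Rough reconciliation of the sglang and llmuses numbers (sketch only).
sglang_decode_tps = 908.0  # one decode-window rate from the log above
llmuses_tps = 537.683      # end-to-end average over completed requests

# llmuses divides by the full wall clock and ignores tokens decoded for
# requests that had not finished (or later timed out); sglang counts
# every token it decodes in the window.
uncounted = 1 - llmuses_tps / sglang_decode_tps
print(f"~{uncounted:.0%} of peak decode output is invisible to llmuses")

So the two tools measure different things: the sglang number is closer to an instantaneous decode rate, while the llmuses number is end-to-end goodput over completed requests.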