Request for Detailed Benchmarking Setup with TensorRT-LLM on B200
#6 · opened by StardusterLiu
Hi,
I’m trying to benchmark DeepSeek-R1-FP4 on B200 with TensorRT-LLM, following this guide: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/deepseek_v3/README.md, but the performance I’m getting with trtllm-bench is far below the reported numbers. The docs don’t provide enough detail about the exact setup, which makes it hard to debug or reproduce the results.
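For reference, here is roughly what I’m running. This is a sketch rather than my exact command: the synthetic 1K/1K dataset shape, the parallelism mapping, and the concurrency are guesses on my part (the official settings aren’t published), and flag names may differ between TensorRT-LLM versions:

```bash
# Synthetic dataset via the script shipped in the TensorRT-LLM repo
# (1K input / 1K output is a placeholder; the official ISL/OSL is unknown)
python benchmarks/cpp/prepare_dataset.py \
    --stdout --tokenizer nvidia/DeepSeek-R1-FP4 \
    token-norm-dist \
    --input-mean 1024 --output-mean 1024 \
    --input-stdev 0 --output-stdev 0 \
    --num-requests 1000 > dataset.txt

# Throughput run on one 8x B200 node with the PyTorch backend;
# the --tp/--ep/--concurrency values are guesses, not the official config
trtllm-bench --model nvidia/DeepSeek-R1-FP4 throughput \
    --dataset dataset.txt \
    --backend pytorch \
    --tp 8 --ep 8 \
    --concurrency 1024
```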
A few key issues:
- No clear info on the batch size, sequence length, precision settings, or KV cache config used in the official benchmarks, and no indication of whether any specific optimizations or tuning are required (see the sketch after this list for the kind of settings I mean).
- I’m not sure whether the latest main branch of TensorRT-LLM supports compiling R1 models into TensorRT engines, or whether R1 still only works with the PyTorch backend. I noticed a `deepseek` branch that suggests R1-FP8 TensorRT engine support in the 0.18 branch. Is that accurate?
- The model card only says the model needs TensorRT-LLM built from source from the latest main branch, without referencing this `deepseek` branch or clarifying the required steps for FP8/FP4 inference.
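To make the first bullet concrete, these are the kinds of knobs I’d expect the official numbers to depend on, and none of them are documented. A minimal sketch of what I mean, assuming the PyTorch backend’s `--extra_llm_api_options` YAML mechanism; the field names and values below are my assumptions from skimming the repo docs, not the official config:

```bash
# Hypothetical extra-options YAML; every value here is a question, not an answer
cat > extra_options.yaml <<'EOF'
enable_attention_dp: true        # was attention DP enabled for the MoE layers?
cuda_graph_config:
  enable_padding: true           # were CUDA graphs (with padding) used?
  max_batch_size: 128
kv_cache_config:
  free_gpu_memory_fraction: 0.85 # how much memory was reserved for KV cache?
  dtype: fp8                     # was the KV cache quantized?
EOF

# Then passed to the benchmark as:
#   trtllm-bench --model nvidia/DeepSeek-R1-FP4 throughput ... \
#       --extra_llm_api_options extra_options.yaml
```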
It would be great if NVIDIA could share a more detailed benchmarking guide, including the exact configs and expected numbers under different conditions. That would help ensure reproducibility and a fair evaluation of B200’s performance.
Thanks.