Request for Detailed Benchmarking Setup with TensorRT-LLM on B200

#6 · opened by StardusterLiu

Hi,

I’m trying to benchmark DeepSeek-R1-FP4 on B200 with TensorRT-LLM, following this guide: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/deepseek_v3/README.md, but the performance I’m getting from trtllm-bench is far below the reported numbers. The docs don’t provide enough detail on the exact setup, which makes the gap hard to debug and the results hard to reproduce. My current invocation is included below.
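For context, here is roughly what I’m running. This is a minimal sketch: the ISL/OSL, request count, parallelism, concurrency, and KV cache fraction are values I picked myself, not the official benchmark settings, and the exact flag names may differ between TensorRT-LLM versions.

```shell
# Build a synthetic dataset with fixed input/output lengths.
# prepare_dataset.py ships in the TensorRT-LLM repo under benchmarks/cpp/;
# the 1024/1024 ISL/OSL below is my own guess, not the official setting.
python benchmarks/cpp/prepare_dataset.py \
    --tokenizer nvidia/DeepSeek-R1-FP4 \
    --stdout token-norm-dist \
    --num-requests 1024 \
    --input-mean 1024 --input-stdev 0 \
    --output-mean 1024 --output-stdev 0 > dataset.jsonl

# Throughput run on 8x B200 with the PyTorch backend. The parallelism,
# concurrency, and KV cache fraction are values I chose, not documented ones.
trtllm-bench \
    --model nvidia/DeepSeek-R1-FP4 \
    throughput \
    --dataset dataset.jsonl \
    --backend pytorch \
    --tp 8 --ep 8 \
    --kv_cache_free_gpu_mem_fraction 0.85 \
    --concurrency 128
```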

A few key issues:

  • No clear info on the batch size, sequence length, precision settings, or KV cache configuration used in the official benchmarks, or on whether any specific optimizations or tuning are required (a concrete sketch of the knobs I mean follows this list).
  • I’m not sure whether the latest main branch of TensorRT-LLM supports compiling R1 models to TensorRT engines, or whether it still only works with the PyTorch backend. I noticed there is a deepseek branch that suggests R1-FP8 TensorRT engine support in the 0.18 branch?
  • The model card only says that TensorRT-LLM must be built from source from the latest main branch, without referencing this deepseek branch or clarifying the steps required for FP8/FP4 inference.
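To make the first point concrete, these are the kinds of knobs I mean, e.g. passed via --extra_llm_api_options. I’m assuming that flag and the kv_cache_config keys from the build I compiled; both may differ across versions, and the values here are illustrative, not what the official benchmarks used.

```shell
# Hypothetical tuning file -- key names are an assumption based on my build
# and may not match what the official benchmarks used.
cat > extra_options.yaml <<'EOF'
kv_cache_config:
  free_gpu_memory_fraction: 0.85
  enable_block_reuse: false
EOF

trtllm-bench --model nvidia/DeepSeek-R1-FP4 throughput \
    --dataset dataset.jsonl \
    --backend pytorch \
    --extra_llm_api_options extra_options.yaml
```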

It would be great if NVIDIA could share a more detailed benchmarking guide, including the exact configs and the expected numbers under different conditions. That would make the results reproducible and allow a fair evaluation of B200’s performance.

Thanks.
