Request for Detailed Benchmarking Setup with TensorRT-LLM on B200

#6 · opened by StardusterLiu

Hi,

I’m trying to benchmark DeepSeek-R1-FP4 on B200 with TensorRT-LLM, following this guide: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/deepseek_v3/README.md, but the performance I’m getting from trtllm-bench is far below the reported numbers. The docs don’t provide enough detail on the exact setup, which makes the gap hard to debug and the results hard to reproduce. My current invocation is included below.
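For context, here is roughly what I’m running. This is a minimal sketch: the ISL/OSL, request count, parallelism, concurrency, and KV cache fraction are values I picked myself, not the official benchmark settings, and the exact flag names may differ between TensorRT-LLM versions.

```shell
# Build a synthetic dataset with fixed input/output lengths.
# prepare_dataset.py ships in the TensorRT-LLM repo under benchmarks/cpp/;
# the 1024/1024 ISL/OSL below is my own guess, not the official setting.
python benchmarks/cpp/prepare_dataset.py \
    --tokenizer nvidia/DeepSeek-R1-FP4 \
    --stdout token-norm-dist \
    --num-requests 1024 \
    --input-mean 1024 --input-stdev 0 \
    --output-mean 1024 --output-stdev 0 > dataset.jsonl

# Throughput run on 8x B200 with the PyTorch backend. The parallelism,
# concurrency, and KV cache fraction are values I chose, not documented ones.
trtllm-bench \
    --model nvidia/DeepSeek-R1-FP4 \
    throughput \
    --dataset dataset.jsonl \
    --backend pytorch \
    --tp 8 --ep 8 \
    --kv_cache_free_gpu_mem_fraction 0.85 \
    --concurrency 128
```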

A few key issues:

  • No clear info on the batch size, sequence length, precision settings, or KV cache configuration used in the official benchmarks, or on whether any specific optimizations or tuning are required (a concrete sketch of the knobs I mean follows this list).
  • I’m not sure whether the latest main branch of TensorRT-LLM supports compiling R1 models to TensorRT engines, or whether it still only works with the PyTorch backend. I noticed there is a deepseek branch that suggests R1-FP8 TensorRT engine support in the 0.18 branch?
  • The model card only says that TensorRT-LLM must be built from source from the latest main branch, without referencing this deepseek branch or clarifying the steps required for FP8/FP4 inference.
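To make the first point concrete, these are the kinds of knobs I mean, e.g. passed via --extra_llm_api_options. I’m assuming that flag and the kv_cache_config keys from the build I compiled; both may differ across versions, and the values here are illustrative, not what the official benchmarks used.

```shell
# Hypothetical tuning file -- key names are an assumption based on my build
# and may not match what the official benchmarks used.
cat > extra_options.yaml <<'EOF'
kv_cache_config:
  free_gpu_memory_fraction: 0.85
  enable_block_reuse: false
EOF

trtllm-bench --model nvidia/DeepSeek-R1-FP4 throughput \
    --dataset dataset.jsonl \
    --backend pytorch \
    --extra_llm_api_options extra_options.yaml
```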

It would be great if NVIDIA could share a more detailed benchmarking guide, including the exact configs and the expected numbers under different conditions. That would make the results reproducible and allow a fair evaluation of B200’s performance.

Thanks.
