@@ -13,20 +13,23 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/

## Table of Contents

- - [Table of Contents](#table-of-contents)
- - [Hardware Requirements](#hardware-requirements)
- - [Downloading the Model Weights](#downloading-the-model-weights)
- - [Quick Start](#quick-start)
- - [Multi-Token Prediction (MTP)](#multi-token-prediction-mtp)
- - [Run evaluation on GPQA dataset](#run-evaluation-on-gpqa-dataset)
- - [Serving](#serving)
- - [Advanced Usages](#advanced-usages)
- - [Multi-node](#multi-node)
- - [mpirun](#mpirun)
- - [Slurm](#slurm)
- - [FlashMLA](#flashmla)
- - [DeepGEMM](#deepgemm)
- - [Notes and Troubleshooting](#notes-and-troubleshooting)
+ - [DeepSeek‑V3 and DeepSeek-R1](#deepseekv3-and-deepseek-r1)
+ - [Table of Contents](#table-of-contents)
+ - [Hardware Requirements](#hardware-requirements)
+ - [Downloading the Model Weights](#downloading-the-model-weights)
+ - [Quick Start](#quick-start)
+ - [Run a single inference](#run-a-single-inference)
+ - [Multi-Token Prediction (MTP)](#multi-token-prediction-mtp)
+ - [Run evaluation on GPQA dataset](#run-evaluation-on-gpqa-dataset)
+ - [Serving](#serving)
+ - [Advanced Usages](#advanced-usages)
+ - [Multi-node](#multi-node)
+ - [mpirun](#mpirun)
+ - [Slurm](#slurm)
+ - [Example: Multi-node benchmark on GB200 Slurm cluster](#example-multi-node-benchmark-on-gb200-slurm-cluster)
+ - [FlashMLA](#flashmla)
+ - [DeepGEMM](#deepgemm)
+ - [Notes and Troubleshooting](#notes-and-troubleshooting)


## Hardware Requirements
@@ -267,6 +270,86 @@ trtllm-llmapi-launch trtllm-bench --model deepseek-ai/DeepSeek-V3 --model_path /
bash -c "trtllm-llmapi-launch trtllm-bench --model deepseek-ai/DeepSeek-V3 --model_path <YOUR_MODEL_DIR> throughput --backend pytorch --max_batch_size 161 --max_num_tokens 1160 --dataset /workspace/dataset.txt --tp 16 --ep 4 --kv_cache_free_gpu_mem_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml"
```

+ #### Example: Multi-node benchmark on GB200 Slurm cluster
+
+ Step 1: Prepare the dataset and `extra-llm-api-config.yml`.
+ ```bash
+ python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
+     --tokenizer=/lustre/fsw/gtc_inference/common/DeepSeek-R1-nvfp4_allmoe \
+     --stdout token-norm-dist --num-requests=49152 \
+     --input-mean=1024 --output-mean=2048 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt
+
+ cat > /path/to/TensorRT-LLM/extra-llm-api-config.yml << EOF
+ pytorch_backend_config:
+   use_cuda_graph: true
+   cuda_graph_padding_enabled: true
+   cuda_graph_batch_sizes:
+   - 1
+   - 2
+   - 4
+   - 8
+   - 16
+   - 32
+   - 64
+   - 128
+   - 256
+   - 384
+   print_iter_log: true
+   enable_overlap_scheduler: true
+ enable_attention_dp: true
+ EOF
+ ```
+
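+ Optionally, sanity-check what Step 1 produced before going further. This is only a convenience sketch: it assumes `prepare_dataset.py` writes one JSON line per request (so the line count should match `--num-requests`) and that PyYAML is available in your Python environment.
+ ```bash
+ wc -l /tmp/dataset.txt              # expect roughly 49152 lines, one per synthetic request
+ head -c 300 /tmp/dataset.txt; echo  # peek at the first record
+ # confirm the YAML parses before handing it to trtllm-bench
+ python3 -c "import yaml; yaml.safe_load(open('/path/to/TensorRT-LLM/extra-llm-api-config.yml')); print('config OK')"
+ ```
+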
+ Step 2: Prepare `benchmark.slurm`.
+ ```bash
+ #!/bin/bash
+ #SBATCH --nodes=2
+ #SBATCH --ntasks=8
+ #SBATCH --ntasks-per-node=4
+ #SBATCH --partition=<partition>
+ #SBATCH --account=<account>
+ #SBATCH --time=02:00:00
+ #SBATCH --job-name=<job_name>
+
+ srun --container-image=${container_image} --container-mounts=${mount_dir}:${mount_dir} --mpi=pmix \
+     --output ${logdir}/bench_%j_%t.srun.out \
+     bash benchmark.sh
+ ```
+
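+ The `srun` line above references `${container_image}`, `${mount_dir}`, and `${logdir}` without defining them. One way to supply them is to export values from the shell that submits the job, relying on `sbatch`'s default `--export=ALL` behavior to propagate the environment into the batch script; the values below are placeholders to adapt to your cluster.
+ ```bash
+ export container_image=<trtllm_container_image>   # container image used to run the benchmark
+ export mount_dir=/path/to/TensorRT-LLM            # mounted into the container at the same path
+ export logdir=/path/to/benchmark_logs             # destination for bench_%j_%t.srun.out files
+ mkdir -p "${logdir}"
+ ```
+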
+ Step 3: Prepare `benchmark.sh`.
+ ```bash
+ #!/bin/bash
+ cd /path/to/TensorRT-LLM
+ # pip install build/tensorrt_llm*.whl
+ if [ $SLURM_LOCALID == 0 ]; then
+     pip install build/tensorrt_llm*.whl
+     echo "Install dependencies on rank 0."
+ else
+     echo "Sleep 60 seconds on other ranks."
+     sleep 60
+ fi
+
+ export PATH=${HOME}/.local/bin:${PATH}
+ export PYTHONPATH=/path/to/TensorRT-LLM
+ DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1  # optional
+
+ trtllm-llmapi-launch trtllm-bench \
+     --model deepseek-ai/DeepSeek-R1 \
+     --model_path $DS_R1_NVFP4_MODEL_PATH \
+     throughput --backend pytorch \
+     --num_requests 49152 \
+     --max_batch_size 384 --max_num_tokens 1536 \
+     --concurrency 3072 \
+     --dataset /path/to/dataset.txt \
+     --tp 8 --pp 1 --ep 8 --kv_cache_free_gpu_mem_fraction 0.85 \
+     --extra_llm_api_options ./extra-llm-api-config.yml --warmup 0
+ ```
+
+ Step 4: Submit the job to the Slurm cluster to launch the benchmark:
+ ```bash
+ sbatch --nodes=2 --ntasks=8 --ntasks-per-node=4 benchmark.slurm
+ ```
+
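+ Alternatively, capture the job id at submission time and follow rank 0's log while the benchmark runs. A minimal sketch, assuming `${logdir}` is the same directory passed to `--output` in `benchmark.slurm` (its `%j`/`%t` template expands to the job id and the task rank):
+ ```bash
+ JOB_ID=$(sbatch --parsable --nodes=2 --ntasks=8 --ntasks-per-node=4 benchmark.slurm)
+ squeue -j "${JOB_ID}"                            # check that the job is queued or running
+ tail -f "${logdir}/bench_${JOB_ID}_0.srun.out"   # follow rank 0's benchmark output once it starts
+ ```
+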
### FlashMLA
TensorRT-LLM has already integrated FlashMLA in the PyTorch backend. It is enabled automatically when running DeepSeek-V3/R1.