
Commit ab229d2

Add example section: "Example: Multi-node benchmark on GB200"

Signed-off-by: Kaiyu Xie <[email protected]>

1 parent af67bf0 commit ab229d2

1 file changed: examples/deepseek_v3/README.md (+97 −14 lines)
The first hunk updates the table of contents:

```diff
@@ -13,20 +13,23 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/

 ## Table of Contents

-- [Table of Contents](#table-of-contents)
-- [Hardware Requirements](#hardware-requirements)
-- [Downloading the Model Weights](#downloading-the-model-weights)
-- [Quick Start](#quick-start)
-- [Multi-Token Prediction (MTP)](#multi-token-prediction-mtp)
-- [Run evaluation on GPQA dataset](#run-evaluation-on-gpqa-dataset)
-- [Serving](#serving)
-- [Advanced Usages](#advanced-usages)
-  - [Multi-node](#multi-node)
-    - [mpirun](#mpirun)
-    - [Slurm](#slurm)
-  - [FlashMLA](#flashmla)
-  - [DeepGEMM](#deepgemm)
-- [Notes and Troubleshooting](#notes-and-troubleshooting)
+- [DeepSeek‑V3 and DeepSeek-R1](#deepseekv3-and-deepseek-r1)
+  - [Table of Contents](#table-of-contents)
+  - [Hardware Requirements](#hardware-requirements)
+  - [Downloading the Model Weights](#downloading-the-model-weights)
+  - [Quick Start](#quick-start)
+    - [Run a single inference](#run-a-single-inference)
+    - [Multi-Token Prediction (MTP)](#multi-token-prediction-mtp)
+    - [Run evaluation on GPQA dataset](#run-evaluation-on-gpqa-dataset)
+    - [Serving](#serving)
+  - [Advanced Usages](#advanced-usages)
+    - [Multi-node](#multi-node)
+      - [mpirun](#mpirun)
+      - [Slurm](#slurm)
+      - [Example: Multi-node benchmark on GB200 Slurm cluster](#example-multi-node-benchmark-on-gb200-slurm-cluster)
+    - [FlashMLA](#flashmla)
+    - [DeepGEMM](#deepgemm)
+  - [Notes and Troubleshooting](#notes-and-troubleshooting)


 ## Hardware Requirements
```
The second hunk (`@@ -267,6 +270,86 @@`) appends the new example subsection after the end of the existing multi-node Slurm benchmark command:

```bash
bash -c "trtllm-llmapi-launch trtllm-bench --model deepseek-ai/DeepSeek-V3 --model_path <YOUR_MODEL_DIR> throughput --backend pytorch --max_batch_size 161 --max_num_tokens 1160 --dataset /workspace/dataset.txt --tp 16 --ep 4 --kv_cache_free_gpu_mem_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml"
```

The added content:
#### Example: Multi-node benchmark on GB200 Slurm cluster

Step 1: Prepare the dataset and `extra-llm-api-config.yml`.

```bash
python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
    --tokenizer=/lustre/fsw/gtc_inference/common/DeepSeek-R1-nvfp4_allmoe \
    --stdout token-norm-dist --num-requests=49152 \
    --input-mean=1024 --output-mean=2048 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt

cat > /path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
pytorch_backend_config:
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes:
  - 1
  - 2
  - 4
  - 8
  - 16
  - 32
  - 64
  - 128
  - 256
  - 384
  print_iter_log: true
  enable_overlap_scheduler: true
  enable_attention_dp: true
EOF
```
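As a quick sanity check before spending cluster time (a hypothetical step, not part of this commit), you can confirm that `prepare_dataset.py` wrote one request per line and that the count matches `--num-requests`:

```bash
# Hypothetical verification: the dataset is one JSON request per line,
# so the line count should equal --num-requests (49152).
wc -l /tmp/dataset.txt        # expect 49152
head -c 300 /tmp/dataset.txt  # inspect the beginning of the first request
```
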
Step 2: Prepare `benchmark.slurm`.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --partition=<partition>
#SBATCH --account=<account>
#SBATCH --time=02:00:00
#SBATCH --job-name=<job_name>

srun --container-image=${container_image} --container-mounts=${mount_dir}:${mount_dir} --mpi=pmix \
    --output ${logdir}/bench_%j_%t.srun.out \
    bash benchmark.sh
```
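Note that `benchmark.slurm` reads `container_image`, `mount_dir`, and `logdir` from the submission environment without defining them. A minimal sketch of setting them before `sbatch` (the values are illustrative placeholders, not part of this commit):

```bash
# Placeholders only: point these at your container image, the directory
# to mount into the container, and a log directory that exists.
export container_image=<container_image>
export mount_dir=/path/to/TensorRT-LLM
export logdir=/path/to/logs
mkdir -p ${logdir}
```
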
Step 3: Prepare `benchmark.sh`.

```bash
#!/bin/bash
cd /path/to/TensorRT-LLM

# Install the TensorRT-LLM wheel once per node (SLURM_LOCALID is the
# node-local rank); the other ranks sleep so they do not start the
# benchmark before the install finishes.
if [ "$SLURM_LOCALID" == 0 ]; then
    pip install build/tensorrt_llm*.whl
    echo "Install dependencies on rank 0."
else
    echo "Sleep 60 seconds on other ranks."
    sleep 60
fi

export PATH=${HOME}/.local/bin:${PATH}
export PYTHONPATH=/path/to/TensorRT-LLM
DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1  # optional

trtllm-llmapi-launch trtllm-bench \
    --model deepseek-ai/DeepSeek-R1 \
    --model_path $DS_R1_NVFP4_MODEL_PATH \
    throughput --backend pytorch \
    --num_requests 49152 \
    --max_batch_size 384 --max_num_tokens 1536 \
    --concurrency 3072 \
    --dataset /path/to/dataset.txt \
    --tp 8 --pp 1 --ep 8 --kv_cache_free_gpu_mem_fraction 0.85 \
    --extra_llm_api_options ./extra-llm-api-config.yml --warmup 0
```

The 8 ranks launched by Slurm (2 nodes × 4 GPUs) match `--tp 8 --ep 8`, one rank per GB200 GPU; with `enable_attention_dp`, the total `--concurrency 3072` corresponds to 3072 / 8 = 384 requests per rank, which lines up with `--max_batch_size 384`.
Step 4: Submit the job to the Slurm cluster to launch the benchmark:

```bash
sbatch --nodes=2 --ntasks=8 --ntasks-per-node=4 benchmark.slurm
```
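After submission (a hypothetical follow-up, not part of this commit), the job can be watched with standard Slurm tooling; the log name comes from the `--output` pattern in `benchmark.slurm`, where `%j` expands to the job ID and `%t` to the task rank:

```bash
squeue -u $USER                              # confirm the job is queued/running
tail -f ${logdir}/bench_<job_id>_0.srun.out  # follow rank 0's benchmark log
```
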
### FlashMLA
TensorRT-LLM has already integrated FlashMLA in the PyTorch backend. It is enabled automatically when running DeepSeek-V3/R1.