
Commit f9959ac

ZhanruiSunCh authored and DomBrown committed
chore: bump version to 0.19.0 (NVIDIA#3598)
Signed-off-by: ZhanruiSunCh

Squashed changes included in this commit:

* test: add test cases for 0.19 release (NVIDIA#3608) — fix test name; add quickstart test for nemotron-ultra; add rcca multi-node test case for deepseek-v3; add rcca info. Signed-off-by: Ivy Zhang
* squash (NVIDIA#3642). Signed-off-by: Enwei Zhu
* fix: nvbugs/5187237: fix deterministic mode crash (NVIDIA#3448) — fix deterministic mode error (nvbugs/5187237, nvbugs/5112075); remove waive, then revert "remove waive" (reverts commit 0bf5486); revert ar fusion. Signed-off-by: Xiwen Yu
* update fp8 doc (NVIDIA#3647). Signed-off-by: taoli; Co-authored-by: taoli
* tests: change qa perf test to trtllm-bench (NVIDIA#3619). Signed-off-by: Ruodi; Co-authored-by: Larry
* fix: FP8 quantized lm_head (NvBug 5214229) (NVIDIA#3567). Signed-off-by: Enwei Zhu
* infra: Add PR approval protection for the release branch (NVIDIA#3634). Signed-off-by: Yanchao Lu
* fix: nvbugs/5231298: pytorch allreduce issue (NVIDIA#3673). Signed-off-by: Xiwen Yu
* Fix: nvbugs/5222698 variable not defined (NVIDIA#3630) — fix undefined variable; tidy code. Signed-off-by: Zongfei Jing
* test: sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (NVIDIA#3685). Signed-off-by: nv-guomingz
* test: restore fp8 kv cache testing for L0 (NVIDIA#3671). Signed-off-by: nv-guomingz
* doc: Update DeepSeek perf docs (NVIDIA#3693) — update DeepSeek perf docs; apply suggestions from code review. Signed-off-by: Kaiyu Xie; Co-authored-by: Copilot
* tests: waive test_llm_multi_node (NVIDIA#3664). Signed-off-by: junq
* fix: update test_user_buffers_mm_add_prologue atol (NVIDIA#3711). Signed-off-by: Jin Li
* Fix: cherry-pick hmac encryption from main branch (NVIDIA#3635) — security fix cherry-picked from main; fix hmac in remote mpi session (NVIDIA#3649). Signed-off-by: Yibin Li, Yan Chunwei; Co-authored-by: Yan Chunwei
* Un-waive DS-V3-Lite tests. (NVIDIA#3621). Signed-off-by: Tracin
* fix: FP8 kv accuracy (NVIDIA#3675) — fix FP8 kv accuracy; update doc. Signed-off-by: Dylan Chen
* Fix script options for engines. (NVIDIA#3622). Signed-off-by: Tracin
* unwaive multi-node test (NVIDIA#3721). Signed-off-by: Superjomn
* chore: Split more tests out of gpt tests (NVIDIA#3524) (NVIDIA#3674). Signed-off-by: peaceh
* doc: add torch examples link into torch backend documentation (NVIDIA#3749). Signed-off-by: nv-guomingz; Co-authored-by: nv-guomingz
* test: Get Eagle tests working (NVIDIA#3593) (NVIDIA#3722). Signed-off-by: Balaram Buddharaju; Co-authored-by: brb-nv
* Waive L0 test (NVIDIA#3756). Signed-off-by: Yiqing Yan
* waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (NVIDIA#3656). Signed-off-by: Ruodi, Larry; Co-authored-by: Larry
* Update ds v3 parameters in stress test. (NVIDIA#3676)
* waive gemma on L20 (NVIDIA#3766). Signed-off-by: Ivy Zhang
* https://nvbugs/5141291: Fix convert.py script for Qwen model. (NVIDIA#3758) — include Qwen2VLDecoderLayer in the smooth_qwen2_model function. Signed-off-by: Yukun He
* fix: PP4 fixes and cleanup (NVIDIA#3688). Signed-off-by: Anurag Mukkara; Co-authored-by: Sharan Chetlur
* remove benchmark test list (NVIDIA#3643). Signed-off-by: Ivy Zhang
* skip disagg deepseek test if sm!=90 (NVIDIA#3720). Signed-off-by: Chuang Zhu
* test: skip failed cases on B200 (NVIDIA#3710) — add skip condition to tests; fix error. Signed-off-by: xinhe-nv
* test: [nvbug: 5234494] skip_pre_ada for fp8 cases (NVIDIA#3718) — skip_pre_ada for fp8 cases; update after rebase. Signed-off-by: Ivy Zhang
* add known issue to deepseek doc. (NVIDIA#3800). Signed-off-by: Fanrong Li
* Fix ModelOpt Mixtral AWQ OOM (NVIDIA#3714) (NVIDIA#3761). Signed-off-by: Barry Kang; Co-authored-by: Larry
* Waive L0 tests (NVIDIA#3826). Signed-off-by: Yiqing Yan
* fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue (NVIDIA#3793) — reduce memory usage in the fused moe op associated with AutoTuning; replace the pre-defined bucket size strategy with a generating function based on tune_max_num_tokens; add free_memory logic for the workspace in the min_latency_mode fused moe path; fix the fused_moe fallback issue (NVIDIA#3652): min_latency_mode is only set to False during the warmup phase, so when it becomes True during inference all tactics fall back to the default one, causing a perf regression. Signed-off-by: Yukun He
* [doc] Better document for Draft-Target-Model (DTM) speculative decoding (NVIDIA#3797). Signed-off-by: wili-65535

Signed-off-by: Dom Brown
1 parent 1d51788 commit f9959ac

46 files changed: +474 -190 lines

.github/CODEOWNERS (+5)

@@ -0,0 +1,5 @@
+# This file defines code ownership rules for the repository.
+# The rule below requires that any PR to release/**/* branches must be approved by at least one member
+# of the NVIDIA/trt-llm-release-branch-approval team, regardless of who else approves the PR.
+# Without approval from a member of this team, PRs cannot be merged to release branches.
+* @NVIDIA/trt-llm-release-branch-approval

cpp/tensorrt_llm/common/attentionOp.cpp (+9 -4)

@@ -915,7 +915,8 @@ int AttentionOp::mlaGeneration(
     params.quant_scale_kv = generation_params.kv_scale_orig_quant;
     params.dequant_scale_q = generation_params.kv_scale_quant_orig;
     params.dequant_scale_kv = generation_params.kv_scale_quant_orig;
-    params.host_bmm1_scale = 1 / (sqrt((float) (mMLAParams.qk_nope_head_dim + mMLAParams.qk_rope_head_dim)));
+    params.host_bmm1_scale
+        = 1 / (mQScaling * sqrt((float) (mMLAParams.qk_nope_head_dim + mMLAParams.qk_rope_head_dim)));

     invokeMLARopeGeneration<T>(params, kv_cache_buffer, stream);
     sync_check_cuda_error(stream);

@@ -1001,9 +1002,13 @@ int AttentionOp::mlaGeneration(
     tllmRunnerParams.mSfStartTokenIdx = generation_params.start_token_idx_sf;

     // Scales for quantization
-    static constexpr int bmm1_scale_offset = 1;
-    tllmRunnerParams.outputScalePtr = reinterpret_cast<float const*>(params.bmm2_scale);
-    tllmRunnerParams.scaleSoftmaxLog2Ptr = reinterpret_cast<float const*>(params.bmm1_scale) + bmm1_scale_offset;
+    if (mFP8GenerationMLA)
+    {
+        static constexpr int bmm1_scale_offset = 1;
+        tllmRunnerParams.outputScalePtr = reinterpret_cast<float const*>(params.bmm2_scale);
+        tllmRunnerParams.scaleSoftmaxLog2Ptr
+            = reinterpret_cast<float const*>(params.bmm1_scale) + bmm1_scale_offset;
+    }

     TLLM_CHECK_WITH_INFO(mTllmGenFMHARunner.get(), "mTllmGenFMHARunner not initialized.");
     mTllmGenFMHARunner->run(tllmRunnerParams);
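For reference, the first hunk changes the host-side BMM1 (softmax) scale used by MLA generation from 1/sqrt(d) to 1/(q_scaling * sqrt(d)), where d is the sum of the nope and rope head dimensions; the second hunk only wires the quantization scale pointers when FP8 MLA generation is enabled. Below is a minimal sketch of the corrected scale computation; the free-standing function and the example dimensions are illustrative, not the class's actual API.

```cuda
#include <cmath>

// BMM1 / softmax scale for MLA generation: fold the model's q_scaling
// factor into the usual 1/sqrt(head_dim) attention scale.
inline float mlaHostBmm1Scale(float q_scaling, int qk_nope_head_dim, int qk_rope_head_dim)
{
    float const head_dim = static_cast<float>(qk_nope_head_dim + qk_rope_head_dim);
    return 1.0f / (q_scaling * std::sqrt(head_dim));
}

// Example with illustrative DeepSeek-style MLA dimensions:
// mlaHostBmm1Scale(1.0f, 128, 64) gives 1/sqrt(192), matching the old
// behavior only when q_scaling == 1; any other q_scaling now changes the scale.
```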

cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu (+2 -1)

@@ -28,7 +28,8 @@ __global__ void lamport_initialize_kernel(float* ptr, int size)

 void lamport_initialize(void* ptr, int bytes, cudaStream_t stream)
 {
-    lamport_initialize_kernel<<<bytes / 128, 128, 0, stream>>>(reinterpret_cast<float*>(ptr), bytes / sizeof(float));
+    int grid_size = (bytes + 127) / 128;
+    lamport_initialize_kernel<<<grid_size, 128, 0, stream>>>(reinterpret_cast<float*>(ptr), bytes / sizeof(float));
 }

 Workspace::Workspace(int rank, int tp_size, int max_token_num, int hidden_dim,
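The one-line fix here replaces truncating division with a round-up when sizing the launch grid, so byte counts that are not a multiple of 128 still get a block for the tail, and counts below 128 no longer produce a zero-block launch. A minimal sketch of the arithmetic; the ceilDiv helper name is illustrative:

```cuda
// Ceiling division: number of blocks of `block` units needed to cover `n` units.
inline int ceilDiv(int n, int block)
{
    return (n + block - 1) / block;
}

// bytes / 128 truncates:       200 bytes -> 1 block, 64 bytes -> 0 blocks.
// ceilDiv(bytes, 128) rounds up: 200 bytes -> 2 blocks, 64 bytes -> 1 block.
```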

cpp/tensorrt_llm/kernels/customAllReduceKernels.cu (+4)

@@ -1989,6 +1989,10 @@ void residualRmsNorm(
 void lamportInitialize(void* buffer, size_t size, nvinfer1::DataType dataType, cudaStream_t stream)
 {
     sync_check_cuda_error(stream);
+    if (size == 0)
+    {
+        return;
+    }
     switch (dataType)
     {
     case nvinfer1::DataType::kFLOAT:
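Even with a rounded-up grid, a zero-sized buffer is a separate edge case: a grid dimension of 0 is an invalid CUDA launch configuration, which is presumably why lamportInitialize now returns early when size == 0 before dispatching on the data type. A minimal sketch of that guard pattern around a generic element-wise kernel (kernel and function names are illustrative, not the repo's actual kernels):

```cuda
#include <cuda_runtime.h>

__global__ void fillKernel(float* ptr, size_t n, float value)
{
    size_t const idx = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx < n) // bounds check covers the last, partially used block
    {
        ptr[idx] = value;
    }
}

void fill(float* ptr, size_t n, float value, cudaStream_t stream)
{
    if (n == 0)
    {
        return; // skip the launch: a 0-block grid is an invalid configuration
    }
    int const blockSize = 128;
    int const gridSize = static_cast<int>((n + blockSize - 1) / blockSize);
    fillKernel<<<gridSize, blockSize, 0, stream>>>(ptr, n, value);
}
```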

cpp/tensorrt_llm/thop/moeOp.cpp (+17 -14)

@@ -163,17 +163,12 @@ class FusedMoeRunner : public torch::CustomClassHolder
         torch::optional<c10::ArrayRef<int64_t>> profile_ids)
     {
         // Free the profile workspace to save memory
-        if (mProfileWorkspace != nullptr)
-        {
-            auto const cu_free_status = cudaFree(mProfileWorkspace);
-            TORCH_CHECK(
-                cu_free_status == cudaSuccess, "Can't free profile workspace for MoE GEMM profile before runMoe.");
-            mProfileWorkspace = nullptr;
-        }
+        freeProfileWorkspace();

         std::lock_guard<std::mutex> lock(mMutex);

         TORCH_CHECK(cluster_size == 1 && cluster_rank == 0, "smart_router is supported in min_latency mode");
+
         CHECK_INPUT(input, mActivationDtype)
         CHECK_INPUT(token_selected_experts, at::ScalarType::Int)
         if (token_final_scales)

@@ -251,6 +246,9 @@ class FusedMoeRunner : public torch::CustomClassHolder
     {
         std::lock_guard<std::mutex> lock(mMutex);

+        // Free the profile workspace to save memory
+        freeProfileWorkspace();
+
         CHECK_INPUT(input, mActivationDtype)
         CHECK_INPUT(token_selected_experts, at::ScalarType::Int)
         if (token_final_scales)

@@ -381,13 +379,7 @@ class FusedMoeRunner : public torch::CustomClassHolder
             hidden_size, inter_size, GROUP_SIZE, tensorrt_llm::ActivationType::Swiglu, USE_BIAS, USE_LORA,
             min_latency_mode, parallelism_config);

-        if (mProfileWorkspace != nullptr)
-        {
-            auto const cu_free_status = cudaFree(mProfileWorkspace);
-            TORCH_CHECK(cu_free_status == cudaSuccess,
-                "Can't free profile workspace for MoE GEMM profile during memory reallocation.");
-            mProfileWorkspace = nullptr;
-        }
+        freeProfileWorkspace();
         size_t profile_workspace_size = mProfiler->getWorkspaceSize(num_rows);
         auto const cu_malloc_status = cudaMalloc(&mProfileWorkspace, profile_workspace_size);
         TORCH_CHECK(cu_malloc_status == cudaSuccess, "Can't allocate profile workspace for MoE GEMM profile.");

@@ -422,6 +414,17 @@ class FusedMoeRunner : public torch::CustomClassHolder
     using Profile = tensorrt_llm::cutlass_extensions::CutlassGemmConfig;
     std::vector<Profile> mAllProfiles;

+    void freeProfileWorkspace()
+    {
+        if (mProfileWorkspace != nullptr)
+        {
+            auto const cu_free_status = cudaFree(mProfileWorkspace);
+            TORCH_CHECK(cu_free_status == cudaSuccess,
+                "Can't free profile workspace for MoE GEMM profile during memory reallocation.");
+            mProfileWorkspace = nullptr;
+        }
+    }
+
     void setRunnerProfiles(torch::optional<c10::ArrayRef<int64_t>> profile_ids)
     {
         if (mUseFp8BlockScaling)
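The refactor in this file hoists the repeated "check, cudaFree, TORCH_CHECK, reset to nullptr" sequence into a single freeProfileWorkspace() helper called from the three sites. An alternative way to express the same "free exactly once, from several call sites" intent is an RAII holder that frees in its destructor; the sketch below is only that alternative, not what the diff does, and it silently ignores cudaFree/cudaMalloc errors instead of surfacing them through TORCH_CHECK:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Owns a cudaMalloc'd profiling workspace and guarantees it is freed at most once.
class ProfileWorkspace
{
public:
    ProfileWorkspace() = default;
    ProfileWorkspace(ProfileWorkspace const&) = delete;
    ProfileWorkspace& operator=(ProfileWorkspace const&) = delete;

    ~ProfileWorkspace()
    {
        release();
    }

    // Free any current allocation, then optionally allocate a new one.
    void reset(size_t newSize = 0)
    {
        release();
        if (newSize > 0)
        {
            cudaMalloc(&mPtr, newSize); // error handling omitted in this sketch
        }
    }

    void* get() const
    {
        return mPtr;
    }

private:
    void release()
    {
        if (mPtr != nullptr)
        {
            cudaFree(mPtr); // error handling omitted in this sketch
            mPtr = nullptr;
        }
    }

    void* mPtr = nullptr;
};
```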

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (+49)

@@ -4,6 +4,33 @@ NVIDIA has announced world-record DeepSeek-R1 inference performance at NVIDIA GT

 In this blog, we share the configurations and procedures about how to reproduce the number on both B200 and H200 with PyTorch workflow.

+## Table of Contents
+
+- [How to get best performance on DeepSeek-R1 in TensorRT-LLM](#how-to-get-best-performance-on-deepseek-r1-in-tensorrt-llm)
+  - [Table of Contents](#table-of-contents)
+  - [Prerequisites: Install TensorRT-LLM and download models](#prerequisites-install-tensorrt-llm-and-download-models)
+    - [1. Download TensorRT-LLM](#1-download-tensorrt-llm)
+    - [2. Download the DeepSeek R1 models](#2-download-the-deepseek-r1-models)
+    - [3. Build and run TensorRT-LLM container](#3-build-and-run-tensorrt-llm-container)
+    - [4. Compile and Install TensorRT-LLM](#4-compile-and-install-tensorrt-llm)
+    - [5. Optional: Tune GPU clocks](#5-optional-tune-gpu-clocks)
+    - [6. Dataset preparation](#6-dataset-preparation)
+  - [Reproducing steps](#reproducing-steps)
+    - [B200 min-latency](#b200-min-latency)
+      - [Expected Results](#expected-results)
+    - [B200 max-throughput](#b200-max-throughput)
+      - [Benchmark](#benchmark)
+      - [Expected Result Format](#expected-result-format)
+    - [H200 min-latency](#h200-min-latency)
+      - [Expected Result Format](#expected-result-format-1)
+    - [H200 max-throughput](#h200-max-throughput)
+      - [Expected Result Format](#expected-result-format-2)
+  - [Exploring more ISL/OSL combinations](#exploring-more-islosl-combinations)
+    - [WIP: Enable more features by default](#wip-enable-more-features-by-default)
+    - [WIP: Chunked context support on DeepSeek models](#wip-chunked-context-support-on-deepseek-models)
+    - [Out of memory issues](#out-of-memory-issues)
+
 ## Prerequisites: Install TensorRT-LLM and download models

 This section can be skipped if you already have TensorRT-LLM installed and have already downloaded the DeepSeek R1 model checkpoint.

@@ -324,3 +351,25 @@ Total Token Throughput (tokens/sec): 15707.0888
 Total Latency (ms): 993548.8470
 Average request latency (ms): 197768.0434
 ```
+
+## Exploring more ISL/OSL combinations
+
+To benchmark TensorRT-LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use commands similar to those in the previous section. TensorRT-LLM is working on enhancements to make the benchmarking process smoother.
+
+### WIP: Enable more features by default
+
+Currently, some features need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as CUDA graph, overlap scheduler, and attention DP. We're working on enabling those features by default so that users get good out-of-the-box performance on DeepSeek models.
+
+Note that `max_batch_size` and `max_num_tokens` can easily affect performance. Their default values are carefully chosen and should deliver good performance in most cases; however, you may still need to tune them for peak performance.
+
+Generally, make sure that `max_batch_size` is not so low that it bottlenecks throughput, and that `max_num_tokens` is large enough to cover the maximum input sequence length of the samples in the dataset, as mentioned in the section "WIP: Chunked context support on DeepSeek models" below.
+
+For more details on `max_batch_size` and `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).
+
+### WIP: Chunked context support on DeepSeek models
+
+The TensorRT-LLM team is actively working on chunked context support for DeepSeek models. Until that feature lands, there is a limitation that `max_num_tokens` has to be at least the maximum input sequence length of the samples in the dataset.
+For more details on `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).
+
+### Out of memory issues
+
+It is possible to hit OOM issues in some cases. Consider reducing `kv_cache_free_gpu_mem_fraction` to a smaller value as a workaround. We're investigating and working to address the problem.

docs/source/torch.md (+2 -1)

@@ -41,6 +41,7 @@ scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --ex

 - [Architecture Overview](./torch/arch_overview.md)
 - [Adding a New Model](./torch/adding_new_model.md)
+- [Examples](../../examples/pytorch/README.md)

 ## Key Components

@@ -50,4 +51,4 @@

 ## Known Issues

-- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.
+- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.
