v0.20.0rc0
Pre-release
Highlights
- Model Support
- Features
  - Added stream generation task scaffolding examples (#3527)
  - Added unfused RoPE support in MLA (#3610)
  - Multimodal models
    - [Experimental] The TensorRT-LLM Triton backend now supports the LLM API (triton-inference-server/tensorrtllm_backend#742); see the usage sketch after this list
- Performance
  - Optimized large embedding tables in multimodal models (#3380)
- Infra
  - Dependent `datasets` version was upgraded to 3.1.0 (#3490)
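A minimal sketch of the LLM API that the Triton backend can now drive (the model name and sampling settings here are illustrative, not taken from the release notes):

```python
from tensorrt_llm import LLM, SamplingParams

# Illustrative checkpoint; any model supported by the LLM API works here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The capital of France is"]
sampling = SamplingParams(temperature=0.8, max_tokens=32)

# generate() returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```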
What's Changed
- chore: Unify Python NVTX call by @kaiyux in #3450
- doc: genai-perf benchmark & slurm multi-node for trtllm-serve doc by @LinPoly in #3407
- fix: disable KV cache reuse if using attention sink by @Funatiq in #3021
- doc: Minor fixes for documents by @kaiyux in #3577
- fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1 by @VALLIS-NERIA in #3585
- chore: Mass integration of release/0.18 by @dcampora in #3421
- fix: LLM API _hf_model_dir for non-cached case by @syuoni in #3562
- ci: waive test_llm_multi_node_pytorch by @Superjomn in #3592
- fix: amend trtllm-bench command in the test by @Superjomn in #3563
- feat: Add stream generation task scaffolding examples by @narutolhy in #3527
- chore: bump version to 0.20.0rc0 by @ZhanruiSunCh in #3561
- chore: Add comments to modifications that fix TP size of DeepSeek-V3/R1 when using more than 16 GPUs by @jinyangyuan-nvidia in #3572
- chore: waive test_llm_phi_quantization_1gpu by @QiJune in #3603
- feat: Support cos_sin_cache in all cases. by @yuxianq in #3517
- fix: add SM90 guard for FP8 Blockscale GEMM by @lucifer1004 in #3575
- infra: Update user list by @niukuo in #3614
- feat: Adding FP8 BMM from Codegen by @evezhier in #3541
- waive test_llm_multi_node_with_postproc by @QiJune in #3628
- fix: Use hmac authentication for pickle encryption by @yibinl-nvidia in #3384 (see the HMAC sketch after this list)
- Clean up linear.py, mlp.py, gated_mlp.py by @hlu1 in #3553
- feat: Support CUDA graphs for EAGLE3 by @mikeiovine in #3176 (see the capture/replay sketch after this list)
- feat: Nemotron-H model support by @vegaluisjose in #3430
- waive test_fp8_scaled_mm by @QiJune in #3637
- disable ib for ucx test by @chuangz0 in #3613
- tests: change qa perf test to trtllm-bench by @ruodil in #3189
- test: add quickstart test for nemotron-ultra by @crazydemo in #3596
- feat: Add support for smaller hidden_dim in AR fusion kernel by @yilin-void in #3609
- Fix rotary_emb param in NemotronH attention by @vegaluisjose in #3646
- chore: Use ellipsis as default value to detect whether residual argument is provided by @yuxianq in #3626
- feat/loraOp by @danielafrimi in #3455
- Cherry-pick: update fp8 doc (#3647) by @litaotju in #3650
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3627
- Waive L0 tests by @yiqingy0 in #3651
- feat: allocate minimal blocks per window size by @netanel-haber in #3028
- test: remove benchmark test list on main branch by @crazydemo in #3644
- feat: Support unfused rope in MLA. by @yuxianq in #3610 (see the RoPE sketch after this list)
- fix: Fix fused_moe cache fallback issue. by @hyukn in #3652
- fix: Correct reporting of text dtype for Llama 4. by @FrankD412 in #3494
- chore: waive test_llm_multi_node by @QiJune in #3664
- chore: update multi gpu trigger file list by @QiJune in #3665
- feat: adding multimodal (only image for now) support in trtllm-bench by @rakib-hasan in #3490
- fix sage attention headsize check error in bertAttentionPlugin.cpp by @Jackch-NV in #3660
- fix: llama4: address couple of issues in llama4 attention module by @chang-l in #3491
- chore: Refactor test_disaggregated.py by @Tabrizian in #3154
- test: Add llama 4 to ci by @dongfengy in #3520
- chore : Split more tests out of gpt tests by @peaceh-nv in #3524
- infra: Add step to generate new duration file by @EmmaQiaoCh in #3298
- refactor: Clean up CMakeLists.txt by @tongyuantongyu in #3479
- test: Unwaive test for nvbug_5150466 by @hchings in #3552
- feat: Add Dynasor-CoT in scaffolding examples by @Fsanic in #3501
- feat: Integrate GPUDirect Storage (GDS) into Executor API by @DomBrown in #3582
- Remove dummy forward path by @HuiGao-NV in #3669
- fix: hmac in remote mpi session by @Superjomn in #3649
- test: add kv cache event tests for disagg workers by @zhengd-nv in #3602
- chore: enable test_ptp_quickstart_advanced_mixed_precision back by @QiJune in #3667
- feat: Disaggregated router class by @pcastonguay in #3584
- Update run.py to run the draft-target model with LLaMa 3 1B/8B by @mayani-nv in #3615
- feat: trtllm-serve multimodal support by @yechank-nvidia in #3590
- chore: Waive disaggregated load balance by @Tabrizian in #3687
- Clean up modeling_deepseek.py by @hlu1 in #3640
- fix: Fix disaggregated load balance test by @Tabrizian in #3689
- feat: Introduce feature properties for attention backend. by @yuxianq in #3659
- test:update waives.txt for nvbug 5219532 by @nv-guomingz in #3672
- test: Get Eagle tests working by @brb-nv in #3593
- move the rest of the models into the `examples/models/core` directory by @QiJune in #3555
- fix: Refactor Deepseek tp_size calculation by @hlu1 in #3695
- Update Nemotron Super and Ultra in Supported Models and add an example by @Naveassaf in #3632
- infra: Add test list name check by @EmmaQiaoCh in #3097
- feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend by @hlu1 in #3387
- Waive L0 tests by @yiqingy0 in #3709
- fix: update test_user_buffers_mm_add_prologue atol (#3711) by @liji-nv in #3713
- fix: Support TLLM_OVERRIDE_LAYER_NUM for llama4. by @yuxianq in #3679
- Report number of context tokens in one iteration by @HuiGao-NV in #3691
- fix: Remove ParallelConfig. by @yuxianq in #3678
- feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode by @katec846 in #3380
- fix: fix cublas_scaled_mm by @dc3671 in #3600
- chore: update FMHA cubin files by @jinyangyuan-nvidia in #3680
- test: add llama3.2 ptp test case by @StanleySun639 in #3363
- bug: Fix hang bug when context server doesn't have enough capacity for KV Cache by @Tabrizian in #3095
- refactor: use pybind block key and hasher in disagg worker test by @zhengd-nv in #3712
- Fix: nvbugs/5232457 ModelOpt Mixtral AWQ OOM by @Barry-Delaney in #3714
- ci: unwaive multi_node test by @Superjomn in #3715
- Revert "Report number of context tokens in one iteration (#3691)" by @kaiyux in #3740
- Fix/executor bugs by @byshiue in #3681
- test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA by @syuoni in #3483
- datasets API change: `datasets.load_metric` => `evaluate.load` by @rakib-hasan in #3741 (see the migration sketch after this list)
- fix: Remove unnecessary max call by @kaiyux in #3574
- test: Unwaive Llama 3.1 with torch compile test by @yizhang-nv in #3475
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3683
- refactor: Introduce DecoderOutputBuffers per batch by @Funatiq in #3506
- fix: fnmatch usage in modeling_utils.py (https://nvbugspro.nvidia.com/bug/5234567) by @syuoni in #3754
- test: change default max_batch_size to 512 in test config and dump config.json to log by @ruodil in #3657
- test: waive gemma on L20 by @crazydemo in #3767
- chore: remove useless allgather by @byshiue in #3751
- feat: Unify two versions of allreduce custom op by @hyukn in #3032
- doc: Update doc for Deepseek min latency by @zongfeijing in #3717
- Add log_level for disaggregated_mpi_worker by @qiaoxj07 in #3765
- infra: Add test stages for sm120 by @EmmaQiaoCh in #3533
- feat: [AutoDeploy] generalizing cudagraph to multiple dynamic inputs by @lucaslie in #3589
- chore: Move opencv-python import inside load_video() function by @AlessioNetti in #3768
- Fixing the metric fmeasure access by @rakib-hasan in #3774
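For the HMAC change (#3384), the general technique is to attach an HMAC tag to each pickled payload and verify it before unpickling, so a peer cannot inject arbitrary objects. A minimal sketch of that pattern; the key handling and 32-byte framing here are illustrative, not TensorRT-LLM's actual wire format:

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"shared-secret"  # illustrative; real code distributes the key securely

def sign_and_pickle(obj) -> bytes:
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload  # 32-byte SHA-256 tag, then the payload

def verify_and_unpickle(blob: bytes):
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # Constant-time comparison to avoid timing side channels.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: refusing to unpickle untrusted data")
    return pickle.loads(payload)

blob = sign_and_pickle({"step": 1})
print(verify_and_unpickle(blob))
```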
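CUDA-graph support for EAGLE3 (#3176) builds on the generic capture/replay mechanism; the sketch below shows that mechanism with a toy PyTorch model, not the EAGLE3 integration itself:

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
static_in = torch.zeros(8, 16, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph; replay re-runs the recorded
# kernels on whatever data is in the static input buffer.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

static_in.copy_(torch.randn(8, 16, device="cuda"))
graph.replay()
print(static_out.sum().item())
```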
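The unfused-RoPE path (#3610) applies the rotary embedding outside the fused attention kernel, and the cos/sin table it rotates by is what a cos_sin_cache (#3517) precomputes. A minimal PyTorch sketch of the standard rotation; the pairing convention and shapes are illustrative:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: [seq_len, head_dim]; rotate each even/odd pair by pos * theta_i.
    seq_len, head_dim = x.shape
    inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()  # this table is what a cos_sin_cache stores
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8, 64)
print(rope(q).shape)  # torch.Size([8, 64])
```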
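The datasets API change (#3741) tracks the upstream removal of `datasets.load_metric` in datasets 3.x: metrics now live in the separate evaluate package (the related #3774 adjusts how the ROUGE f-measure is read, since newer evaluate versions return plain floats). A minimal before/after:

```python
# Before (datasets < 3.0):
#   from datasets import load_metric
#   rouge = load_metric("rouge")

# After (datasets >= 3.0); requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=["hello there"], references=["hello there"])
print(scores["rougeL"])  # a plain float in recent evaluate versions
```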
New Contributors
- @dcampora made their first contribution in #3421
- @narutolhy made their first contribution in #3527
- @evezhier made their first contribution in #3541
- @vegaluisjose made their first contribution in #3430
- @rakib-hasan made their first contribution in #3490
- @Jackch-NV made their first contribution in #3660
- @dongfengy made their first contribution in #3520
- @Fsanic made their first contribution in #3501
- @mayani-nv made their first contribution in #3615
- @Naveassaf made their first contribution in #3632
- @katec846 made their first contribution in #3380
- @StanleySun639 made their first contribution in #3363
- @Barry-Delaney made their first contribution in #3714
- @yizhang-nv made their first contribution in #3475
- @qiaoxj07 made their first contribution in #3765
- @AlessioNetti made their first contribution in #3768
Full Changelog: v0.19.0rc0...v0.20.0rc0