v0.20.0rc0
Pre-release
Highlights
- Model Support
- Features
  - Added stream generation task scaffolding examples (#3527)
  - Added unfused RoPE support in MLA (#3610)
  - Multimodal models
    - [Experimental] The TensorRT-LLM Triton backend now supports the LLM API (triton-inference-server/tensorrtllm_backend#742); see the usage sketch after this list
- Performance
  - Optimized large embedding tables in multimodal models (#3380)
- Infra
  - Dependent `datasets` version was upgraded to 3.1.0 (#3490)
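A minimal sketch of the LLM API that the Triton backend can now drive (the model name and sampling settings here are illustrative, not taken from the release notes):

```python
from tensorrt_llm import LLM, SamplingParams

# Illustrative checkpoint; any model supported by the LLM API works here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The capital of France is"]
sampling = SamplingParams(temperature=0.8, max_tokens=32)

# generate() returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```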
What's Changed
- chore: Unify Python NVTX call by @kaiyux in #3450
- doc: genai-perf benchmark & slurm multi-node for trtllm-serve doc by @LinPoly in #3407
- fix: disable KV cache reuse if using attention sink by @Funatiq in #3021
- doc: Minor fixes for documents by @kaiyux in #3577
- fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1 by @VALLIS-NERIA in #3585
- chore: Mass integration of release/0.18 by @dcampora in #3421
- fix: LLM API _hf_model_dir for non-cached case by @syuoni in #3562
- ci: waive test_llm_multi_node_pytorch by @Superjomn in #3592
- fix: amend trtllm-bench command in the test by @Superjomn in #3563
- feat: Add stream generation task scaffolding examples by @narutolhy in #3527
- chore: bump version to 0.20.0rc0 by @ZhanruiSunCh in #3561
- chore: Add comments to modifications that fix TP size of DeepSeek-V3/R1 when using more than 16 GPUs by @jinyangyuan-nvidia in #3572
- chore: waive test_llm_phi_quantization_1gpu by @QiJune in #3603
- feat: Support cos_sin_cache in all cases. by @yuxianq in #3517
- fix: add SM90 guard for FP8 Blockscale GEMM by @lucifer1004 in #3575
- infra: Update user list by @niukuo in #3614
- feat: Adding FP8 BMM from Codegen by @evezhier in #3541
- waive test_llm_multi_node_with_postproc by @QiJune in #3628
- fix: Use hmac authentication for pickle encryption by @yibinl-nvidia in #3384 (see the HMAC sketch after this list)
- Clean up linear.py, mlp.py, gated_mlp.py by @hlu1 in #3553
- feat: Support CUDA graphs for EAGLE3 by @mikeiovine in #3176 (see the capture/replay sketch after this list)
- feat: Nemotron-H model support by @vegaluisjose in #3430
- waive test_fp8_scaled_mm by @QiJune in #3637
- disable ib for ucx test by @chuangz0 in #3613
- tests: change qa perf test to trtllm-bench by @ruodil in #3189
- test: add quickstart test for nemotron-ultra by @crazydemo in #3596
- feat: Add support for smaller hidden_dim in AR fusion kernel by @yilin-void in #3609
- Fix rotary_emb param in NemotronH attention by @vegaluisjose in #3646
- chore: Use ellipsis as default value to detect whether residual argument is provided by @yuxianq in #3626
- feat/loraOp by @danielafrimi in #3455
- Cherry-pick: update fp8 doc (#3647) by @litaotju in #3650
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3627
- Waive L0 tests by @yiqingy0 in #3651
- feat: allocate minimal blocks per window size by @netanel-haber in #3028
- test: remove benchmark test list on main branch by @crazydemo in #3644
- feat: Support unfused rope in MLA. by @yuxianq in #3610 (see the RoPE sketch after this list)
- fix: Fix fused_moe cache fallback issue. by @hyukn in #3652
- fix: Correct reporting of text dtype for Llama 4. by @FrankD412 in #3494
- chore: waive test_llm_multi_node by @QiJune in #3664
- chore: update multi gpu trigger file list by @QiJune in #3665
- feat: adding multimodal (only image for now) support in trtllm-bench by @rakib-hasan in #3490
- fix sage attention headsize check error in bertAttentionPlugin.cpp by @Jackch-NV in #3660
- fix: llama4: address couple of issues in llama4 attention module by @chang-l in #3491
- chore: Refactor test_disaggregated.py by @Tabrizian in #3154
- test: Add llama 4 to ci by @dongfengy in #3520
- chore : Split more tests out of gpt tests by @peaceh-nv in #3524
- infra: Add step to generate new duration file by @EmmaQiaoCh in #3298
- refactor: Clean up CMakeLists.txt by @tongyuantongyu in #3479
- test: Unwaive test for nvbug_5150466 by @hchings in #3552
- feat: Add Dynasor-CoT in scaffolding examples by @Fsanic in #3501
- feat: Integrate GPUDirect Storage (GDS) into Executor API by @DomBrown in #3582
- Remove dummy forward path by @HuiGao-NV in #3669
- fix: hmac in remote mpi session by @Superjomn in #3649
- test: add kv cache event tests for disagg workers by @zhengd-nv in #3602
- chore: enable test_ptp_quickstart_advanced_mixed_precision back by @QiJune in #3667
- feat: Disaggregated router class by @pcastonguay in #3584
- Update run.py to run the draft-target model with LLaMa 3 1B/8B by @mayani-nv in #3615
- feat: trtllm-serve multimodal support by @yechank-nvidia in #3590
- chore: Waive disaggregated load balance by @Tabrizian in #3687
- Clean up modeling_deepseek.py by @hlu1 in #3640
- fix: Fix disaggregated load balance test by @Tabrizian in #3689
- feat: Introduce feature properties for attention backend. by @yuxianq in #3659
- test:update waives.txt for nvbug 5219532 by @nv-guomingz in #3672
- test: Get Eagle tests working by @brb-nv in #3593
- move the rest of the models into the `examples/models/core` directory by @QiJune in #3555
- fix: Refactor Deepseek tp_size calculation by @hlu1 in #3695
- Update Nemotron Super and Ultra in Supported Models and add an example by @Naveassaf in #3632
- infra: Add test list name check by @EmmaQiaoCh in #3097
- feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend by @hlu1 in #3387
- Waive L0 tests by @yiqingy0 in #3709
- fix: update test_user_buffers_mm_add_prologue atol (#3711) by @liji-nv in #3713
- fix: Support TLLM_OVERRIDE_LAYER_NUM for llama4. by @yuxianq in #3679
- Report number of context tokens in one iteration by @HuiGao-NV in #3691
- fix: Remove ParallelConfig. by @yuxianq in #3678
- feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode by @katec846 in #3380
- fix: fix cublas_scaled_mm by @dc3671 in #3600
- chore: update FMHA cubin files by @jinyangyuan-nvidia in #3680
- test: add llama3.2 ptp test case by @StanleySun639 in #3363
- bug: Fix hang bug when context server doesn't have enough capacity for KV Cache by @Tabrizian in #3095
- refactor: use pybind block key and hasher in disagg worker test by @zhengd-nv in #3712
- Fix: nvbugs/5232457 ModelOpt Mixtral AWQ OOM by @Barry-Delaney in #3714
- ci: unwaive multi_node test by @Superjomn in #3715
- Revert "Report number of context tokens in one iteration (#3691)" by @kaiyux in #3740
- Fix/executor bugs by @byshiue in #3681
- test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA by @syuoni in #3483
- datasets API change: `datasets.load_metric` => `evaluate.load` by @rakib-hasan in #3741 (see the migration sketch after this list)
- fix: Remove unnecessary max call by @kaiyux in #3574
- test: Unwaive Llama 3.1 with torch compile test by @yizhang-nv in #3475
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3683
- refactor: Introduce DecoderOutputBuffers per batch by @Funatiq in #3506
- fix: fnmatch usage in modeling_utils.py (https://nvbugspro.nvidia.com/bug/5234567) by @syuoni in #3754
- test: change default max_batch_size to 512 in test config and dump config.json to log by @ruodil in #3657
- test: waive gemma on L20 by @crazydemo in #3767
- chore: remove useless allgather by @byshiue in #3751
- feat: Unify two versions of allreduce custom op by @hyukn in #3032
- doc: Update doc for Deepseek min latency by @zongfeijing in #3717
- Add log_level for disaggregated_mpi_worker by @qiaoxj07 in #3765
- infra: Add test stages for sm120 by @EmmaQiaoCh in #3533
- feat: [AutoDeploy] generalizing cudagraph to multiple dynamic inputs by @lucaslie in #3589
- chore: Move opencv-python import inside load_video() function by @AlessioNetti in #3768
- Fixing the metric fmeasure access by @rakib-hasan in #3774
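For the HMAC change (#3384), the general technique is to attach an HMAC tag to each pickled payload and verify it before unpickling, so a peer cannot inject arbitrary objects. A minimal sketch of that pattern; the key handling and 32-byte framing here are illustrative, not TensorRT-LLM's actual wire format:

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"shared-secret"  # illustrative; real code distributes the key securely

def sign_and_pickle(obj) -> bytes:
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload  # 32-byte SHA-256 tag, then the payload

def verify_and_unpickle(blob: bytes):
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # Constant-time comparison to avoid timing side channels.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: refusing to unpickle untrusted data")
    return pickle.loads(payload)

blob = sign_and_pickle({"step": 1})
print(verify_and_unpickle(blob))
```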
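CUDA-graph support for EAGLE3 (#3176) builds on the generic capture/replay mechanism; the sketch below shows that mechanism with a toy PyTorch model, not the EAGLE3 integration itself:

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
static_in = torch.zeros(8, 16, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph; replay re-runs the recorded
# kernels on whatever data is in the static input buffer.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

static_in.copy_(torch.randn(8, 16, device="cuda"))
graph.replay()
print(static_out.sum().item())
```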
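The unfused-RoPE path (#3610) applies the rotary embedding outside the fused attention kernel, and the cos/sin table it rotates by is what a cos_sin_cache (#3517) precomputes. A minimal PyTorch sketch of the standard rotation; the pairing convention and shapes are illustrative:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: [seq_len, head_dim]; rotate each even/odd pair by pos * theta_i.
    seq_len, head_dim = x.shape
    inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()  # this table is what a cos_sin_cache stores
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8, 64)
print(rope(q).shape)  # torch.Size([8, 64])
```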
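The datasets API change (#3741) tracks the upstream removal of `datasets.load_metric` in datasets 3.x: metrics now live in the separate evaluate package (the related #3774 adjusts how the ROUGE f-measure is read, since newer evaluate versions return plain floats). A minimal before/after:

```python
# Before (datasets < 3.0):
#   from datasets import load_metric
#   rouge = load_metric("rouge")

# After (datasets >= 3.0); requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=["hello there"], references=["hello there"])
print(scores["rougeL"])  # a plain float in recent evaluate versions
```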
New Contributors
- @dcampora made their first contribution in #3421
- @narutolhy made their first contribution in #3527
- @evezhier made their first contribution in #3541
- @vegaluisjose made their first contribution in #3430
- @rakib-hasan made their first contribution in #3490
- @Jackch-NV made their first contribution in #3660
- @dongfengy made their first contribution in #3520
- @Fsanic made their first contribution in #3501
- @mayani-nv made their first contribution in #3615
- @Naveassaf made their first contribution in #3632
- @katec846 made their first contribution in #3380
- @StanleySun639 made their first contribution in #3363
- @Barry-Delaney made their first contribution in #3714
- @yizhang-nv made their first contribution in #3475
- @qiaoxj07 made their first contribution in #3765
- @AlessioNetti made their first contribution in #3768
Full Changelog: v0.19.0rc0...v0.20.0rc0