v0.19.0rc0
Pre-release
- Model Support
- Features
  - Added FP8 support for the SM120 architecture (#3248)
  - Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options (#3343)
  - Made the scaffolding `Controller` more generic (#3416)
  - Breaking change: Added individual `gatherContext` support for each additional output (#3374)
  - Added trtllm-gen FP4 GEMM for the PyTorch workflow (#3423)
  - Added Qwen2 MoE support for the PyTorch flow (#3369)
  - Enabled the `PyExecutor` inference flow to estimate `max_num_tokens` for `kv_cache_manager` (#3092)
  - Added `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging (#3417)
  - Applied the PyTorch-workflow-compatible `AutoTuner` to both the Fused MoE and NVFP4 Linear operators (#3151)
  - Introduced a `UserBuffers` allocator for the PyTorch flow (#3257)
  - Supported aborting disconnected requests (#3214)
  - Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs (#3190)
  - Added an option to run disaggregated serving without context servers (#3243)
  - Enhanced RoPE support in AutoDeploy (#3115)
  - Fixed and improved allreduce and fusion kernels (#3064)
  - Added DeepSeek-V3 support in AutoDeploy (#3281)
  - Enhanced the integrated robustness of scaffolding via `__init__.py` (#3312)
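The new build options from #3343 and debugging environment variables from #3417 can be combined into a debug workflow. A minimal sketch, assuming the variable semantics suggested by their names (the values shown are illustrative assumptions, not documented defaults):

```shell
# Build-time: the CMake options registered in #3343 can be toggled at
# configure time (shown as a comment; adapt paths to your checkout):
#   cmake -DENABLE_MULTI_DEVICE=ON -DENABLE_UCX=ON ..

# Run-time: debugging variables added in #3417.
# Assumption: TLLM_OVERRIDE_LAYER_NUM caps the number of layers executed,
# useful for quickly reproducing issues on a truncated model.
export TLLM_OVERRIDE_LAYER_NUM=4

# Assumption: a truthy value enables tracing of model forward passes.
export TLLM_TRACE_MODEL_FORWARD=1
```

Consult the linked PRs for the authoritative semantics of each variable before relying on them.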
- API
- Bug fixes
  - Fixed a wrong import of `KvCacheConfig` in `examples/gpqa_llmapi.py` (#3369)
  - Fixed the test name (#3534)
  - Fixed `max_seq_len` in `executor_config` (#3487)
  - Removed a duplicated line of code (#3523)
  - Disabled KV cache reuse for the prompt tuning test (#3474)
  - Fixed the issue of a first generation token being returned twice in streaming (#3427)
  - Added KV memory size per token calculation in the draft model (#3497)
  - Switched ZMQ from a file socket to a TCP socket in `RemoteMpiCommSession` (#3462)
  - Fixed pipeline parallelism (PP) for Llama (#3449)
  - Updated the default `excluded_modules` value for the FP8 rowwise recipe (#3477)
  - Fixed disaggregation MTP with overlap (#3406)
  - Stopped memory estimation in `start_attention` (#3485)
  - Allowed the `context_and_generation` request type in disaggregated overlap (#3489)
  - Fixed the partial match issue (#3413)
  - Fixed Eagle decoding (#3456)
  - Fixed the `py_decoding_iter` update in the decoder (#3297)
  - Fixed the beam search diversity issue (#3375)
  - Updated ucxx to avoid occasional segfaults when profiling (#3420)
  - Fixed ReDrafter sampling (#3278)
  - Fixed the mllama end-to-end PyTorch flow (#3397)
  - Reverted an extra CMake variable (#3351)
  - Fixed issues with the fused MoE path (#3435)
  - Fixed conflicting test names (#3316)
  - Fixed failing DeepSeek-V3 unit tests (#3385)
  - Fixed missing bias addition for `FP4Linear` (#3361)
  - Fixed the runtime error in `test_deepseek_allreduce.py` (#3226)
  - Fixed speculative decoding and multimodal input support (#3276)
  - Fixed PyTorch nvsmall via `PyExecutor` and improved TP support (#3238)
  - Fixed the p-tuning test bug (#3326)
- Performance
  - Cached sin and cos in the model instead of using a global LRU cache (#3378)
  - Deallocated tensors after use in MLA (#3286)
  - Enabled DeepGEMM by default (#3341)
  - Added a thread leak check and fixed thread/memory leak issues (#3270)
  - Used `cudaMalloc` to allocate the KV cache (#3303)
  - Breaking change: Made `ipc_periodically` the default `responses_handler` (#3102)
  - Used NVRTC for DeepGEMM JIT compilation (#3239)
  - Optimized quantization kernels used in DeepSeek on Hopper (#3466)
- Documentation
  - Added an example section for the multi-node DeepSeek R1 benchmark on GB200 (#3519)
  - Documented disaggregation performance tuning (#3516)
  - Updated the perf-benchmarking documentation for GPU configuration (#3458)
  - Updated the README and added a benchmarking blog for DeepSeek-R1 (#3232)
  - Updated the documentation for using Draft-Target-Model (DTM) (#3366)
  - Updated the README for disaggregated serving (#3323)
  - Updated instructions to enable FP8 MLA for DeepSeek (#3488)
Full change log: 5aeef6d...258ae9c.