v0.19.0rc0
Pre-release
- Model Support
- Features
  - Added FP8 support for the SM120 architecture (#3248)
  - Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options (#3343)
  - Made the scaffolding `Controller` more generic (#3416)
  - Breaking change: Added individual `gatherContext` support for each additional output (#3374)
  - Added trtllm-gen FP4 GEMM for the PyTorch workflow (#3423)
  - Added Qwen2 MoE support for the PyTorch flow (#3369)
  - Enabled the `PyExecutor` inference flow to estimate `max_num_tokens` for `kv_cache_manager` (#3092)
  - Added `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging (#3417)
  - Applied the PyTorch-workflow-compatible `AutoTuner` to both the Fused MoE and NVFP4 Linear operators (#3151)
  - Introduced a `UserBuffers` allocator for the PyTorch flow (#3257)
  - Supported aborting disconnected requests (#3214)
  - Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs (#3190)
  - Added an option to run disaggregated serving without context servers (#3243)
  - Enhanced RoPE support in AutoDeploy (#3115)
  - Fixed and improved allreduce and fusion kernels (#3064)
  - Added DeepSeek-V3 support in AutoDeploy (#3281)
  - Enhanced the integrated robustness of scaffolding via `__init__.py` (#3312)
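The new build options from #3343 and debugging environment variables from #3417 can be combined into a debug workflow. A minimal sketch, assuming the variable semantics suggested by their names (the values shown are illustrative assumptions, not documented defaults):

```shell
# Build-time: the CMake options registered in #3343 can be toggled at
# configure time (shown as a comment; adapt paths to your checkout):
#   cmake -DENABLE_MULTI_DEVICE=ON -DENABLE_UCX=ON ..

# Run-time: debugging variables added in #3417.
# Assumption: TLLM_OVERRIDE_LAYER_NUM caps the number of layers executed,
# useful for quickly reproducing issues on a truncated model.
export TLLM_OVERRIDE_LAYER_NUM=4

# Assumption: a truthy value enables tracing of model forward passes.
export TLLM_TRACE_MODEL_FORWARD=1
```

Consult the linked PRs for the authoritative semantics of each variable before relying on them.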
- API
- Bug fixes
  - Fixed a wrong import of `KvCacheConfig` in `examples/gpqa_llmapi.py` (#3369)
  - Fixed the test name (#3534)
  - Fixed `max_seq_len` in `executor_config` (#3487)
  - Removed a duplicated line of code (#3523)
  - Disabled KV cache reuse for the prompt tuning test (#3474)
  - Fixed the issue of a first generation token being returned twice in streaming (#3427)
  - Added KV memory size per token calculation in the draft model (#3497)
  - Switched ZMQ from a file socket to a TCP socket in `RemoteMpiCommSession` (#3462)
  - Fixed pipeline parallelism (PP) for Llama (#3449)
  - Updated the default `excluded_modules` value for the FP8 rowwise recipe (#3477)
  - Fixed disaggregation MTP with overlap (#3406)
  - Stopped memory estimation in `start_attention` (#3485)
  - Allowed the `context_and_generation` request type in disaggregated overlap (#3489)
  - Fixed the partial match issue (#3413)
  - Fixed Eagle decoding (#3456)
  - Fixed the `py_decoding_iter` update in the decoder (#3297)
  - Fixed the beam search diversity issue (#3375)
  - Updated ucxx to avoid occasional segfaults when profiling (#3420)
  - Fixed ReDrafter sampling (#3278)
  - Fixed the mllama end-to-end PyTorch flow (#3397)
  - Reverted an extra CMake variable (#3351)
  - Fixed issues with the fused MoE path (#3435)
  - Fixed conflicting test names (#3316)
  - Fixed failing DeepSeek-V3 unit tests (#3385)
  - Fixed missing bias addition for `FP4Linear` (#3361)
  - Fixed the runtime error in `test_deepseek_allreduce.py` (#3226)
  - Fixed speculative decoding and multimodal input support (#3276)
  - Fixed PyTorch nvsmall via `PyExecutor` and improved TP support (#3238)
  - Fixed the p-tuning test bug (#3326)
- Performance
  - Cached sin and cos in the model instead of using a global LRU cache (#3378)
  - Deallocated tensors after use in MLA (#3286)
  - Enabled DeepGEMM by default (#3341)
  - Added a thread leak check and fixed thread/memory leak issues (#3270)
  - Used `cudaMalloc` to allocate the KV cache (#3303)
  - Breaking change: Made `ipc_periodically` the default `responses_handler` (#3102)
  - Used NVRTC for DeepGEMM JIT compilation (#3239)
  - Optimized quantization kernels used in DeepSeek on Hopper (#3466)
- Documentation
  - Added an example section for the multi-node DeepSeek R1 benchmark on GB200 (#3519)
  - Documented disaggregation performance tuning (#3516)
  - Updated the perf-benchmarking documentation for GPU configuration (#3458)
  - Updated the README and added a benchmarking blog for DeepSeek-R1 (#3232)
  - Updated the documentation for using Draft-Target-Model (DTM) (#3366)
  - Updated the README for disaggregated serving (#3323)
  - Updated instructions to enable FP8 MLA for DeepSeek (#3488)
Full change log: 5aeef6d...258ae9c.