
v0.19.0rc0

Pre-release
@kaiyux released this 18 Apr 23:19
· 147 commits to main since this release
258ae9c
  • Model Support
    • Added Llama 4 support (#3302)
    • Added support for Phi‑4‑MM (#3296)
    • Added Gemma3 text‑only model support; refer to the "Run Gemma 3" section in examples/gemma/README.md (#3247)
    • Added Qwen2.5‑VL support for PyTorch workflow and refactored Qwen2‑VL (#3156)
  • Features
    • Added FP8 support for the SM120 architecture (#3248)
    • Registered ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options (#3343)
    • Made the scaffolding Controller more generic (#3416)
    • Breaking change: Added individual gatherContext support for each additional output (#3374)
    • Added trtllm‑gen FP4 GEMM for the PyTorch workflow (#3423)
    • Added Qwen2 MoE support for PyTorch flow (#3369)
    • Enabled PyExecutor inference flow to estimate max_num_tokens for kv_cache_manager (#3092)
    • Added TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD environment variables for debugging (#3417)
    • Applied the PyTorch workflow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators (#3151)
    • Introduced a UserBuffers allocator for PyTorch flow (#3257)
    • Supported aborting disconnected requests (#3214)
    • Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs (#3190)
    • Added an option to run disaggregated serving without context servers (#3243)
    • Enhanced RoPE support in AutoDeploy (#3115)
    • Fixed and improved allreduce and fusion kernels (#3064)
    • Added DeepSeek-V3 support in AutoDeploy (#3281)
    • Improved the robustness of scaffolding integration via __init__.py (#3312)
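The debugging environment variables added in #3417 would typically be set in the shell before launching an inference run. A minimal sketch, assuming on/off and layer-count semantics; the variable names come from these notes, but the example values and their exact meaning are assumptions, not documented behavior:

```shell
# Sketch only: values and semantics below are assumptions, not documented defaults.
export TLLM_OVERRIDE_LAYER_NUM=2    # assumed: cap the model at N layers for faster debug iterations
export TLLM_TRACE_MODEL_FORWARD=1   # assumed: enable tracing of the model forward pass
```

Unsetting both variables restores normal behavior.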
  • API
    • Added numNodes to ParallelConfig (#3346)
    • Redesigned the multi‑stream API for DeepSeek (#3459)
  • Bug fixes
    • Fixed an incorrect import of KvCacheConfig in examples/gpqa_llmapi.py (#3369)
    • Fixed the test name (#3534)
    • Fixed max_seq_len in executor_config (#3487)
    • Removed a duplicated line of code (#3523)
    • Disabled kv cache reuse for the prompt tuning test (#3474)
    • Fixed the issue of a first‑generation token being returned twice in streaming (#3427)
    • Added kv memory size per token calculation in the draft model (#3497)
    • Switched ZMQ from a file socket to a TCP socket in RemoteMpiCommSession (#3462)
    • Fixed PP for Llama (#3449)
    • Updated the default excluded_modules value for the fp8rowwise recipe (#3477)
    • Fixed disaggregation MTP with overlap (#3406)
    • Stopped memory estimation in start_attention (#3485)
    • Allowed the context_and_generation request type in disaggregated overlap (#3489)
    • Fixed the partial match issue (#3413)
    • Fixed Eagle decoding (#3456)
    • Fixed the py_decoding_iter update in the decoder (#3297)
    • Fixed the beam search diversity issue (#3375)
    • Updated ucxx to avoid occasional segfaults when profiling (#3420)
    • Fixed redrafter sampling (#3278)
    • Fixed mllama end‑to‑end PyTorch flow (#3397)
    • Reverted an extra CMake variable (#3351)
    • Fixed issues with the fused MoE path (#3435)
    • Fixed conflicting test names (#3316)
    • Fixed failing DeepSeek-V3 unit tests (#3385)
    • Fixed missing bias addition for FP4Linear (#3361)
    • Fixed the runtime error in test_deepseek_allreduce.py (#3226)
    • Fixed speculative decoding and multimodal input support (#3276)
    • Fixed PyTorch nvsmall via PyExecutor and improved TP support (#3238)
    • Fixed the p‑tuning test bug (#3326)
  • Performance
    • Cached sin and cos in the model instead of using a global LRU cache (#3378)
    • Deallocated tensors after use in MLA (#3286)
    • Enabled DeepGEMM by default (#3341)
    • Added a thread leak check and fixed thread/memory leak issues (#3270)
    • Used cudaMalloc to allocate kvCache (#3303)
    • Breaking change: Made ipc_periodically the default responses_handler (#3102)
    • Used NVRTC for DeepGEMM JIT compilation (#3239)
    • Optimized quantization kernels used in DeepSeek on Hopper (#3466)
  • Documentation
    • Added an example section for the multi‑node DeepSeek R1 benchmark on GB200 (#3519)
    • Documented disaggregation performance tuning (#3516)
    • Updated the perf‑benchmarking documentation for GPU configuration (#3458)
    • Updated the README and added a benchmarking blog for DeepSeek‑R1 (#3232)
    • Updated the documentation for using Draft‑Target‑Model (DTM) (#3366)
    • Updated the README for disaggregated serving (#3323)
    • Updated instructions for enabling FP8 MLA for DeepSeek (#3488)

Full Changelog: 5aeef6d...258ae9c