Releases: NVIDIA/TensorRT-LLM
v0.20.0rc0
Highlights
- Model Support
- Features
- Added stream generation task scaffolding examples (#3527)
- Added unfused RoPE support in MLA (#3610)
- Multimodal models
- [Experimental] The TensorRT-LLM Triton backend now supports the LLM API (triton-inference-server/tensorrtllm_backend#742)
- Performance
- Optimized Large Embedding Tables in Multimodal Models (#3380)
- Infra
- Dependent `datasets` version was upgraded to 3.1.0 (#3490)
What's Changed
- chore: Unify Python NVTX call by @kaiyux in #3450
- doc: genai-perf benchmark & slurm multi-node for trtllm-serve doc by @LinPoly in #3407
- fix: disable KV cache reuse if using attention sink by @Funatiq in #3021
- doc: Minor fixes for documents by @kaiyux in #3577
- fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1 by @VALLIS-NERIA in #3585
- chore: Mass integration of release/0.18 by @dcampora in #3421
- fix: LLM API _hf_model_dir for non-cached case by @syuoni in #3562
- ci: waive test_llm_multi_node_pytorch by @Superjomn in #3592
- fix: amend trtllm-bench command in the test by @Superjomn in #3563
- feat: Add stream generation task scaffolding examples by @narutolhy in #3527
- chore: bump version to 0.20.0rc0 by @ZhanruiSunCh in #3561
- chore: Add comments to modifications that fix TP size of DeepSeek-V3/R1 when using more than 16 GPUs by @jinyangyuan-nvidia in #3572
- chore: waive test_llm_phi_quantization_1gpu by @QiJune in #3603
- feat: Support cos_sin_cache in all cases. by @yuxianq in #3517
- fix: add SM90 guard for FP8 Blockscale GEMM by @lucifer1004 in #3575
- infra: Update user list by @niukuo in #3614
- feat: Adding FP8 BMM from Codegen by @evezhier in #3541
- waive test_llm_multi_node_with_postproc by @QiJune in #3628
- fix: Use hmac authentication for pickle encryption by @yibinl-nvidia in #3384
- Clean up linear.py, mlp.py, gated_mlp.py by @hlu1 in #3553
- feat: Support CUDA graphs for EAGLE3 by @mikeiovine in #3176
- feat: Nemotron-H model support by @vegaluisjose in #3430
- waive test_fp8_scaled_mm by @QiJune in #3637
- disable ib for ucx test by @chuangz0 in #3613
- tests: change qa perf test to trtllm-bench by @ruodil in #3189
- test: add quickstart test for nemotron-ultra by @crazydemo in #3596
- feat: Add support for smaller hidden_dim in AR fusion kernel by @yilin-void in #3609
- Fix rotary_emb param in NemotronH attention by @vegaluisjose in #3646
- chore: Use ellipsis as default value to detect whether residual argument is provided by @yuxianq in #3626
- feat/loraOp by @danielafrimi in #3455
- Cherry-pick: update fp8 doc (#3647) by @litaotju in #3650
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #3627
- Waive L0 tests by @yiqingy0 in #3651
- feat: allocate minimal blocks per window size by @netanel-haber in #3028
- test: remove benchmark test list on main branch by @crazydemo in #3644
- feat: Support unfused rope in MLA. by @yuxianq in #3610
- fix: Fix fused_moe cache fallback issue. by @hyukn in #3652
- fix: Correct reporting of text dtype for Llama 4. by @FrankD412 in #3494
- chore: waive test_llm_multi_node by @QiJune in #3664
- chore: update multi gpu trigger file list by @QiJune in #3665
- feat: adding multimodal (only image for now) support in trtllm-bench by @rakib-hasan in #3490
- fix sage attention headsize check error in bertAttentionPlugin.cpp by @Jackch-NV in #3660
- fix: llama4: address couple of issues in llama4 attention module by @chang-l in #3491
- chore: Refactor test_disaggregated.py by @Tabrizian in #3154
- test: Add llama 4 to ci by @dongfengy in #3520
- chore : Split more tests out of gpt tests by @peaceh-nv in #3524
- infra: Add step to generate new duration file by @EmmaQiaoCh in #3298
- refactor: Clean up CMakeLists.txt by @tongyuantongyu in #3479
- test: Unwaive test for nvbug_5150466 by @hchings in #3552
- feat: Add Dynasor-CoT in scaffolding examples by @Fsanic in #3501
- feat: Integrate GPUDirect Storage (GDS) into Executor API by @DomBrown in #3582
- Remove dummy forward path by @HuiGao-NV in #3669
- fix: hmac in remote mpi session by @Superjomn in #3649
- test: add kv cache event tests for disagg workers by @zhengd-nv in #3602
- chore: enable test_ptp_quickstart_advanced_mixed_precision back by @QiJune in #3667
- feat: Disaggregated router class by @pcastonguay in #3584
- Updating the run.py to make the draft target model run with the LLaMa 3 1B/8B by @mayani-nv in #3615
- feat: trtllm-serve multimodal support by @yechank-nvidia in #3590
- chore: Waive disaggregated load balance by @Tabrizian in #3687
- Clean up modeling_deepseek.py by @hlu1 in #3640
- fix: Fix disaggregated load balance test by @Tabrizian in #3689
- feat: Introduce feature properties for attention backend. by @yuxianq in #3659
- test:update waives.txt for nvbug 5219532 by @nv-guomingz in #3672
- test: Get Eagle tests working by @brb-nv in #3593
- move the reset models into `examples/models/core` directory by @QiJune in #3555
- fix: Refactor Deepseek tp_size calculation by @hlu1 in #3695
- Update Nemotron Super and Ultra in Supported Models and add an example by @Naveassaf in #3632
- infra: Add test list name check by @EmmaQiaoCh in #3097
- feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend by @hlu1 in #3387
- Waive L0 tests by @yiqingy0 in #3709
- fix: update test_user_buffers_mm_add_prologue atol (#3711) by @liji-nv in #3713
- fix: Support TLLM_OVERRIDE_LAYER_NUM for llama4. by @yuxianq in #3679
- Report number of context tokens in one iteration by @HuiGao-NV in #3691
- fix: Remove ParallelConfig. by @yuxianq in #3678
- feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode by @katec846 in #3380
- fix: fix cublas_scaled_mm by @dc3671 in #3600
- chore: update FMHA cubin files by @jinyangyuan-nvidia in #3680
- test: add llama3.2 ptp test case by @StanleySun639 in #3363
- bug: Fix hang bug when context server doesn't have enough capacity for KV Cache by @Tabrizian in #3095
- refact: use pybind block key and hasher in disagg worker test by @zhengd-nv in #3712
- Fix: nvbugs/5232457 ModelOpt Mixtral AWQ OOM by @Barry-Delaney in #3714
- ci: unwaive multi_node test by @Superjomn in https://github.com/...
v0.19.0rc0
- Model Support
- Features
- Added FP8 support for SM120 architecture (#3248)
- Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options (#3343)
- Made the scaffolding Controller more generic (#3416)
- Breaking change: Added individual gatherContext support for each additional output (#3374)
- Added trtllm‑gen FP4 GEMM for the PyTorch workflow (#3423)
- Added Qwen2 MoE support for PyTorch flow (#3369)
- Enabled the `PyExecutor` inference flow to estimate `max_num_tokens` for `kv_cache_manager` (#3092)
- Added `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging (#3417)
- Applied the PyTorch workflow compatible `AutoTuner` to both Fused MoE and NVFP4 Linear operators (#3151)
- Introduced a `UserBuffers` allocator for the PyTorch flow (#3257)
- Supported aborting disconnected requests (#3214)
- Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs (#3190)
- Added an option to run disaggregated serving without context servers (#3243)
- Enhanced RoPE support in AutoDeploy (#3115)
- Fixed and improved allreduce and fusion kernels (#3064)
- Added DeepSeek-V3 support in AutoDeploy (#3281)
- Enhanced the integration robustness of scaffolding via `__init__.py` (#3312)
- API
- Bug fixes
- Fixed a wrong import of `KvCacheConfig` in `examples/gpqa_llmapi.py` (#3369)
- Fixed the test name (#3534)
- Fixed `max_seq_len` in `executor_config` (#3487)
- Removed a duplicated line of code (#3523)
- Disabled kv cache reuse for the prompt tuning test (#3474)
- Fixed the issue of a first‑generation token being returned twice in streaming (#3427)
- Added kv memory size per token calculation in the draft model (#3497)
- Switched ZMQ from a file socket to a TCP socket in RemoteMpiCommSession (#3462)
- Fixed PP for Llama (#3449)
- Updated the default excluded_modules value for the fp8rowwise recipe (#3477)
- Fixed disaggregation MTP with overlap (#3406)
- Stopped memory estimation in start_attention (#3485)
- Allowed the `context_and_generation` request type in disaggregated overlap (#3489)
- Fixed the partial match issue (#3413)
- Fixed Eagle decoding (#3456)
- Fixed the `py_decoding_iter` update in the decoder (#3297)
- Fixed the beam search diversity issue (#3375)
- Updated ucxx to avoid occasional segfaults when profiling (#3420)
- Fixed redrafter sampling (#3278)
- Fixed mllama end‑to‑end PyTorch flow (#3397)
- Reverted an extra CMake variable (#3351)
- Fixed issues with the fused MoE path (#3435)
- Fixed conflicting test names (#3316)
- Fixed failing DeepSeek-V3 unit tests (#3385)
- Fixed missing bias addition for `FP4Linear` (#3361)
- Fixed the runtime error in `test_deepseek_allreduce.py` (#3226)
- Fixed speculative decoding and multimodal input support (#3276)
- Fixed PyTorch nvsmall via `PyExecutor` and improved TP support (#3238)
- Fixed the p-tuning test bug (#3326)
- Performance
- Cached sin and cos in the model instead of using a global LRU cache (#3378)
- Deallocated tensors after use in MLA (#3286)
- Enabled DeepGEMM by default (#3341)
- Added a thread leak check and fixed thread/memory leak issues (#3270)
- Used cudaMalloc to allocate kvCache (#3303)
- Made ipc_periodically the default responses_handler (breaking change) (#3102)
- Used NVRTC for DeepGEMM JIT compilation (#3239)
- Optimized quantization kernels used in DeepSeek on Hopper (#3466)
- Documentation
- Added an example section for the multi‑node DeepSeek R1 benchmark on GB200 (#3519)
- Documented disaggregation performance tuning (#3516)
- Updated the perf‑benchmarking documentation for GPU configuration (#3458)
- Updated the README and added a benchmarking blog for DeepSeek‑R1 (#3232)
- Updated the documentation for using Draft‑Target‑Model (DTM) (#3366)
- Updated the README for disaggregated serving (#3323)
- Updated instructions to enable FP8 MLA for Deepseek. (#3488)
Full change log: 5aeef6d...258ae9c.
TensorRT-LLM Release 0.18.2
Key Features and Enhancements
- This update addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit https://www.nvidia.com/en-us/security/.
TensorRT-LLM Release 0.18.1
Key Features and Enhancements
- The 0.18.x series of releases builds upon the 0.17.0 release, focusing exclusively on dependency updates without incorporating features from the previous 0.18.0.dev pre-releases. These features will be included in future stable releases.
Infrastructure Changes
- The dependent `transformers` package version is updated to 4.48.3.
TensorRT-LLM Release 0.18.0
Hi,
We are very pleased to announce the 0.18.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Features that were previously available in the 0.18.0.dev pre-releases are not included in this release.
- [BREAKING CHANGE] Windows platform support is deprecated as of v0.18.0. All Windows-related code and functionality will be completely removed in future releases.
Known Issues
- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the PyTorch NGC Container for optimal support on SBSA platforms.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.03-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.03-py3`.
- The dependent TensorRT version is updated to 10.9.
- The dependent CUDA version is updated to 12.8.1.
- The dependent NVIDIA ModelOpt version is updated to 0.25 for Linux platform.
TensorRT-LLM Release 0.17.0
Hi,
We are very pleased to announce the 0.17.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Blackwell support
- NOTE: pip installation is not supported for TRT-LLM 0.17 on Blackwell platforms only. Instead, it is recommended that users build from source using the NVIDIA NGC 25.01 PyTorch container.
- Added support for B200.
- Added support for GeForce RTX 50 series using Windows Subsystem for Linux (WSL) for limited models.
- Added NVFP4 Gemm support for Llama and Mixtral models.
- Added NVFP4 support for the `LLM` API and `trtllm-bench` command.
- GB200 NVL is not fully supported.
- Added a benchmark script to measure the performance benefits of KV cache host offload, with expected runtime improvements from GH200.
- PyTorch workflow
- The PyTorch workflow is an experimental feature in `tensorrt_llm._torch`. The following is a list of supported infrastructure, models, and features that can be used with the PyTorch workflow.
- Added support for H100/H200/B200.
- Added support for Llama models, Mixtral, QWen, Vila.
- Added support for FP16/BF16/FP8/NVFP4 Gemm and fused Mixture-Of-Experts (MOE), FP16/BF16/FP8 KVCache.
- Added custom context and decoding attention kernels support via PyTorch custom op.
- Added support for chunked context (default off).
- Added CudaGraph support for decoding only.
- Added overlap scheduler support to overlap prepare inputs and model forward by decoding 1 extra token.
- Added FP8 context FMHA support for the W4A8 quantization workflow.
- Added ModelOpt quantized checkpoint support for the `LLM` API.
- Added FP8 support for the Llama-3.2 VLM model. Refer to the “MLLaMA” section in `examples/multimodal/README.md`.
- Added PDL support for the `userbuffer` based AllReduce-Norm fusion kernel.
- Added runtime support for seamless lookahead decoding.
- Added token-aligned arbitrary output tensors support for the C++ `executor` API.
API Changes
- [BREAKING CHANGE] KV cache reuse is enabled automatically when `paged_context_fmha` is enabled.
- Added `--concurrency` support for the `throughput` subcommand of `trtllm-bench`.
Fixed Issues
- Fixed incorrect LoRA output dimension. Thanks for the contribution from @akhoroshev in #2484.
- Added NVIDIA H200 GPU into the `cluster_key` for the auto parallelism feature. (#2552)
- Fixed a typo in the `__post_init__` function of the `LlmArgs` class. Thanks for the contribution from @topenkoff in #2691.
- Fixed a workspace size issue in the GPT attention plugin. Thanks for the contribution from @AIDC-AI.
- Fixed Deepseek-V2 model accuracy.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.01-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.01-py3`.
- The dependent TensorRT version is updated to 10.8.0.
- The dependent CUDA version is updated to 12.8.0.
- The dependent ModelOpt version is updated to 0.23 for Linux platform, while 0.17 is still used on Windows platform.
Known Issues
- Need `--extra-index-url https://pypi.nvidia.com` when running `pip install tensorrt-llm` due to new third-party dependencies.
- The PyPI SBSA wheel is incompatible with PyTorch 2.5.1 due to a break in the PyTorch ABI/API, as detailed in the related GitHub issue.
TensorRT-LLM Release 0.16.0
Hi,
We are very pleased to announce the 0.16.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added guided decoding support with XGrammar backend.
- Added quantization support for RecurrentGemma. Refer to `examples/recurrentgemma/README.md`.
- Added ulysses context parallel support. Refer to an example on building LLaMA 7B using 2-way tensor parallelism and 2-way context parallelism at `examples/llama/README.md`.
- Added W4A8 quantization support to BF16 models on Ada (SM89).
- Added PDL support for the FP8 GEMM plugins.
- Added a runtime `max_num_tokens` dynamic tuning feature, which can be enabled by passing `--enable_max_num_tokens_tuning` to `gptManagerBenchmark`.
- Added typical acceptance support for EAGLE.
- Supported chunked context and sliding window attention to be enabled together.
- Added head size 64 support for the XQA kernel.
- Added the following features to the LLM API:
  - Lookahead decoding.
  - DeepSeek V1 support.
  - Medusa support.
  - `max_num_tokens` and `max_batch_size` arguments to control the runtime parameters.
  - `extended_runtime_perf_knob_config` to enable various performance configurations.
- Added LogN scaling support for Qwen models.
- Added `AutoAWQ` checkpoints support for Qwen. Refer to the “INT4-AWQ” section in `examples/qwen/README.md`.
- Added `AutoAWQ` and `AutoGPTQ` Hugging Face checkpoints support for LLaMA. (#2458)
- Added `allottedTimeMs` to the C++ `Request` class to support per-request timeout.
- [BREAKING CHANGE] Removed NVIDIA V100 GPU support.
API Changes
- [BREAKING CHANGE] Removed the `enable_xqa` argument from `trtllm-build`.
- [BREAKING CHANGE] Chunked context is enabled by default when KV cache and paged context FMHA are enabled on non-RNN based models.
- [BREAKING CHANGE] Enabled embedding sharing automatically when possible and removed the `--use_embedding_sharing` flag from the checkpoint conversion scripts.
- [BREAKING CHANGE] The `if __name__ == "__main__"` entry point is required for both single-GPU and multi-GPU cases when using the `LLM` API (see the sketch after this list).
- [BREAKING CHANGE] Cancelled requests now return empty results.
- Added the `enable_chunked_prefill` flag to the `LlmArgs` of the `LLM` API.
- Integrated BERT and RoBERTa models to the `trtllm-build` command.
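A minimal sketch of the guarded entry point together with the new `enable_chunked_prefill` flag. The model id is a placeholder, `enable_chunked_prefill` is assumed to be forwarded to `LlmArgs` through the `LLM` constructor kwargs, and sampling parameter names follow the current LLM API (they may differ slightly in older releases):

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Placeholder model id; any local or Hugging Face model path works the same way.
    # enable_chunked_prefill is assumed to be passed through to LlmArgs.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              enable_chunked_prefill=True)
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    for output in outputs:
        print(output.outputs[0].text)


# Required for both single-GPU and multi-GPU runs: the LLM API may spawn worker
# processes that re-import this module, so top-level work must be guarded.
if __name__ == "__main__":
    main()
```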
Model Updates
- Added Qwen2-VL support. Refer to the “Qwen2-VL” section of `examples/multimodal/README.md`.
- Added multimodal evaluation examples. Refer to `examples/multimodal`.
- Added Stable Diffusion XL support. Refer to `examples/sdxl/README.md`. Thanks for the contribution from @Zars19 in #1514.
Fixed Issues
- Fixed unnecessary batch logits post processor calls. (#2439)
- Fixed a typo in the error message. (#2473)
- Fixed the in-place clamp operation usage in smooth quant. Thanks for the contribution from @StarrickLiu in #2485.
- Fixed `sampling_params` to only be set up if `end_id` is None and `tokenizer` is not None in the `LLM` API. Thanks to the contribution from @mfuntowicz in #2573.
Infrastructure Changes
- Updated the base Docker image for TensorRT-LLM to `nvcr.io/nvidia/pytorch:24.11-py3`.
- Updated the base Docker image for TensorRT-LLM Backend to `nvcr.io/nvidia/tritonserver:24.11-py3`.
- Updated to TensorRT v10.7.
- Updated to CUDA v12.6.3.
- Added support for Python 3.10 and 3.12 to TensorRT-LLM Python wheels on PyPI.
- Updated to ModelOpt v0.21 for Linux platform, while v0.17 is still used on Windows platform.
Known Issues
- There is a known AllReduce performance issue on AMD-based CPU platforms with NCCL 2.23.4, which can be worked around by setting `export NCCL_P2P_LEVEL=SYS`.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.15.0 Release
Hi,
We are very pleased to announce the 0.15.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added support for EAGLE. Refer to `examples/eagle/README.md`.
- Added functional support for GH200 systems.
- Added AutoQ (mixed precision) support.
- Added a `trtllm-serve` command to start a FastAPI based server.
- Added FP8 support for Nemotron NAS 51B. Refer to `examples/nemotron_nas/README.md`.
- Added INT8 support for GPTQ quantization.
- Added TensorRT native support for INT8 Smooth Quantization.
- Added quantization support for Exaone model. Refer to `examples/exaone/README.md`.
- Enabled Medusa for Qwen2 models. Refer to “Medusa with Qwen2” section in `examples/medusa/README.md`.
- Optimized pipeline parallelism with ReduceScatter and AllGather for Mixtral models.
- Added support for `Qwen2ForSequenceClassification` model architecture.
- Added Python plugin support to simplify plugin development efforts. Refer to `examples/python_plugin/README.md`.
- Added different rank dimensions support for LoRA modules when using the Hugging Face format. Thanks for the contribution from @AlessioNetti in #2366.
- Enabled embedding sharing by default. Refer to the "Embedding Parallelism, Embedding Sharing, and Look-Up Plugin" section in `docs/source/performance/perf-best-practices.md` for information about the required conditions for embedding sharing.
- Added support for per-token per-channel FP8 (namely row-wise FP8) on Ada.
- Extended the maximum supported `beam_width` to `256`.
- Added FP8 and INT8 SmoothQuant quantization support for the InternVL2-4B variant (LLM model only). Refer to `examples/multimodal/README.md`.
- Added support for prompt-lookup speculative decoding. Refer to `examples/prompt_lookup/README.md`.
- Integrated the QServe w4a8 per-group/per-channel quantization. Refer to the “w4aINT8 quantization (QServe)” section in `examples/llama/README.md`.
- Added a C++ example for fast logits using the `executor` API. Refer to the "executorExampleFastLogits" section in `examples/cpp/executor/README.md`.
- [BREAKING CHANGE] NVIDIA Volta GPU support is removed in this and future releases.
- Added the following enhancements to the LLM API:
  - [BREAKING CHANGE] Moved the runtime initialization from the first invocation of `LLM.generate` to `LLM.__init__` for better generation performance without warmup.
  - Added `n` and `best_of` arguments to the `SamplingParams` class. These arguments enable returning multiple generations for a single request (see the sketch after this list).
  - Added `ignore_eos`, `detokenize`, `skip_special_tokens`, `spaces_between_special_tokens`, and `truncate_prompt_tokens` arguments to the `SamplingParams` class. These arguments enable more control over the tokenizer behavior.
  - Added support for incremental detokenization to improve the detokenization performance for streaming generation.
  - Added the `enable_prompt_adapter` argument to the `LLM` class and the `prompt_adapter_request` argument for the `LLM.generate` method. These arguments enable prompt tuning.
- Added support for a `gpt_variant` argument to the `examples/gpt/convert_checkpoint.py` file. This enhancement enables checkpoint conversion with more GPT model variants. Thanks to the contribution from @tonylek in #2352.
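A minimal sketch of the new `SamplingParams` arguments. The prompt and model id are placeholders, the argument values are illustrative, and the `outputs[i].text` attribute layout of the returned results is assumed from the LLM API:

```python
from tensorrt_llm import LLM, SamplingParams

# Return 2 of 4 sampled candidates per prompt and keep special tokens
# in the detokenized text. All values here are illustrative only.
params = SamplingParams(
    n=2,
    best_of=4,
    temperature=0.8,
    ignore_eos=False,
    detokenize=True,
    skip_special_tokens=False,
)

if __name__ == "__main__":
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
    for request_output in llm.generate(["The capital of France is"], params):
        for candidate in request_output.outputs:
            print(candidate.text)
```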
API Changes
- [BREAKING CHANGE] Moved the `builder_force_num_profiles` flag in the `trtllm-build` command to the `BUILDER_FORCE_NUM_PROFILES` environment variable.
- [BREAKING CHANGE] Modified defaults for the `BuildConfig` class so that they are aligned with the `trtllm-build` command.
- [BREAKING CHANGE] Removed Python bindings of `GptManager`.
- [BREAKING CHANGE] `auto` is used as the default value for the `--dtype` option in quantize and checkpoint conversion scripts.
- [BREAKING CHANGE] Deprecated the `gptManager` API path in `gptManagerBenchmark`.
- [BREAKING CHANGE] Deprecated the `beam_width` and `num_return_sequences` arguments to the `SamplingParams` class in the LLM API. Use the `n`, `best_of` and `use_beam_search` arguments instead.
- Exposed the `--trust_remote_code` argument to the OpenAI API server. (#2357)
Model Updates
- Added support for Llama 3.2 and Llama 3.2-Vision models. Refer to `examples/mllama/README.md` for more details on the Llama 3.2-Vision model.
- Added support for Deepseek-v2. Refer to `examples/deepseek_v2/README.md`.
- Added support for Cohere Command R models. Refer to `examples/commandr/README.md`.
- Added support for Falcon 2, refer to `examples/falcon/README.md`, thanks to the contribution from @puneeshkhanna in #1926.
- Added support for InternVL2. Refer to `examples/multimodal/README.md`.
- Added support for Qwen2-0.5B and Qwen2.5-1.5B models. (#2388)
- Added support for Minitron. Refer to `examples/nemotron`.
- Added a GPT Variant - Granite (20B and 34B). Refer to the “GPT Variant - Granite” section in `examples/gpt/README.md`.
- Added support for the LLaVA-OneVision model. Refer to the “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in `examples/multimodal/README.md`.
Fixed Issues
- Fixed a slice error in forward function. (#1480)
- Fixed an issue that appears when building BERT. (#2373)
- Fixed an issue where the model is not loaded when building BERT. (#2379)
- Fixed the broken executor examples. (#2294)
- Fixed the issue that the kernel `moeTopK()` cannot find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
- Fixed an assertion failure on `crossKvCacheFraction`. (#2419)
- Fixed an issue when using smoothquant to quantize the Qwen2 model. (#2370)
- Fixed a PDL typo in `docs/source/performance/perf-benchmarking.md`, thanks @MARD1NO for pointing it out in #2425.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.10-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.10-py3`.
- The dependent TensorRT version is updated to 10.6.
- The dependent CUDA version is updated to 12.6.2.
- The dependent PyTorch version is updated to 2.5.1.
- The dependent ModelOpt version is updated to 0.19 for Linux platform, while 0.17 is still used on Windows platform.
Documentation
- Added a copy button for code snippets in the documentation. (#2288)
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.14.0 Release
Hi,
We are very pleased to announce the 0.14.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Enhanced the `LLM` class in the LLM API.
  - Added support for calibration with offline dataset.
  - Added support for Mamba2.
  - Added support for `finish_reason` and `stop_reason` (see the sketch after this list).
- Added FP8 support for CodeLlama.
- Added `__repr__` methods for class `Module`, thanks to the contribution from @1ytic in #2191.
- Added BFloat16 support for fused gated MLP.
- Updated ReDrafter beam search logic to match Apple ReDrafter v1.1.
- Improved `customAllReduce` performance.
- The draft model can now copy logits directly over MPI to the target model's process in `orchestrator` mode. This fast logits copy reduces the delay between draft token generation and the beginning of target model inference.
- NVIDIA Volta GPU support is deprecated and will be removed in a future release.
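A minimal sketch of reading these fields from a generation result. The model id is a placeholder, and the `stop` argument plus the `finish_reason`/`stop_reason` attribute paths on each completion are assumptions based on the LLM API:

```python
from tensorrt_llm import LLM, SamplingParams

if __name__ == "__main__":
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
    outputs = llm.generate(
        ["Write one sentence about GPUs."],
        SamplingParams(max_tokens=64, stop=["\n"]),
    )
    for output in outputs:
        completion = output.outputs[0]
        # finish_reason: why generation ended (e.g. a stop condition or the token limit);
        # stop_reason: which stop string or token triggered it, if any.
        print(completion.text, completion.finish_reason, completion.stop_reason)
```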
API Changes
- [BREAKING CHANGE] The default `max_batch_size` of the `trtllm-build` command is set to `2048`.
- [BREAKING CHANGE] Removed `builder_opt` from the `BuildConfig` class and the `trtllm-build` command.
- Added logits post-processor support to the `ModelRunnerCpp` class.
- Added the `isParticipant` method to the C++ `Executor` API to check if the current process is a participant in the executor instance.
Model Updates
- Added support for NemotronNas, see `examples/nemotron_nas/README.md`.
- Added support for Deepseek-v1, see `examples/deepseek_v1/README.md`.
- Added support for Phi-3.5 models, see `examples/phi/README.md`.
Fixed Issues
- Fixed a typo in `tensorrt_llm/models/model_weights_loader.py`, thanks to the contribution from @wangkuiyi in #2152.
- Fixed a duplicated import module in `tensorrt_llm/runtime/generation.py`, thanks to the contribution from @lkm2835 in #2182.
- Enabled `share_embedding` for the models that have no `lm_head` in the legacy checkpoint conversion path, thanks to the contribution from @lkm2835 in #2232.
- Fixed a `kv_cache_type` issue in the Python benchmark, thanks to the contribution from @qingquansong in #2219.
- Fixed an issue with SmoothQuant calibration with custom datasets. Thanks to the contribution by @Bhuvanesh09 in #2243.
- Fixed an issue surrounding `trtllm-build --fast-build` with fake or random weights. Thanks to @ZJLi2013 for flagging it in #2135.
- Fixed missing `use_fused_mlp` when constructing `BuildConfig` from dict, thanks for the fix from @ethnzhng in #2081.
- Fixed lookahead batch layout for `numNewTokensCumSum`. (#2263)
Infrastructure Changes
- The dependent ModelOpt version is updated to v0.17.
Documentation
- @Sherlock113 added a tech blog to the latest news in #2169, thanks for the contribution.
Known Issues
- Replit Code is not supported with transformers 4.45+.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.13.0 Release
Hi,
We are very pleased to announce the 0.13.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported lookahead decoding (experimental), see `docs/source/speculative_decoding.md`.
- Added some enhancements to the `ModelWeightsLoader` (a unified checkpoint converter, see `docs/source/architecture/model-weights-loader.md`).
  - Supported Qwen models.
  - Supported auto-padding for indivisible TP shape in INT4-wo/INT8-wo/INT4-GPTQ.
  - Improved performance on `*.bin` and `*.pth`.
- Supported OpenAI Whisper in C++ runtime.
- Added some enhancements to the `LLM` class.
  - Supported LoRA.
  - Supported engine building using dummy weights.
  - Supported `trust_remote_code` for customized models and tokenizers downloaded from the Hugging Face Hub (see the sketch after this list).
- Supported beam search for streaming mode.
- Supported tensor parallelism for Mamba2.
- Supported returning generation logits for streaming mode.
- Added `curand` and `bfloat16` support for `ReDrafter`.
- Added sparse mixer normalization mode for MoE models.
- Added support for QKV scaling in FP8 FMHA.
- Supported FP8 for MoE LoRA.
- Supported KV cache reuse for P-Tuning and LoRA.
- Supported in-flight batching for CogVLM models.
- Supported LoRA for the `ModelRunnerCpp` class.
- Supported `head_size=48` cases for FMHA kernels.
- Added FP8 examples for DiT models, see `examples/dit/README.md`.
- Supported decoder with encoder input features for the C++ `executor` API.
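A minimal sketch of the `trust_remote_code` option. The Hugging Face repository id is hypothetical, and the `tensorrt_llm.LLM` import path reflects the current LLM API surface and may differ slightly in this release:

```python
from tensorrt_llm import LLM

if __name__ == "__main__":
    # trust_remote_code allows the tokenizer/model definition shipped in the
    # repository's own Python code to be used; enable it only for repositories
    # you trust. The model id below is hypothetical.
    llm = LLM(model="some-org/custom-chat-model", trust_remote_code=True)
    for output in llm.generate(["Hello!"]):
        print(output.outputs[0].text)
```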
API Changes
- [BREAKING CHANGE] Set `use_fused_mlp` to `True` by default.
- [BREAKING CHANGE] Enabled `multi_block_mode` by default.
- [BREAKING CHANGE] Enabled `strongly_typed` by default in the `builder` API.
- [BREAKING CHANGE] Renamed `maxNewTokens`, `randomSeed` and `minLength` to `maxTokens`, `seed` and `minTokens` following OpenAI style.
- The `LLM` class
  - [BREAKING CHANGE] Updated `LLM.generate` arguments to include `PromptInputs` and `tqdm`.
- The C++ `executor` API
  - [BREAKING CHANGE] Added `LogitsPostProcessorConfig`.
  - Added `FinishReason` to `Result`.
Model Updates
- Supported Gemma 2, see the "Run Gemma 2" section in `examples/gemma/README.md`.
Fixed Issues
- Fixed an accuracy issue when enabling remove padding for cross attention. (#1999)
- Fixed the failure in converting qwen2-0.5b-instruct when using `smoothquant`. (#2087)
- Matched the `exclude_modules` pattern in `convert_utils.py` to the changes in `quantize.py`. (#2113)
- Fixed a build engine error when `FORCE_NCCL_ALL_REDUCE_STRATEGY` is set.
- Fixed unexpected truncation in the quant mode of `gpt_attention`.
- Fixed the hang caused by a race condition when canceling requests.
- Fixed the default factory for `LoraConfig`. (#1323)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
- The dependent TensorRT version is updated to 10.4.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team