
Releases: NVIDIA/TensorRT-LLM

v0.20.0rc0

23 Apr 15:42
b16a127
Pre-release

Highlights

  • Model Support
    • Added Nemotron-H model support (#3430)
    • Added Dynasor-CoT in scaffolding examples (#3501)
  • Features
    • Added stream generation task scaffolding examples (#3527)
    • Added unfused RoPE support in MLA (#3610)
    • Multimodal models
      • Added support in trtllm-serve (#3590); see the usage sketch after this list
      • Added support in trtllm-bench; currently limited to image inputs only (#3490)
    • [Experimental] The TensorRT-LLM Triton backend now supports the LLM API (triton-inference-server/tensorrtllm_backend#742)
  • Performance
    • Optimized Large Embedding Tables in Multimodal Models (#3380)
  • Infra
    • The dependent datasets package version was upgraded to 3.1.0 (#3490)
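As referenced in the multimodal items above, here is a minimal sketch of querying a vision-language model served with trtllm-serve through its OpenAI-compatible endpoint. The port, model name, and image URL are placeholders, not defaults:

```python
# Sketch: query a multimodal model served via `trtllm-serve` (OpenAI-compatible API).
# Endpoint, model name, and image URL below are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```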


v0.19.0rc0

18 Apr 23:19
258ae9c
Pre-release
  • Model Support
    • Added Llama 4 support. (#3302)
    • Added support for Phi‑4‑MM (#3296)
    • Added Gemma3 text‑only model support. Refer to "Run Gemma 3" section at examples/gemma/README.md. (#3247)
    • Added Qwen2.5‑VL support for PyTorch workflow and refactored Qwen2‑VL (#3156)
  • Features
    • Added FP8 support for SM120 architecture (#3248)
    • Registered ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options (#3343)
    • Made the scaffolding Controller more generic (#3416)
    • Breaking change: Added individual gatherContext support for each additional output (#3374)
    • Added trtllm‑gen FP4 GEMM for the PyTorch workflow (#3423)
    • Added Qwen2 MoE support for PyTorch flow (#3369)
    • Enabled PyExecutor inference flow to estimate max_num_tokens for kv_cache_manager (#3092)
    • Added TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD environment variables for debugging (#3417); see the sketch after this list
    • Applied the PyTorch workflow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators (#3151)
    • Introduced a UserBuffers allocator for PyTorch flow (#3257)
    • Supported aborting disconnected requests (#3214)
    • Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs (#3190)
    • Added an option to run disaggregated serving without context servers (#3243)
    • Enhanced RoPE support in AutoDeploy (#3115)
    • Fixed and improved allreduce and fusion kernels (#3064)
    • Added DeepSeek-V3 support in AutoDeploy (#3281)
    • Enhanced the robustness of scaffolding integration via init.py (#3312)
  • API
    • Added numNodes to ParallelConfig (#3346)
    • Redesigned the multi‑stream API for DeepSeek (#3459)
  • Bug fixes
    • Fixed an incorrect import of KvCacheConfig in examples/gpqa_llmapi.py (#3369)
    • Fixed the test name (#3534)
    • Fixed max_seq_len in executor_config (#3487)
    • Removed a duplicated line of code (#3523)
    • Disabled kv cache reuse for the prompt tuning test (#3474)
    • Fixed the issue of a first‑generation token being returned twice in streaming (#3427)
    • Added kv memory size per token calculation in the draft model (#3497)
    • Switched ZMQ from a file socket to a TCP socket in RemoteMpiCommSession (#3462)
    • Fixed PP for Llama (#3449)
    • Updated the default excluded_modules value for the fp8rowwise recipe (#3477)
    • Fixed disaggregation MTP with overlap (#3406)
    • Stopped memory estimation in start_attention (#3485)
    • Allowed the context_and_generation request type in disaggregated overlap (#3489)
    • Fixed the partial match issue (#3413)
    • Fixed Eagle decoding (#3456)
    • Fixed the py_decoding_iter update in the decoder (#3297)
    • Fixed the beam search diversity issue (#3375)
    • Updated ucxx to avoid occasional segfaults when profiling (#3420)
    • Fixed redrafter sampling (#3278)
    • Fixed mllama end‑to‑end PyTorch flow (#3397)
    • Reverted an extra CMake variable (#3351)
    • Fixed issues with the fused MoE path (#3435)
    • Fixed conflicting test names (#3316)
    • Fixed failing DeepSeek-V3 unit tests (#3385)
    • Fixed missing bias addition for FP4Linear (#3361)
    • Fixed the runtime error in test_deepseek_allreduce.py (#3226)
    • Fixed speculative decoding and multimodal input support (#3276)
    • Fixed PyTorch nvsmall via PyExecutor and improved TP support (#3238)
    • Fixed the p‑tuning test bug (#3326)
  • Performance
    • Cached sin and cos in the model instead of using a global LRU cache (#3378)
    • Deallocated tensors after use in MLA (#3286)
    • Enabled DeepGEMM by default (#3341)
    • Added a thread leak check and fixed thread/memory leak issues (#3270)
    • Used cudaMalloc to allocate kvCache (#3303)
    • Made ipc_periodically the default responses_handler (breaking change) (#3102)
    • Used NVRTC for DeepGEMM JIT compilation (#3239)
    • Optimized quantization kernels used in DeepSeek on Hopper (#3466)
  • Documentation
    • Added an example section for the multi‑node DeepSeek R1 benchmark on GB200 (#3519)
    • Documented disaggregation performance tuning (#3516)
    • Updated the perf‑benchmarking documentation for GPU configuration (#3458)
    • Updated the README and added a benchmarking blog for DeepSeek‑R1 (#3232)
    • Updated the documentation for using Draft‑Target‑Model (DTM) (#3366)
    • Updated the README for disaggregated serving (#3323)
    • Updated instructions to enable FP8 MLA for Deepseek. (#3488)
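As referenced in the Features list, the new debugging environment variables can be set before constructing the model. The values and exact semantics shown here are assumptions; see #3417 for the authoritative behavior:

```python
# Sketch only: the accepted values/semantics of these variables are defined in PR #3417.
import os

os.environ["TLLM_OVERRIDE_LAYER_NUM"] = "4"   # assumed: run a reduced number of layers while debugging
os.environ["TLLM_TRACE_MODEL_FORWARD"] = "1"  # assumed: emit tracing information for model forward passes

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```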

Full change log: 5aeef6d...258ae9c.

TensorRT-LLM Release 0.18.2

16 Apr 06:47
5aec7af

Key Features and Enhancements

TensorRT-LLM Release 0.18.1

09 Apr 01:11
62f3c95

Key Features and Enhancements

  • The 0.18.x series of releases builds upon the 0.17.0 release, focusing exclusively on dependency updates without incorporating features from the previous 0.18.0.dev pre-releases. These features will be included in future stable releases.

Infrastructure Changes

  • The dependent transformers package version is updated to 4.48.3.

TensorRT-LLM Release 0.18.0

02 Apr 09:08
3c04620

Hi,

We are very pleased to announce the 0.18.0 version of TensorRT-LLM. This update includes:

Key Features and Enhancements

  • Features that were previously available in the 0.18.0.dev pre-releases are not included in this release.
  • [BREAKING CHANGE] Windows platform support is deprecated as of v0.18.0. All Windows-related code and functionality will be completely removed in future releases.

Known Issues

  • The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the PyTorch NGC Container for optimal support on SBSA platforms.

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.03-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.03-py3.
  • The dependent TensorRT version is updated to 10.9.
  • The dependent CUDA version is updated to 12.8.1.
  • The dependent NVIDIA ModelOpt version is updated to 0.25 for the Linux platform.

TensorRT-LLM Release 0.17.0

07 Feb 05:59

Hi,

We are very pleased to announce the 0.17.0 version of TensorRT-LLM. This update includes:

Key Features and Enhancements

  • Blackwell support
    • NOTE: For TensorRT-LLM 0.17, pip installation is not supported on Blackwell platforms. Instead, it is recommended to build from source using the NVIDIA NGC 25.01 PyTorch container.
    • Added support for B200.
    • Added support for GeForce RTX 50 series using Windows Subsystem for Linux (WSL) for limited models.
    • Added NVFP4 Gemm support for Llama and Mixtral models.
    • Added NVFP4 support for the LLM API and trtllm-bench command.
    • GB200 NVL is not fully supported.
    • Added a benchmark script to measure the performance benefits of KV cache host offload, with expected runtime improvements on GH200.
  • PyTorch workflow
    • The PyTorch workflow is an experimental feature in tensorrt_llm._torch. The following is a list of supported infrastructure, models, and features that can be used with the PyTorch workflow.
    • Added support for H100/H200/B200.
    • Added support for Llama models, Mixtral, Qwen, and VILA.
    • Added support for FP16/BF16/FP8/NVFP4 Gemm, fused Mixture-of-Experts (MoE), and FP16/BF16/FP8 KV cache.
    • Added custom context and decoding attention kernels support via PyTorch custom op.
    • Added support for chunked context (default off).
    • Added CudaGraph support for decoding only.
    • Added overlap scheduler support, which overlaps input preparation and model forward passes by decoding one extra token.
  • Added FP8 context FMHA support for the W4A8 quantization workflow.
  • Added ModelOpt quantized checkpoint support for the LLM API (see the sketch after this list).
  • Added FP8 support for the Llama-3.2 VLM model. Refer to the “MLLaMA” section in examples/multimodal/README.md.
  • Added PDL support for userbuffer based AllReduce-Norm fusion kernel.
  • Added runtime support for seamless lookahead decoding.
  • Added token-aligned arbitrary output tensors support for the C++ executor API.
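A minimal sketch of loading a ModelOpt-quantized checkpoint through the LLM API, as referenced above. The checkpoint path and prompt are placeholders:

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder path: a checkpoint quantized/exported with NVIDIA ModelOpt (e.g. FP8).
llm = LLM(model="/path/to/modelopt_quantized_ckpt")
for output in llm.generate(["The capital of France is"], SamplingParams(max_tokens=16)):
    print(output.outputs[0].text)
```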

API Changes

  • [BREAKING CHANGE] KV cache reuse is enabled automatically when paged_context_fmha is enabled.
  • Added --concurrency support for the throughput subcommand of trtllm-bench.

Fixed Issues

  • Fixed incorrect LoRA output dimension. Thanks for the contribution from @akhoroshev in #2484.
  • Added NVIDIA H200 GPU into the cluster_key for auto parallelism feature. (#2552)
  • Fixed a typo in the __post_init__ function of the LlmArgs class. Thanks for the contribution from @topenkoff in #2691.
  • Fixed workspace size issue in the GPT attention plugin. Thanks for the contribution from @AIDC-AI.
  • Fixed Deepseek-V2 model accuracy.

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.01-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.01-py3.
  • The dependent TensorRT version is updated to 10.8.0.
  • The dependent CUDA version is updated to 12.8.0.
  • The dependent ModelOpt version is updated to 0.23 for the Linux platform, while 0.17 is still used on the Windows platform.

Known Issues

  • The --extra-index-url https://pypi.nvidia.com flag is needed when running pip install tensorrt-llm due to new third-party dependencies.
  • The PyPI SBSA wheel is incompatible with PyTorch 2.5.1 due to a break in the PyTorch ABI/API, as detailed in the related GitHub issue.

TensorRT-LLM Release 0.16.0

24 Dec 08:07
42a7b09

Hi,

We are very pleased to announce the 0.16.0 version of TensorRT-LLM. This update includes:

Key Features and Enhancements

  • Added guided decoding support with XGrammar backend.
  • Added quantization support for RecurrentGemma. Refer to examples/recurrentgemma/README.md.
  • Added Ulysses context parallelism support. Refer to the example of building LLaMA 7B with 2-way tensor parallelism and 2-way context parallelism in examples/llama/README.md.
  • Added W4A8 quantization support to BF16 models on Ada (SM89).
  • Added PDL support for the FP8 GEMM plugins.
  • Added a runtime max_num_tokens dynamic tuning feature, which can be enabled by passing --enable_max_num_tokens_tuning to gptManagerBenchmark.
  • Added typical acceptance support for EAGLE.
  • Supported enabling chunked context and sliding window attention together.
  • Added head size 64 support for the XQA kernel.
  • Added the following features to the LLM API:
    • Lookahead decoding.
    • DeepSeek V1 support.
    • Medusa support.
    • max_num_tokens and max_batch_size arguments to control the runtime parameters (see the sketch after this list).
    • extended_runtime_perf_knob_config to enable various performance configurations.
  • Added LogN scaling support for Qwen models.
  • Added AutoAWQ checkpoints support for Qwen. Refer to the “INT4-AWQ” section in examples/qwen/README.md.
  • Added AutoAWQ and AutoGPTQ Hugging Face checkpoints support for LLaMA. (#2458)
  • Added allottedTimeMs to the C++ Request class to support per-request timeout.
  • [BREAKING CHANGE] Removed NVIDIA V100 GPU support.
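A sketch of the new runtime-parameter arguments on the LLM API, as referenced above. The model name and values are illustrative only:

```python
from tensorrt_llm import LLM

# Illustrative values; tune max_batch_size / max_num_tokens for your workload and GPU memory.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_batch_size=8,       # maximum number of requests batched at runtime
    max_num_tokens=4096,    # maximum number of tokens processed per forward pass
)
```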

API Changes

  • [BREAKING CHANGE] Removed enable_xqa argument from trtllm-build.
  • [BREAKING CHANGE] Chunked context is enabled by default when KV cache and paged context FMHA are enabled on non-RNN based models.
  • [BREAKING CHANGE] Enabled embedding sharing automatically when possible and removed the --use_embedding_sharing flag from the checkpoint conversion scripts.
  • [BREAKING CHANGE] The if __name__ == "__main__" entry point is required for both single-GPU and multi-GPU cases when using the LLM API (see the sketch after this list).
  • [BREAKING CHANGE] Cancelled requests now return empty results.
  • Added the enable_chunked_prefill flag to the LlmArgs of the LLM API.
  • Integrated BERT and RoBERTa models to the trtllm-build command.
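The entry-point requirement mentioned above looks like this in practice (the model name is a placeholder):

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)

# Required for both single-GPU and multi-GPU runs: the LLM API spawns worker
# processes, so top-level code must be guarded against re-execution on import.
if __name__ == "__main__":
    main()
```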

Model Updates

  • Added Qwen2-VL support. Refer to the “Qwen2-VL” section of examples/multimodal/README.md.
  • Added multimodal evaluation examples. Refer to examples/multimodal.
  • Added Stable Diffusion XL support. Refer to examples/sdxl/README.md. Thanks for the contribution from @Zars19 in #1514.

Fixed Issues

  • Fixed unnecessary batch logits post processor calls. (#2439)
  • Fixed a typo in the error message. (#2473)
  • Fixed the in-place clamp operation usage in smooth quant. Thanks for the contribution from @StarrickLiu in #2485.
  • Fixed sampling_params to only be set up if end_id is None and tokenizer is not None in the LLM API. Thanks to the contribution from @mfuntowicz in #2573.

Infrastructure Changes

  • Updated the base Docker image for TensorRT-LLM to nvcr.io/nvidia/pytorch:24.11-py3.
  • Updated the base Docker image for TensorRT-LLM Backend to nvcr.io/nvidia/tritonserver:24.11-py3.
  • Updated to TensorRT v10.7.
  • Updated to CUDA v12.6.3.
  • Added support for Python 3.10 and 3.12 to TensorRT-LLM Python wheels on PyPI.
  • Updated to ModelOpt v0.21 for the Linux platform, while v0.17 is still used on the Windows platform.

Known Issues

  • There is a known AllReduce performance issue on AMD-based CPU platforms with NCCL 2.23.4, which can be worked around by setting export NCCL_P2P_LEVEL=SYS.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

TensorRT-LLM 0.15.0 Release

04 Dec 06:46
8f91cff

Hi,

We are very pleased to announce the 0.15.0 version of TensorRT-LLM. This update includes:

Key Features and Enhancements

  • Added support for EAGLE. Refer to examples/eagle/README.md.
  • Added functional support for GH200 systems.
  • Added AutoQ (mixed precision) support.
  • Added a trtllm-serve command to start a FastAPI-based server.
  • Added FP8 support for Nemotron NAS 51B. Refer to examples/nemotron_nas/README.md.
  • Added INT8 support for GPTQ quantization.
  • Added TensorRT native support for INT8 Smooth Quantization.
  • Added quantization support for Exaone model. Refer to examples/exaone/README.md.
  • Enabled Medusa for Qwen2 models. Refer to “Medusa with Qwen2” section in examples/medusa/README.md.
  • Optimized pipeline parallelism with ReduceScatter and AllGather for Mixtral models.
  • Added support for Qwen2ForSequenceClassification model architecture.
  • Added Python plugin support to simplify plugin development efforts. Refer to examples/python_plugin/README.md.
  • Added different rank dimensions support for LoRA modules when using the Hugging Face format. Thanks for the contribution from @AlessioNetti in #2366.
  • Enabled embedding sharing by default. Refer to "Embedding Parallelism, Embedding Sharing, and Look-Up Plugin" section in docs/source/performance/perf-best-practices.md for information about the required conditions for embedding sharing.
  • Added support for per-token per-channel FP8 (namely row-wise FP8) on Ada.
  • Extended the maximum supported beam_width to 256.
  • Added FP8 and INT8 SmoothQuant quantization support for the InternVL2-4B variant (LLM model only). Refer to examples/multimodal/README.md.
  • Added support for prompt-lookup speculative decoding. Refer to examples/prompt_lookup/README.md.
  • Integrated the QServe w4a8 per-group/per-channel quantization. Refer to “w4aINT8 quantization (QServe)” section in examples/llama/README.md.
  • Added a C++ example for fast logits using the executor API. Refer to “executorExampleFastLogits” section in examples/cpp/executor/README.md.
  • [BREAKING CHANGE] NVIDIA Volta GPU support is removed in this and future releases.
  • Added the following enhancements to the LLM API:
    • [BREAKING CHANGE] Moved the runtime initialization from the first invocation of LLM.generate to LLM.__init__ for better generation performance without warmup.
    • Added n and best_of arguments to the SamplingParams class. These arguments enable returning multiple generations for a single request.
    • Added ignore_eos, detokenize, skip_special_tokens, spaces_between_special_tokens, and truncate_prompt_tokens arguments to the SamplingParams class. These arguments enable more control over the tokenizer behavior.
    • Added support for incremental detokenization to improve the detokenization performance for streaming generation.
    • Added the enable_prompt_adapter argument to the LLM class and the prompt_adapter_request argument for the LLM.generate method. These arguments enable prompt tuning.
  • Added support for a gpt_variant argument to the examples/gpt/convert_checkpoint.py file. This enhancement enables checkpoint conversion with more GPT model variants. Thanks to the contribution from @tonylek in #2352.

API Changes

  • [BREAKING CHANGE] Moved the flag builder_force_num_profiles in trtllm-build command to the BUILDER_FORCE_NUM_PROFILES environment variable.
  • [BREAKING CHANGE] Modified defaults for BuildConfig class so that they are aligned with the trtllm-build command.
  • [BREAKING CHANGE] Removed Python bindings of GptManager.
  • [BREAKING CHANGE] auto is used as the default value for the --dtype option in the quantization and checkpoint conversion scripts.
  • [BREAKING CHANGE] Deprecated gptManager API path in gptManagerBenchmark.
  • [BREAKING CHANGE] Deprecated the beam_width and num_return_sequences arguments to the SamplingParams class in the LLM API. Use the n, best_of, and use_beam_search arguments instead (see the sketch after this list).
  • Exposed --trust_remote_code argument to the OpenAI API server. (#2357)
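A sketch of the replacement arguments for the deprecated beam_width / num_return_sequences, as referenced above; the values are illustrative:

```python
from tensorrt_llm import SamplingParams

# Beam search with beam width 4, returning the best 2 sequences
# (replaces the deprecated beam_width / num_return_sequences arguments).
beam_params = SamplingParams(n=2, best_of=4, use_beam_search=True, max_tokens=64)

# Regular sampling that keeps generating past the EOS token.
sampling_params = SamplingParams(max_tokens=64, ignore_eos=True)
```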

Model Updates

  • Added support for the Llama 3.2 and Llama 3.2-Vision models. Refer to examples/mllama/README.md for more details on the Llama 3.2-Vision model.
  • Added support for Deepseek-v2. Refer to examples/deepseek_v2/README.md.
  • Added support for Cohere Command R models. Refer to examples/commandr/README.md.
  • Added support for Falcon 2. Refer to examples/falcon/README.md. Thanks to the contribution from @puneeshkhanna in #1926.
  • Added support for InternVL2. Refer to examples/multimodal/README.md.
  • Added support for the Qwen2-0.5B and Qwen2.5-1.5B models. (#2388)
  • Added support for Minitron. Refer to examples/nemotron.
  • Added a GPT variant - Granite (20B and 34B). Refer to the “GPT Variant - Granite” section in examples/gpt/README.md.
  • Added support for LLaVA-OneVision model. Refer to “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in examples/multimodal/README.md.

Fixed Issues

  • Fixed a slice error in the forward function. (#1480)
  • Fixed an issue that appears when building BERT. (#2373)
  • Fixed an issue where the model is not loaded when building BERT. (#2379)
  • Fixed the broken executor examples. (#2294)
  • Fixed an issue where the moeTopK() kernel cannot find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
  • Fixed an assertion failure on crossKvCacheFraction. (#2419)
  • Fixed an issue when using SmoothQuant to quantize the Qwen2 model. (#2370)
  • Fixed a PDL typo in docs/source/performance/perf-benchmarking.md, thanks @MARD1NO for pointing it out in #2425.

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.10-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.10-py3.
  • The dependent TensorRT version is updated to 10.6.
  • The dependent CUDA version is updated to 12.6.2.
  • The dependent PyTorch version is updated to 2.5.1.
  • The dependent ModelOpt version is updated to 0.19 for the Linux platform, while 0.17 is still used on the Windows platform.

Documentation

  • Added a copy button for code snippets in the documentation. (#2288)

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

TensorRT-LLM 0.14.0 Release

01 Nov 12:01
b088016

Hi,

We are very pleased to announce the 0.14.0 version of TensorRT-LLM. This update includes:

Key Features and Enhancements

  • Enhanced the LLM class in the LLM API.
    • Added support for calibration with offline dataset.
    • Added support for Mamba2.
    • Added support for finish_reason and stop_reason (see the sketch after this list).
  • Added FP8 support for CodeLlama.
  • Added __repr__ methods to the Module class, thanks to the contribution from @1ytic in #2191.
  • Added BFloat16 support for fused gated MLP.
  • Updated ReDrafter beam search logic to match Apple ReDrafter v1.1.
  • Improved customAllReduce performance.
  • The draft model can now copy logits directly over MPI to the target model's process in orchestrator mode. This fast logits copy reduces the delay between draft token generation and the beginning of target model inference.
  • NVIDIA Volta GPU support is deprecated and will be removed in a future release.
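A sketch of reading the new finish_reason / stop_reason fields from a generation result, as referenced above; the model name is a placeholder and the attribute layout follows the LLM API:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="codellama/CodeLlama-7b-hf")  # placeholder model
result = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))[0]
completion = result.outputs[0]
# finish_reason / stop_reason were added to generation results in this release.
print(completion.text, completion.finish_reason, completion.stop_reason)
```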

API Changes

  • [BREAKING CHANGE] The default max_batch_size of the trtllm-build command is set to 2048.
  • [BREAKING CHANGE] Removed builder_opt from the BuildConfig class and the trtllm-build command.
  • Added logits post-processor support to the ModelRunnerCpp class.
  • Added isParticipant method to the C++ Executor API to check if the current process is a participant in the executor instance.

Model Updates

  • Added support for NemotronNas, see examples/nemotron_nas/README.md.
  • Added support for Deepseek-v1, see examples/deepseek_v1/README.md.
  • Added support for Phi-3.5 models, see examples/phi/README.md.

Fixed Issues

  • Fixed a typo in tensorrt_llm/models/model_weights_loader.py, thanks to the contribution from @wangkuiyi in #2152.
  • Fixed duplicated import module in tensorrt_llm/runtime/generation.py, thanks to the contribution from @lkm2835 in #2182.
  • Enabled share_embedding for the models that have no lm_head in legacy checkpoint conversion path, thanks to the contribution from @lkm2835 in #2232.
  • Fixed kv_cache_type issue in the Python benchmark, thanks to the contribution from @qingquansong in #2219.
  • Fixed an issue with SmoothQuant calibration with custom datasets. Thanks to the contribution by @Bhuvanesh09 in #2243.
  • Fixed an issue surrounding trtllm-build --fast-build with fake or random weights. Thanks to @ZJLi2013 for flagging it in #2135.
  • Fixed missing use_fused_mlp when constructing BuildConfig from dict, thanks for the fix from @ethnzhng in #2081.
  • Fixed lookahead batch layout for numNewTokensCumSum. (#2263)

Infrastructure Changes

  • The dependent ModelOpt version is updated to v0.17.

Known Issues

  • Replit Code is not supported with transformers 4.45+.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

TensorRT-LLM 0.13.0 Release

30 Sep 08:37
201135e

Hi,

We are very pleased to announce the 0.13.0 version of TensorRT-LLM. This update includes:

Key Features and Enhancements

  • Supported lookahead decoding (experimental), see docs/source/speculative_decoding.md.
  • Added some enhancements to the ModelWeightsLoader (a unified checkpoint converter, see docs/source/architecture/model-weights-loader.md).
    • Supported Qwen models.
    • Supported auto-padding for indivisible TP shape in INT4-wo/INT8-wo/INT4-GPTQ.
    • Improved performance when loading *.bin and *.pth checkpoints.
  • Supported OpenAI Whisper in C++ runtime.
  • Added some enhancements to the LLM class.
    • Supported LoRA.
    • Supported engine building using dummy weights.
    • Supported trust_remote_code for customized models and tokenizers downloaded from Hugging Face Hub (see the sketch after this list).
  • Supported beam search for streaming mode.
  • Supported tensor parallelism for Mamba2.
  • Supported returning generation logits for streaming mode.
  • Added curand and bfloat16 support for ReDrafter.
  • Added sparse mixer normalization mode for MoE models.
  • Added support for QKV scaling in FP8 FMHA.
  • Supported FP8 for MoE LoRA.
  • Supported KV cache reuse for P-Tuning and LoRA.
  • Supported in-flight batching for CogVLM models.
  • Supported LoRA for the ModelRunnerCpp class.
  • Supported head_size=48 cases for FMHA kernels.
  • Added FP8 examples for DiT models, see examples/dit/README.md.
  • Supported decoder with encoder input features for the C++ executor API.
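A sketch of the trust_remote_code option on the LLM class, as referenced above; the repository id is a placeholder:

```python
from tensorrt_llm import LLM

# Needed when the Hugging Face model or tokenizer ships custom Python code.
llm = LLM(model="some-org/custom-model", trust_remote_code=True)
```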

API Changes

  • [BREAKING CHANGE] Set use_fused_mlp to True by default.
  • [BREAKING CHANGE] Enabled multi_block_mode by default.
  • [BREAKING CHANGE] Enabled strongly_typed by default in builder API.
  • [BREAKING CHANGE] Renamed maxNewTokens, randomSeed and minLength to maxTokens, seed and minTokens following OpenAI style.
  • The LLM class
    • [BREAKING CHANGE] Updated LLM.generate arguments to include PromptInputs and tqdm.
  • The C++ executor API
    • [BREAKING CHANGE] Added LogitsPostProcessorConfig.
    • Added FinishReason to Result.

Model Updates

  • Supported Gemma 2, see "Run Gemma 2" section in examples/gemma/README.md.

Fixed Issues

  • Fixed an accuracy issue when enabling remove padding for cross attention. (#1999)
  • Fixed the failure in converting qwen2-0.5b-instruct when using smoothquant. (#2087)
  • Matched the exclude_modules pattern in convert_utils.py to the changes in quantize.py. (#2113)
  • Fixed build engine error when FORCE_NCCL_ALL_REDUCE_STRATEGY is set.
  • Fixed unexpected truncation in the quant mode of gpt_attention.
  • Fixed the hang caused by race condition when canceling requests.
  • Fixed the default factory for LoraConfig. (#1323)

Infrastructure Changes

  • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.07-py3.
  • Base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.07-py3.
  • The dependent TensorRT version is updated to 10.4.0.
  • The dependent CUDA version is updated to 12.5.1.
  • The dependent PyTorch version is updated to 2.4.0.
  • The dependent ModelOpt version is updated to v0.15.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team