feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode #3380
Optimization for Large Embedding Tables in Multimodal Models
Problem Statement
In multimodal models, the embedding tables produced by the multimodal encoder can be large and consume significant GPU memory. Currently, these tables are kept entirely in GPU memory for the duration of processing, which is inefficient and limits scale, especially when processing large batches or long sequences.
Solution
This PR implements a memory optimization that offloads the multimodal embedding table (mm_embedding_table) to CPU memory and prefetches it back to the GPU in chunks during processing. The approach is only available in chunked context mode: because the next chunk is prefetched while the current one is being processed, memory usage drops without compromising performance.
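For intuition, here is a minimal PyTorch sketch of the ping-pong prefetching pattern. The PR's actual implementation lives in the C++ runtime; the names here (process_in_chunks, process_chunk, chunk_size) are illustrative, not the PR's API:

```python
import torch

def process_in_chunks(mm_embedding_table_cpu, chunk_size, process_chunk):
    """mm_embedding_table_cpu: [num_tokens, hidden] tensor in pinned CPU memory."""
    num_tokens, hidden = mm_embedding_table_cpu.shape
    num_chunks = (num_tokens + chunk_size - 1) // chunk_size

    copy_stream = torch.cuda.Stream()   # side stream for async host-to-device copies
    # Two GPU staging buffers that alternate ("ping-pong").
    buffers = [torch.empty(chunk_size, hidden, dtype=mm_embedding_table_cpu.dtype,
                           device="cuda") for _ in range(2)]
    copied = [torch.cuda.Event() for _ in range(2)]  # signals: H2D copy finished
    freed = [torch.cuda.Event() for _ in range(2)]   # signals: buffer reusable

    def prefetch(i):
        slot = i % 2
        src = mm_embedding_table_cpu[i * chunk_size:(i + 1) * chunk_size]
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_event(freed[slot])  # don't clobber a buffer still in use
            buffers[slot][:src.size(0)].copy_(src, non_blocking=True)
            copied[slot].record(copy_stream)

    prefetch(0)
    for i in range(num_chunks):
        slot = i % 2
        if i + 1 < num_chunks:
            prefetch(i + 1)  # overlap the next copy with the current chunk's compute
        torch.cuda.current_stream().wait_event(copied[slot])
        valid = min(chunk_size, num_tokens - i * chunk_size)
        process_chunk(buffers[slot][:valid])  # e.g. feed this chunk's embeddings to the model
        freed[slot].record(torch.cuda.current_stream())
```

Because the source table sits in pinned host memory, the non_blocking copy on the side stream genuinely overlaps with compute, so only two chunk-sized buffers ever live on the GPU.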
Key Changes

Core Implementation
- promptTuningBuffers.h: handles the prompt tuning ping-pong buffers.
- trtGptModelInflightBatching.cpp: implements the asynchronous prefetching and chunk processing.

File Changes

C++ Changes
- promptTuningBuffers.h: Added a new buffer management system with ping-pong buffers for efficient chunk handling.
- trtGptModelInflightBatching.cpp: Implemented asynchronous prefetching and chunk processing mechanisms.

Python Changes
- model_runner_cpp.py: Added multimodal embedding table offloading support with pinned memory allocation.
- multimodal_model_runner.py: Updated the model generation process to move the embedding table to pinned memory (see the sketch after this list).
- utils.py: Added a new --mm_embedding_offloading argument with automatic default configuration.
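The pinned-memory move on the Python side can be pictured with a short sketch; offload_mm_embedding_table is a hypothetical helper name, not the function used in the PR:

```python
import torch

def offload_mm_embedding_table(mm_embedding_table: torch.Tensor) -> torch.Tensor:
    """Return the embedding table in page-locked (pinned) host memory,
    so later chunk-wise host-to-device copies can run asynchronously."""
    if mm_embedding_table.is_cuda:
        mm_embedding_table = mm_embedding_table.cpu()  # bring encoder output off the GPU
    return mm_embedding_table.pin_memory()  # pinned memory enables non_blocking=True copies
```

The runner then hands this CPU-resident table to the C++ runtime, which streams it to the GPU chunk by chunk as sketched earlier.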
Usage

The feature is enabled with the new --mm_embedding_offloading argument. When not specified, it defaults to True when running a multimodal model with chunked context. The optimization is particularly useful for large-scale multimodal applications where GPU memory is a constraint.
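The automatic default can be pictured as follows. This is only a sketch of the behavior described above; the is_multimodal_model and enable_chunked_context placeholders stand in for values the real utils.py derives from the run configuration:

```python
import argparse

def str2bool(v: str) -> bool:
    # argparse's type=bool treats any non-empty string as True, so parse explicitly.
    return v.lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("--mm_embedding_offloading", type=str2bool, default=None,
                    help="Offload the multimodal embedding table to CPU "
                         "with chunked prefetching")
args = parser.parse_args()

# Placeholders for illustration; the real script derives these from the run config.
is_multimodal_model = True
enable_chunked_context = True

if args.mm_embedding_offloading is None:
    # Unspecified: enable only for multimodal models running with chunked context.
    args.mm_embedding_offloading = is_multimodal_model and enable_chunked_context
```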