feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode #3380
Optimization for Large Embedding Tables in Multimodal Models
Problem Statement
In multimodal models, the embedding tables produced by the multimodal encoder can be large and consume significant GPU memory. Currently, these tables are kept entirely in GPU memory for the duration of processing, which is inefficient and limits scale, especially when processing large batches or long sequences.
Solution
This PR implements a memory optimization that offloads the multimodal embedding table (mm_embedding_table) to CPU memory and prefetches it back to the GPU in chunks during processing. The approach is only available in chunked context mode: because the next chunk is prefetched while the current one is being processed, memory usage drops without compromising performance.
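For intuition, here is a minimal PyTorch sketch of the ping-pong prefetching pattern. The PR's actual implementation lives in the C++ runtime; the names here (process_in_chunks, process_chunk, chunk_size) are illustrative, not the PR's API:

```python
import torch

def process_in_chunks(mm_embedding_table_cpu, chunk_size, process_chunk):
    """mm_embedding_table_cpu: [num_tokens, hidden] tensor in pinned CPU memory."""
    num_tokens, hidden = mm_embedding_table_cpu.shape
    num_chunks = (num_tokens + chunk_size - 1) // chunk_size

    copy_stream = torch.cuda.Stream()   # side stream for async host-to-device copies
    # Two GPU staging buffers that alternate ("ping-pong").
    buffers = [torch.empty(chunk_size, hidden, dtype=mm_embedding_table_cpu.dtype,
                           device="cuda") for _ in range(2)]
    copied = [torch.cuda.Event() for _ in range(2)]  # signals: H2D copy finished
    freed = [torch.cuda.Event() for _ in range(2)]   # signals: buffer reusable

    def prefetch(i):
        slot = i % 2
        src = mm_embedding_table_cpu[i * chunk_size:(i + 1) * chunk_size]
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_event(freed[slot])  # don't clobber a buffer still in use
            buffers[slot][:src.size(0)].copy_(src, non_blocking=True)
            copied[slot].record(copy_stream)

    prefetch(0)
    for i in range(num_chunks):
        slot = i % 2
        if i + 1 < num_chunks:
            prefetch(i + 1)  # overlap the next copy with the current chunk's compute
        torch.cuda.current_stream().wait_event(copied[slot])
        valid = min(chunk_size, num_tokens - i * chunk_size)
        process_chunk(buffers[slot][:valid])  # e.g. feed this chunk's embeddings to the model
        freed[slot].record(torch.cuda.current_stream())
```

Because the source table sits in pinned host memory, the non_blocking copy on the side stream genuinely overlaps with compute, so only two chunk-sized buffers ever live on the GPU.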
Key Changes

Core Implementation
- promptTuningBuffers.h: handles the prompt tuning ping-pong buffers.
- trtGptModelInflightBatching.cpp: implements the asynchronous prefetching and chunk processing.

File Changes

C++ Changes
- promptTuningBuffers.h: Added a new buffer management system with ping-pong buffers for efficient chunk handling.
- trtGptModelInflightBatching.cpp: Implemented asynchronous prefetching and chunk processing mechanisms.

Python Changes
- model_runner_cpp.py: Added multimodal embedding table offloading support with pinned memory allocation.
- multimodal_model_runner.py: Updated the model generation process to move the embedding table to pinned memory (see the sketch after this list).
- utils.py: Added a new --mm_embedding_offloading argument with automatic default configuration.
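The pinned-memory move on the Python side can be pictured with a short sketch; offload_mm_embedding_table is a hypothetical helper name, not the function used in the PR:

```python
import torch

def offload_mm_embedding_table(mm_embedding_table: torch.Tensor) -> torch.Tensor:
    """Return the embedding table in page-locked (pinned) host memory,
    so later chunk-wise host-to-device copies can run asynchronously."""
    if mm_embedding_table.is_cuda:
        mm_embedding_table = mm_embedding_table.cpu()  # bring encoder output off the GPU
    return mm_embedding_table.pin_memory()  # pinned memory enables non_blocking=True copies
```

The runner then hands this CPU-resident table to the C++ runtime, which streams it to the GPU chunk by chunk as sketched earlier.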
Usage

The feature is enabled with the new --mm_embedding_offloading argument. When not specified, it defaults to True when running a multimodal model with chunked context. The optimization is particularly useful for large-scale multimodal applications where GPU memory is a constraint.
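The automatic default can be pictured as follows. This is only a sketch of the behavior described above; the is_multimodal_model and enable_chunked_context placeholders stand in for values the real utils.py derives from the run configuration:

```python
import argparse

def str2bool(v: str) -> bool:
    # argparse's type=bool treats any non-empty string as True, so parse explicitly.
    return v.lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("--mm_embedding_offloading", type=str2bool, default=None,
                    help="Offload the multimodal embedding table to CPU "
                         "with chunked prefetching")
args = parser.parse_args()

# Placeholders for illustration; the real script derives these from the run config.
is_multimodal_model = True
enable_chunked_context = True

if args.mm_embedding_offloading is None:
    # Unspecified: enable only for multimodal models running with chunked context.
    args.mm_embedding_offloading = is_multimodal_model and enable_chunked_context
```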