
feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode #3380


Merged — 11 commits merged into NVIDIA:main on Apr 21, 2025

Conversation

@katec846 (Collaborator) commented Apr 8, 2025

Optimization for Large Embedding Tables in Multimodal Models

Problem Statement

In multimodal models, large embedding tables generated by the multimodal encoder consume significant GPU memory. Currently, these tables are stored entirely in GPU memory during processing, which can be inefficient and limiting, especially for models processing large batches or long sequences.

Solution

This PR implements a memory optimization that offloads the multimodal model embedding table (mm_embedding_table) to CPU memory and prefetches it back to the GPU in chunks during processing. The approach is only available in context-chunk (chunked prefill) mode; because the next chunk is prefetched while the current one is being processed, GPU memory usage drops without compromising performance.

Key Changes

Core Implementation

  • Added new buffer management system in promptTuningBuffers.h for handling prompt tuning ping-pong buffers.
  • Implemented chunk-based prefetching and processing in trtGptModelInflightBatching.cpp.
  • Added support for mm_embedding_table offloading in Python interface layers.

File Changes

  1. C++ Changes

    • promptTuningBuffers.h: Added new buffer management system with ping-pong buffers for efficient chunk handling.
    • trtGptModelInflightBatching.cpp: Implemented asynchronous prefetching and chunk processing mechanisms.
    • Various other C++ files: Added supporting functionality for buffer management and data transfer.
  2. Python Changes

    • model_runner_cpp.py: Added multimodal model embedding table offloading support with pinned memory allocation.
    • multimodal_model_runner.py: Updated the model generation process and moved the embedding table to pinned memory.
    • utils.py: Added new --mm_embedding_offloading argument with automatic default configuration.

Usage

The feature can be enabled using the new --mm_embedding_offloading argument. When not specified, it defaults to True if using a multimodal model with chunked context. This optimization is particularly useful for large-scale multimodal applications where GPU memory is a constraint.
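The default-resolution rule described above (unset flag means "enable for multimodal models with chunked context") can be sketched with argparse. The function and parameter names here are hypothetical; the real utils.py may wire this up differently.

```python
import argparse

def build_parser():
    # Hypothetical sketch of the new flag; accepts "true"/"false" style values.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mm_embedding_offloading",
        type=lambda s: s.lower() in ("true", "1", "yes"),
        default=None,  # None means "not specified"; resolved below
        help="Offload the multimodal embedding table to CPU "
             "(defaults to True for multimodal models with chunked context)")
    return parser

def resolve_mm_embedding_offloading(args, is_multimodal, enable_chunked_context):
    # When the flag is unset, enable offloading only for multimodal
    # models running in chunked-context (chunked prefill) mode.
    if args.mm_embedding_offloading is None:
        return is_multimodal and enable_chunked_context
    return args.mm_embedding_offloading
```

An explicit `--mm_embedding_offloading false` always wins over the automatic default, so the optimization can be switched off even when chunked context is active.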

@katec846 self-assigned this Apr 8, 2025
@katec846 requested a review from symphonylyh April 8, 2025 21:01
@symphonylyh changed the title from "Feat: Offloading Position Table to CPU in Context Chunk Mode" to "Feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode" Apr 8, 2025
@katec846 (Collaborator, Author) commented Apr 8, 2025

/bot run

@tensorrt-cicd

PR_Github #1507 [ run ] triggered by Bot

@symphonylyh changed the title from "Feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode" to "feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode" Apr 8, 2025
@tensorrt-cicd

PR_Github #1507 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1124 completed with status: 'FAILURE'

@katec846 force-pushed the yunhsuanc/offload_ptable_draft branch from b46fde8 to bdaafe4 on April 9, 2025 06:02
@katec846 commented Apr 9, 2025

/bot run

@tensorrt-cicd

PR_Github #1572 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #1572 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1174 completed with status: 'FAILURE'

@katec846 force-pushed the yunhsuanc/offload_ptable_draft branch from bdaafe4 to ec207c7 on April 10, 2025 20:15
@katec846

/bot run

@tensorrt-cicd

PR_Github #1821 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #1821 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1347 completed with status: 'FAILURE'

@katec846 force-pushed the yunhsuanc/offload_ptable_draft branch from ec207c7 to c4d812c on April 11, 2025 02:09
@katec846

/bot run

@tensorrt-cicd

PR_Github #1841 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #1841 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1363 completed with status: 'FAILURE'

@katec846 force-pushed the yunhsuanc/offload_ptable_draft branch from c4d812c to 27b7c17 on April 11, 2025 04:49
@katec846

/bot run

@tensorrt-cicd

PR_Github #1863 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #1863 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1381 completed with status: 'FAILURE'

@katec846 force-pushed the yunhsuanc/offload_ptable_draft branch 2 times, most recently from 4293933 to 7d823ea on April 11, 2025 23:12
@katec846

/bot run

@tensorrt-cicd

PR_Github #1972 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #1972 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1454 completed with status: 'SUCCESS'

@tensorrt-cicd

PR_Github #2790 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1973 completed with status: 'FAILURE'

@katec846 force-pushed the yunhsuanc/offload_ptable_draft branch from 7ac23f9 to 0fe0274 on April 18, 2025 23:08
@katec846

/bot run

@tensorrt-cicd

PR_Github #2800 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #2800 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1978 completed with status: 'FAILURE'

@katec846 force-pushed the yunhsuanc/offload_ptable_draft branch from 0fe0274 to a1a6970 on April 19, 2025 03:40
@katec846

/bot run

@tensorrt-cicd

PR_Github #2815 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #2815 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1989 completed with status: 'FAILURE'

@katec846 force-pushed the yunhsuanc/offload_ptable_draft branch from a1a6970 to 8f017e2 on April 20, 2025 19:28
@katec846

/bot run

@tensorrt-cicd

PR_Github #2871 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #2871 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2027 completed with status: 'SUCCESS'

@symphonylyh

/bot reuse-pipeline

@symphonylyh enabled auto-merge (squash) April 21, 2025 06:19
@tensorrt-cicd

PR_Github #2905 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd

PR_Github #2905 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #2871 for commit 8348f4a

@symphonylyh merged commit eeb605a into NVIDIA:main Apr 21, 2025
3 checks passed
5 participants