
feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend #3387

Merged
merged 5 commits into NVIDIA:main from user/haolu/gtc_min_latency on Apr 21, 2025

Conversation

hlu1
Collaborator

@hlu1 hlu1 commented Apr 8, 2025

TRT-LLM gen cut-off commits:

  • routing kernel : 8339307c73fd8e30eea9ee0d540b2b9af2425708
  • fp8 MOE cubins: 121b8bcbbade35125763e20a4845c79d8a2c7175
  • fp4 MOE cubins: a9f6863d2257f06b6955a29cde4ee4e139e0e129 (most likely)

Added features:

  • A new FusedMOE backend, "TRTLLM", for the fp8 Deepseek recipe and nvfp4, which invokes the trtllm-gen MOE kernels.
  • E2E enablement with the fp8 Deepseek checkpoint. The bmms in MLA are in bf16; the other gemms and the MOE are in fp8.

Remaining tasks:

  • The routing kernel and the cubins currently correspond to different trtllm-gen commits. They need to be updated to top of tree.
  • Change the type of scoreBias from float to bfloat16 in the routing kernel (currently auto scoreBias = float{scoreSigmoid + float{biasVal}};) and verify the accuracy. This must be done before updating the routing kernel to top of tree; see the sketch after this list.
  • Enable TRT-LLM Gen MOE for the Deepseek model in CI.
  • Remove the w1/w3 swap in fp4 FC1 the next time the cubins are updated (already done internally).
  • Read the tile config from the kernel instead of hardcoding it to 128.
  • FP8 e2e with R1 is currently broken; there are runtime errors during warmup.
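
For the scoreBias item above, a minimal sketch of the intended change, assuming the sigmoid score is computed in float and the bias is already bfloat16; the helper name and surrounding structure are made up for illustration, and this is not the actual routing kernel:

```cpp
#include <cuda_bf16.h>

// Hypothetical device helper, not code from the routing kernel.
__device__ __nv_bfloat16 biasedScore(float scoreSigmoid, __nv_bfloat16 biasVal)
{
    // Current behavior: keep the biased score in float.
    float scoreBiasF32 = scoreSigmoid + __bfloat162float(biasVal);

    // Proposed behavior: store scoreBias as bfloat16, so downstream
    // comparisons see the lower-precision value. Accuracy of the routing
    // decisions then needs to be re-verified against the float reference.
    __nv_bfloat16 scoreBias = __float2bfloat16(scoreBiasF32);
    return scoreBias;
}
```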

nvfp4 accuracy results:
MMLU: 86.91
gpqa_diamond (3 runs): 0.6717171717171717, 0.7373737373737373, 0.6919191919191919

@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch 3 times, most recently from 0046575 to 7dc86e1 April 9, 2025 01:53
@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch from 7dc86e1 to b5b4929 April 9, 2025 02:04
@hlu1
Collaborator Author

hlu1 commented Apr 9, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #1523 [ run ] triggered by Bot

@hlu1 hlu1 requested review from jdemouth-nvidia and QiJune April 9, 2025 02:11
@hlu1 hlu1 self-assigned this Apr 9, 2025
@tensorrt-cicd
Collaborator

PR_Github #1523 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1137 completed with status: 'FAILURE'

@hlu1 hlu1 requested a review from zongfeijing April 9, 2025 04:38
@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch from b5b4929 to 2a3e1e4 April 9, 2025 06:13
@QiJune QiJune requested a review from HuiGao-NV April 9, 2025 06:58
Collaborator

@nekorobov nekorobov left a comment

Thanks a lot! Only a few nits. The API of the MoE cubins is subject to change in the future in any case.

@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch 2 times, most recently from 96d515b to 0e399dc April 9, 2025 18:42
@hlu1
Collaborator Author

hlu1 commented Apr 9, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #1639 [ run ] triggered by Bot

@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch from 0e399dc to d36e34a April 9, 2025 18:52
@tensorrt-cicd
Collaborator

PR_Github #1639 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1226 completed with status: 'FAILURE'

@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch from d36e34a to cd4354b April 9, 2025 22:17
@hlu1
Collaborator Author

hlu1 commented Apr 9, 2025

/bot run --disable-fail-fast

@zongfeijing zongfeijing force-pushed the user/haolu/gtc_min_latency branch from 9d7765a to 8cf28e8 April 19, 2025 13:42
@zongfeijing
Collaborator

/bot run --multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #2833 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2833 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2000 completed with status: 'SUCCESS'

@zongfeijing zongfeijing force-pushed the user/haolu/gtc_min_latency branch from 8cf28e8 to 3d928be April 19, 2025 16:59
@zongfeijing
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #2838 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2838 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #2004 completed with status: 'FAILURE'

@zongfeijing
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2844 [ run ] triggered by Bot

@zongfeijing zongfeijing force-pushed the user/haolu/gtc_min_latency branch from 3d928be to a0edb8f April 20, 2025 06:54
@tensorrt-cicd
Collaborator

PR_Github #2844 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2009 completed with status: 'FAILURE'

hlu1 and others added 4 commits April 20, 2025 15:19
fix fused moe rebase bug.

Fix atol in test_fp4_gemm_quantize.py

fix fused moe rebase bug.

Fix FusedMoe.

Disable 2nd routing kernel preexit

Bump routing reduction to fp32

Disable PDL for fc1

[DEBUG] Lift token limit to 16k

[Bugfix] Token limit to 16k + fp32 routing + tanh

Make fp8 tileN 8

Fix FP8 MoE + Remove redundant temp output for FP4

[FP8-only] Avoid wasting CTAs for activation kernel

fix: unblock FP8 weightloading with trtllm-gen

Remove max_token limit for trtllm-gen path

perf: avoid type-conversion and fill_ from aten

Minor fix

Signed-off-by: Hao Lu <[email protected]>
Signed-off-by: Hao Lu <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
@zongfeijing zongfeijing force-pushed the user/haolu/gtc_min_latency branch from a0edb8f to e446901 April 20, 2025 07:19
@zongfeijing
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #2857 [ run ] triggered by Bot

@meowcoder22

Hi again @zongfeijing,

In runner.cu, there is an option called options.mUseCustomLowLatencyImpl = false;

When set to true, it requires an mTileN size of 32, 64, or 128.

However, in gemmList.h, the only kernels available support an mTileN size of 8.

@hlu1, is there any plan to include an appropriate kernel here?

Otherwise, what is the use of this option?
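
To make the mismatch explicit, something along these lines could guard the option on the host side. This is a self-contained sketch, not code from runner.cu; everything except the names mUseCustomLowLatencyImpl, mTileN, and gemmList (taken from the comment above) is made up:

```cpp
#include <vector>

// Stand-ins for the real types in gemmList.h / runner.cu.
struct GemmKernelInfo { int mTileN; };
struct MoeRunnerOptions { bool mUseCustomLowLatencyImpl; };

// The custom low-latency path reportedly requires mTileN of 32, 64, or 128.
inline bool hasCustomLowLatencyTile(std::vector<GemmKernelInfo> const& gemmList)
{
    for (auto const& kernel : gemmList)
    {
        if (kernel.mTileN == 32 || kernel.mTileN == 64 || kernel.mTileN == 128)
        {
            return true;
        }
    }
    return false;
}

inline void validateOptions(MoeRunnerOptions& options, std::vector<GemmKernelInfo> const& gemmList)
{
    if (options.mUseCustomLowLatencyImpl && !hasCustomLowLatencyTile(gemmList))
    {
        // Only mTileN == 8 kernels are shipped at the moment, so fall back
        // instead of selecting a configuration with no matching cubin.
        options.mUseCustomLowLatencyImpl = false;
    }
}
```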

@tensorrt-cicd
Collaborator

PR_Github #2857 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2018 completed with status: 'SUCCESS'

@Kefeng-Duan Kefeng-Duan enabled auto-merge (squash) April 21, 2025 00:22
@zongfeijing
Collaborator

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #2873 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2873 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #2857 for commit 4bd7319

@Kefeng-Duan Kefeng-Duan merged commit 31624b0 into NVIDIA:main Apr 21, 2025
3 checks passed