
feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend #3387

Merged
merged 5 commits into NVIDIA:main from user/haolu/gtc_min_latency on Apr 21, 2025

Conversation

hlu1
Collaborator

@hlu1 hlu1 commented Apr 8, 2025

TRT-LLM gen cut-off commits:

  • routing kernel : 8339307c73fd8e30eea9ee0d540b2b9af2425708
  • fp8 MOE cubins: 121b8bcbbade35125763e20a4845c79d8a2c7175
  • fp4 MOE cubins: a9f6863d2257f06b6955a29cde4ee4e139e0e129 (most likely)

Added features:

  • A new FusedMOE backend, "TRTLLM", for the fp8 Deepseek recipe and nvfp4, which invokes the trtllm-gen MOE kernels.
  • E2E enablement with the fp8 Deepseek checkpoint. The bmms in MLA are in bf16; the other gemms and the MOE are in fp8.

Remaining tasks:

  • The routing kernel and the cubins currently correspond to different trtllm-gen commits. They need to be updated to top of tree.
  • Change the type of scoreBias from float to bfloat16 in the routing kernel (currently auto scoreBias = float{scoreSigmoid + float{biasVal}};) and verify the accuracy. This must be done before updating the routing kernel to top of tree; see the sketch after this list.
  • Enable TRT-LLM Gen MOE for the Deepseek model in CI.
  • Remove the w1/w3 swap in fp4 FC1 the next time the cubins are updated (already done internally).
  • Read the tile config from the kernel instead of hardcoding it to 128.
  • FP8 e2e with R1 is currently broken; there are runtime errors during warmup.
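
For the scoreBias item above, a minimal sketch of the intended change, assuming the sigmoid score is computed in float and the bias is already bfloat16; the helper name and surrounding structure are made up for illustration, and this is not the actual routing kernel:

```cpp
#include <cuda_bf16.h>

// Hypothetical device helper, not code from the routing kernel.
__device__ __nv_bfloat16 biasedScore(float scoreSigmoid, __nv_bfloat16 biasVal)
{
    // Current behavior: keep the biased score in float.
    float scoreBiasF32 = scoreSigmoid + __bfloat162float(biasVal);

    // Proposed behavior: store scoreBias as bfloat16, so downstream
    // comparisons see the lower-precision value. Accuracy of the routing
    // decisions then needs to be re-verified against the float reference.
    __nv_bfloat16 scoreBias = __float2bfloat16(scoreBiasF32);
    return scoreBias;
}
```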

nvfp4 accuracy results:
MMLU: 86.91
gpqa_diamond (3 runs): 0.6717171717171717, 0.7373737373737373, 0.6919191919191919

@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch 3 times, most recently from 0046575 to 7dc86e1 April 9, 2025 01:53
@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch from 7dc86e1 to b5b4929 April 9, 2025 02:04
@hlu1
Collaborator Author

hlu1 commented Apr 9, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #1523 [ run ] triggered by Bot

@hlu1 hlu1 requested review from jdemouth-nvidia and QiJune April 9, 2025 02:11
@hlu1 hlu1 self-assigned this Apr 9, 2025
@tensorrt-cicd
Collaborator

PR_Github #1523 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1137 completed with status: 'FAILURE'

@hlu1 hlu1 requested a review from zongfeijing April 9, 2025 04:38
@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch from b5b4929 to 2a3e1e4 April 9, 2025 06:13
@QiJune QiJune requested a review from HuiGao-NV April 9, 2025 06:58
Collaborator

@nekorobov nekorobov left a comment

Thanks a lot! Only a few nits. The API of the MoE cubins is subject to change in the future in any case.

@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch 2 times, most recently from 96d515b to 0e399dc April 9, 2025 18:42
@hlu1
Collaborator Author

hlu1 commented Apr 9, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #1639 [ run ] triggered by Bot

@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch from 0e399dc to d36e34a April 9, 2025 18:52
@tensorrt-cicd
Collaborator

PR_Github #1639 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1226 completed with status: 'FAILURE'

@hlu1 hlu1 force-pushed the user/haolu/gtc_min_latency branch from d36e34a to cd4354b April 9, 2025 22:17
@hlu1
Collaborator Author

hlu1 commented Apr 9, 2025

/bot run --disable-fail-fast

@zongfeijing zongfeijing force-pushed the user/haolu/gtc_min_latency branch from 9d7765a to 8cf28e8 April 19, 2025 13:42
@zongfeijing
Collaborator

/bot run --multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #2833 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2833 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2000 completed with status: 'SUCCESS'

@zongfeijing zongfeijing force-pushed the user/haolu/gtc_min_latency branch from 8cf28e8 to 3d928be April 19, 2025 16:59
@zongfeijing
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #2838 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2838 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #2004 completed with status: 'FAILURE'

@zongfeijing
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2844 [ run ] triggered by Bot

@zongfeijing zongfeijing force-pushed the user/haolu/gtc_min_latency branch from 3d928be to a0edb8f April 20, 2025 06:54
@tensorrt-cicd
Collaborator

PR_Github #2844 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2009 completed with status: 'FAILURE'

hlu1 and others added 4 commits April 20, 2025 15:19
fix fused moe rebase bug.

Fix atol in test_fp4_gemm_quantize.py

fix fused moe rebase bug.

Fix FusedMoe.

Disable 2nd routing kernel preexit

Bump routing reduction to fp32

Disable PDL for fc1

[DEBUG] Lift token limit to 16k

[Bugfix] Token limit to 16k + fp32 routing + tanh

Make fp8 tileN 8

Fix FP8 MoE + Remove redundant temp output for FP4

[FP8-only] Avoid wasting CTAs for activation kernel

fix: unblock FP8 weightloading with trtllm-gen

Remove max_token limit for trtllm-gen path

perf: avoid type-conversion and fill_ from aten

Minor fix

Signed-off-by: Hao Lu <[email protected]>
Signed-off-by: Hao Lu <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
@zongfeijing zongfeijing force-pushed the user/haolu/gtc_min_latency branch from a0edb8f to e446901 April 20, 2025 07:19
@zongfeijing
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #2857 [ run ] triggered by Bot

@meowcoder22

Hi again @zongfeijing,

In runner.cu, there is an option called options.mUseCustomLowLatencyImpl = false;

When set to true, it requires an mTileN size of 32, 64, or 128.

However, in gemmList.h, the only kernels available support an mTileN size of 8.

@hlu1, is there any plan to include an appropriate kernel here?

Otherwise, what is the use of this option?
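
To make the mismatch explicit, something along these lines could guard the option on the host side. This is a self-contained sketch, not code from runner.cu; everything except the names mUseCustomLowLatencyImpl, mTileN, and gemmList (taken from the comment above) is made up:

```cpp
#include <vector>

// Stand-ins for the real types in gemmList.h / runner.cu.
struct GemmKernelInfo { int mTileN; };
struct MoeRunnerOptions { bool mUseCustomLowLatencyImpl; };

// The custom low-latency path reportedly requires mTileN of 32, 64, or 128.
inline bool hasCustomLowLatencyTile(std::vector<GemmKernelInfo> const& gemmList)
{
    for (auto const& kernel : gemmList)
    {
        if (kernel.mTileN == 32 || kernel.mTileN == 64 || kernel.mTileN == 128)
        {
            return true;
        }
    }
    return false;
}

inline void validateOptions(MoeRunnerOptions& options, std::vector<GemmKernelInfo> const& gemmList)
{
    if (options.mUseCustomLowLatencyImpl && !hasCustomLowLatencyTile(gemmList))
    {
        // Only mTileN == 8 kernels are shipped at the moment, so fall back
        // instead of selecting a configuration with no matching cubin.
        options.mUseCustomLowLatencyImpl = false;
    }
}
```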

@tensorrt-cicd
Collaborator

PR_Github #2857 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2018 completed with status: 'SUCCESS'

@Kefeng-Duan Kefeng-Duan enabled auto-merge (squash) April 21, 2025 00:22
@zongfeijing
Collaborator

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #2873 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2873 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #2857 for commit 4bd7319

@Kefeng-Duan Kefeng-Duan merged commit 31624b0 into NVIDIA:main Apr 21, 2025
3 checks passed