feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend #3387
Conversation
Force-pushed from 0046575 to 7dc86e1.
Force-pushed from 7dc86e1 to b5b4929.
/bot run --disable-fail-fast
PR_Github #1523 [ run ] triggered by Bot
PR_Github #1523 [ run ] completed with state
Force-pushed from b5b4929 to 2a3e1e4.
Thanks a lot! Only a few nits. The API of the MoE cubins is subject to change in the future in any case.
Resolved review threads:
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (4 threads, outdated)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/trtllmGenSrc/DevKernel.cu
- ...ensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/trtllmGenSrc/MixtureOfExpertsInterface.cu (outdated)
Force-pushed from 96d515b to 0e399dc.
/bot run --disable-fail-fast
PR_Github #1639 [ run ] triggered by Bot
Force-pushed from 0e399dc to d36e34a.
PR_Github #1639 [ run ] completed with state
Force-pushed from d36e34a to cd4354b.
Resolved review thread:
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmList.h (outdated)
/bot run --disable-fail-fast
Force-pushed from 9d7765a to 8cf28e8.
/bot run --multi-gpu-test
PR_Github #2833 [ run ] triggered by Bot
PR_Github #2833 [ run ] completed with state
Force-pushed from 8cf28e8 to 3d928be.
/bot run --disable-fail-fast
PR_Github #2838 [ run ] triggered by Bot
PR_Github #2838 [ run ] completed with state
/bot run
PR_Github #2844 [ run ] triggered by Bot
Force-pushed from 3d928be to a0edb8f.
PR_Github #2844 [ run ] completed with state
Squashed commits:
- fix fused moe rebase bug
- Fix atol in test_fp4_gemm_quantize.py
- fix fused moe rebase bug
- Fix FusedMoe
- Disable 2nd routing kernel preexit
- Bump routing reduction to fp32
- Disable PDL for fc1
- [DEBUG] Lift token limit to 16k
- [Bugfix] Token limit to 16k + fp32 routing + tanh
- Make fp8 tileN 8
- Fix FP8 MoE + Remove redundant temp output for FP4
- [FP8-only] Avoid wasting CTAs for activation kernel
- fix: unblock FP8 weightloading with trtllm-gen
- Remove max_token limit for trtllm-gen path
- perf: avoid type-conversion and fill_ from aten
- Minor fix

Signed-off-by: Hao Lu <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
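One of the squashed commits above ("perf: avoid type-conversion and fill_ from aten") points at a common pattern on the PyTorch C++ side. The sketch below is not the PR's actual code; the function and parameter names are assumptions. It only illustrates the general idea of allocating an output directly in its final dtype instead of filling and converting afterwards.

```cpp
#include <torch/torch.h>

// Hypothetical helper illustrating the "avoid type-conversion and fill_" idea.
// An uninitialized allocation in the target dtype skips two extra kernels
// (a fill and a dtype cast) when every element is overwritten later anyway.
torch::Tensor makeMoeOutput(const torch::Tensor& input, int64_t numTokens, int64_t hiddenSize)
{
    // Slower pattern: allocate and zero-fill in the input dtype, then convert:
    //   auto out = torch::zeros({numTokens, hiddenSize}, input.options());
    //   return out.to(torch::kBFloat16);
    // Cheaper: one uninitialized allocation directly in the final dtype.
    return torch::empty({numTokens, hiddenSize}, input.options().dtype(torch::kBFloat16));
}
```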
Force-pushed from a0edb8f to e446901.
/bot run --disable-fail-fast
PR_Github #2857 [ run ] triggered by Bot
Hi again @zongfeijing. In runner.cu there is an option, options.mUseCustomLowLatencyImpl = false; when set to true, it requires an mTileN size of 32, 64, or 128. However, gemmList.h only contains kernels that support an mTileN size of 8. @hlu1, is there any plan to include an appropriate kernel here? Otherwise, what is the use of this option?
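To make the mismatch concrete, here is a minimal sketch of the kind of guard the comment is asking about. The struct and function names are hypothetical, not the actual runner.cu API; it only shows that enabling the custom low-latency path should fail fast when gemmList.h has no kernel with a compatible mTileN.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical mirror of a gemmList.h entry; only the tile size matters here.
struct GemmKernelInfo
{
    int tileN;
};

// Reject mUseCustomLowLatencyImpl=true unless some kernel supports the
// required tileN of 32, 64, or 128 (the current list only has tileN=8).
void validateLowLatencyConfig(bool useCustomLowLatencyImpl, std::vector<GemmKernelInfo> const& gemmList)
{
    if (!useCustomLowLatencyImpl)
    {
        return;
    }
    for (auto const& kernel : gemmList)
    {
        if (kernel.tileN == 32 || kernel.tileN == 64 || kernel.tileN == 128)
        {
            return;
        }
    }
    throw std::runtime_error("mUseCustomLowLatencyImpl requires a kernel with tileN 32, 64, or 128");
}
```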
PR_Github #2857 [ run ] completed with state
/bot reuse-pipeline
PR_Github #2873 [ reuse-pipeline ] triggered by Bot
PR_Github #2873 [ reuse-pipeline ] completed with state
TRT-LLM gen cut-off commits:
Added features:
Remaining tasks:
- Fix `auto scoreBias = float{scoreSigmoid + float{biasVal}};` in the routing kernel; this must be done before updating the routing kernel to top of tree (see the sketch at the end of this section).
- Remove the w1, w3 swap in fp4FC1 when the cubins are updated next time (done internally already).

nvfp4 accuracy results:
- MMLU: 86.91
- gpqa_diamond (3 runs): 0.6717171717171717, 0.7373737373737373, 0.6919191919191919
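For the scoreBias task above, one plausible reading (given the "Bump routing reduction to fp32" commit) is that both operands should be converted to fp32 before the addition, rather than adding in a lower-precision type and casting the result. A hedged CUDA sketch, with assumed bf16 operand types and a hypothetical function name:

```cpp
#include <cuda_bf16.h>

// Assumed types: if scoreSigmoid and biasVal are bf16, converting each operand
// to float before the add keeps the bias addition in fp32 instead of rounding
// the sum through a lower-precision type first.
__device__ float computeScoreBias(__nv_bfloat16 scoreSigmoid, __nv_bfloat16 biasVal)
{
    return __bfloat162float(scoreSigmoid) + __bfloat162float(biasVal);
}
```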