
feat: Add support for FP8 MLA on Hopper and Blackwell. #3190


Merged: 14 commits merged into NVIDIA:main on Apr 7, 2025

Conversation

@bobboli bobboli (Collaborator) commented Apr 1, 2025

This PR adds FP8 MLA support on Hopper and Blackwell.

  • Recipe: per-tensor FP8 e4m3 quantization for Q and the latent KV cache, with the MLA output in BF16. No calibration is performed yet; the quantization scales are simply set to 1 (see the sketch after this list).
  • Add code for Q and KV-cache quantization.
  • trtllm-gen based FP8 MLA kernel support for Blackwell.
  • FP8 FlashMLA kernel support for Hopper; this is now the default option on Hopper.
  • (Draft) FMHA-based FP8 MLA kernel for Hopper, with accuracy issues still to be investigated. This PR also includes some refactoring of the FMHA kernel management code.
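
A minimal sketch of the quantization recipe above, assuming PyTorch; the helper name and shapes are illustrative and not part of the actual TensorRT-LLM code:

```python
import torch

def quantize_per_tensor_fp8_e4m3(x: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Per-tensor FP8 e4m3 quantization; the scale is fixed to 1 (no calibration yet)."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Divide by the per-tensor scale, clamp to the e4m3 representable range, then cast.
    return (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)

# Q and the latent KV cache are quantized to FP8; the MLA output stays in BF16.
q_fp8 = quantize_per_tensor_fp8_e4m3(torch.randn(8, 128, dtype=torch.bfloat16))
kv_fp8 = quantize_per_tensor_fp8_e4m3(torch.randn(8, 576, dtype=torch.bfloat16))
```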

DylanChen-NV and others added 5 commits April 1, 2025 06:55
Use BF16 for context MLA.
mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together.

Allow mSM==90 for mFP8GenerationMLA==true.
For FMHA, dataTypeKv should be FP8.

For FP8 MLA generation, the output is still in BF16.

Refine debug info for FMHA kernel metadata.

Use inputType, outputType, and SM together to hash the kernel list (a sketch of this keying appears after these commit messages).

Add FP8 MLA generation FMHA kernel.

Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel.

Separate the implementation of fused_multihead_attention_v2.h into a .cpp file and print some debug info if checkIfKernelExist fails.

Refine debug info in fused_multihead_attention_v2.cpp

Correct FP8 MLA metadata.

New kernel provided by Yuxin, which outputs BF16.

The smem size was not set correctly, which led to illegal memory access.

Yuxin fixed the error in the FMHA MLA kernel: previously the BF16 output wasn't written correctly; some parts were written repeatedly while others were left untouched.

There are two bmm1 scales that should be set correctly.

New kernel generated by Yuxin.

Modifications to common/attentionOp for FP8 MLA on Hopper using FMHA.

Not necessary: if mFP8GenerationMLA is set, is_fp8_out is false, so mFP8ContextFMHA is false.

Skip a check in fmhaDispatcher.

Modifications in fmhaRunner:
- Debug dump.
- Guard a lot of flag setting with if (!isFP8GenerationMLA), so it is skipped for FP8 generation MLA.
- TMA descriptor modification for qo (by Yuxin).

Cleanup debug output.

Clean up o tma descriptor modifications.

Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
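
The "hash kernel list" refactor above can be pictured with a short sketch; the names and structure here are illustrative assumptions, not the actual fused_multihead_attention_v2 code:

```python
from typing import NamedTuple

class KernelKey(NamedTuple):
    input_type: str    # e.g. "e4m3"
    output_type: str   # e.g. "bf16"; FP8 MLA generation still outputs BF16
    sm: int            # e.g. 90 (Hopper) or 100 (Blackwell)

# Keying the lookup by the full (input, output, SM) tuple keeps an FP8-in/BF16-out
# MLA kernel distinct from an FP8-in/FP8-out kernel on the same architecture.
kernel_list = {
    KernelKey("e4m3", "bf16", 90): "hypothetical_fmha_fp8_mla_gen_sm90_kernel",
    KernelKey("e4m3", "bf16", 100): "hypothetical_trtllm_gen_fp8_mla_sm100_kernel",
}

def check_if_kernel_exists(key: KernelKey) -> bool:
    if key not in kernel_list:
        # Print some debug info when the lookup fails, mirroring the commit above.
        print(f"No kernel for input={key.input_type}, output={key.output_type}, sm={key.sm}")
        return False
    return True
```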
@bobboli bobboli (Collaborator, Author) commented Apr 1, 2025

/bot run

@bobboli bobboli requested review from Tracin and DylanChen-NV April 1, 2025 07:01
@tensorrt-cicd
PR_Github #869 [ run ] triggered by Bot

@tensorrt-cicd
PR_Github #869 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #693 completed with status: 'FAILURE'

@@ -531,7 +531,9 @@ def forward_context(
 # Concat q (including q_pe), k + k_pe, v together as input_qkv
 input_qkv = torch.cat([q, k, v], dim=-1)

-out_scale = getattr(self.o_proj, "inv_input_scale", None)
+# out_scale = getattr(self.o_proj, "inv_input_scale", None)
+out_scale = None  # Currently we use BF16 MHA for context phase
@bobboli bobboli (Collaborator, Author) Apr 1, 2025

If out_scale is not None, attentionOp will assume that the attention output type is FP8. Currently we want to keep context MLA, as well as the output of generation MLA, in BF16.
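
To illustrate the convention described in this comment, here is a minimal sketch (a hypothetical helper, not the actual attentionOp interface) of how a non-None out_scale selects the FP8 output path while None keeps BF16:

```python
import torch

def select_attention_output_dtype(out_scale: torch.Tensor | None) -> torch.dtype:
    # Convention from the comment above: a non-None out_scale tells the attention
    # op to quantize its output to FP8; None keeps the output in BF16.
    return torch.float8_e4m3fn if out_scale is not None else torch.bfloat16

# Context MLA (and the output of generation MLA) stay in BF16 for now:
assert select_attention_output_dtype(None) == torch.bfloat16
```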

@bobboli bobboli (Collaborator, Author) commented Apr 1, 2025

/bot run

@tensorrt-cicd
PR_Github #889 [ run ] triggered by Bot

@tensorrt-cicd
PR_Github #889 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #703 completed with status: 'FAILURE'

Signed-off-by: Bo Li <[email protected]>
@bobboli bobboli (Collaborator, Author) commented Apr 1, 2025

/bot run

@tensorrt-cicd
PR_Github #912 [ run ] triggered by Bot

@tensorrt-cicd
PR_Github #912 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #720 completed with status: 'FAILURE'

@bobboli bobboli (Collaborator, Author) commented Apr 2, 2025

/bot run

@tensorrt-cicd
PR_Github #956 [ run ] triggered by Bot

@tensorrt-cicd
PR_Github #956 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #747 completed with status: 'FAILURE'

@bobboli bobboli (Collaborator, Author) commented Apr 2, 2025

/bot run

@tensorrt-cicd
PR_Github #957 [ run ] triggered by Bot

@tensorrt-cicd
PR_Github #957 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #748 completed with status: 'FAILURE'

@bobboli bobboli (Collaborator, Author) commented Apr 3, 2025

/bot run

@tensorrt-cicd
PR_Github #1133 [ run ] triggered by Bot

@tensorrt-cicd
PR_Github #1133 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #853 completed with status: 'FAILURE'

@bobboli bobboli (Collaborator, Author) commented Apr 6, 2025

/bot run

@tensorrt-cicd
PR_Github #1230 [ run ] triggered by Bot

@tensorrt-cicd
PR_Github #1230 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #923 completed with status: 'FAILURE'

@bobboli bobboli (Collaborator, Author) commented Apr 7, 2025

/bot run

@tensorrt-cicd
PR_Github #1248 [ run ] triggered by Bot

@tensorrt-cicd
PR_Github #1248 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #940 completed with status: 'SUCCESS'

@QiJune QiJune (Collaborator) commented Apr 7, 2025

/bot reuse-pipeline

@QiJune QiJune enabled auto-merge (squash) April 7, 2025 07:00
@tensorrt-cicd
PR_Github #1281 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
PR_Github #1281 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #1248 for commit efda97d

@QiJune QiJune merged commit 515dd0d into NVIDIA:main Apr 7, 2025
2 checks passed
sarattha pushed a commit to sarattha/TensorRT-LLM that referenced this pull request Apr 9, 2025
* fp8 kv + bf16 ctx MLA + fp8 gen MLA

(Squashed commit message; identical to the commit descriptions listed above.)

Signed-off-by: Bo Li <[email protected]>

* Resolve conflicts.

Signed-off-by: Bo Li <[email protected]>

* Apply the patch of FP8 FlashMLA and resolve conflicts.

Signed-off-by: Bo Li <[email protected]>

* Fix compilation error.

Signed-off-by: Bo Li <[email protected]>

* Fix compile error.

Signed-off-by: Bo Li <[email protected]>

* pick blackwell support

Signed-off-by: Dylan Chen <[email protected]>

* Add copyright notice to fused_multihead_attention_v2.cpp.

Signed-off-by: Bo Li <[email protected]>

* Add license.

Signed-off-by: Bo Li <[email protected]>

* Add missing license.

Signed-off-by: Bo Li <[email protected]>

* Exclude building flashMLA kernels under sm90.

Signed-off-by: Bo Li <[email protected]>

* Revert "Exclude building flashMLA kernels under sm90."

    This reverts commit f0c859d.

Signed-off-by: Bo Li <[email protected]>

* Use macro to skip compiling FlashMLA for non sm90 targets.

Signed-off-by: Bo Li <[email protected]>

---------

Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Dylan Chen <[email protected]>
Co-authored-by: Dylan Chen <[email protected]>
Co-authored-by: Dylan Chen <[email protected]>
Co-authored-by: QI JUN <[email protected]>
Signed-off-by: sarattha <[email protected]>
tomeras91 pushed a commit to tomeras91/TensorRT-LLM that referenced this pull request Apr 9, 2025
(Same squashed commit message and commit list as above.)
tomeras91 pushed a commit to tomeras91/TensorRT-LLM that referenced this pull request Apr 9, 2025
(Same squashed commit message and commit list as above.)
Labels: None yet
Projects: None yet
5 participants