fix: fix cublas_scaled_mm #3600

Merged
merged 1 commit into from
Apr 21, 2025

Conversation

dc3671
Collaborator

@dc3671 dc3671 commented Apr 16, 2025

In PyTorch 2.7, the cuBLAS workspace size on Hopper was changed from 32MB to 1MB:

-   auto stream = c10::cuda::getCurrentCUDAStream();
-   size_t workspaceSize = 0;
-   auto workspace_ptr = _getWorkspace(workspaceSize);
+   size_t workspaceSize = _getWorkspaceSize();
+   auto workspace = at::empty(static_cast<int64_t>(workspaceSize), at::TensorOptions().dtype(at::kByte).device(at::kCUDA));
const size_t default_size = sm90 ? 4096 * 8 * 1024 : 4096 * 1024 * 2 + 16 * 1024 * 8;

This leads cublasLt to choose different algorithms for some GEMM shapes (most notably when one side picks split-K and the other does not), so we need to align the workspace size in our tests.
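
For reference, the sizes in the quoted `default_size` expression can be worked out directly. The snippet below is only a sketch that mirrors the C++ ternary in Python; the helper name `default_workspace_bytes` is made up for illustration and is not a PyTorch API.

```python
KIB = 1024
MIB = 1024 * KIB


def default_workspace_bytes(sm90: bool) -> int:
    """Mirror of the quoted C++ ternary (illustrative helper, not a PyTorch API)."""
    return 4096 * 8 * 1024 if sm90 else 4096 * 1024 * 2 + 16 * 1024 * 8


hopper = default_workspace_bytes(sm90=True)
other = default_workspace_bytes(sm90=False)

print(hopper, hopper / MIB)  # 33554432 32.0
print(other, other / MIB)    # 8519680 8.125
```

So the `sm90` branch is exactly 32 MiB, and the non-Hopper branch is 8 MiB plus 128 KiB.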

In the E2E test run, other tests also call torch._scaled_mm, and the workspace size variable is static in the C++ code, so it is fixed the first time any test reaches it. A modification made only in test_scaled_mm therefore won't take effect, and we need to add it in test_linear_fp8.py as well.
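
The ordering problem described above can be reproduced in plain Python. This is a minimal sketch under stated assumptions: `functools.lru_cache` stands in for a C++ function-local static, and `DEMO_WORKSPACE_SIZE_KIB`, `get_workspace_size`, `first_test`, and `second_test` are hypothetical names invented for the example, not the real setting or test functions touched by this PR.

```python
import os
from functools import lru_cache

# Hypothetical variable name, used only for this illustration.
ENV_NAME = "DEMO_WORKSPACE_SIZE_KIB"


@lru_cache(maxsize=None)
def get_workspace_size() -> int:
    """Read the setting once and cache it, like a function-local static in C++."""
    return int(os.environ.get(ENV_NAME, "1024")) * 1024


def first_test() -> int:
    # Runs first and triggers the one-time read with the default value.
    return get_workspace_size()


def second_test() -> int:
    # Setting the variable here is too late: the value is already cached.
    os.environ[ENV_NAME] = "32768"
    return get_workspace_size()


os.environ.pop(ENV_NAME, None)  # start from a clean environment
size_a = first_test()
size_b = second_test()
print(size_a, size_b)  # both values are the 1 MiB default
```

Whichever test runs first in the process decides the value, which is why the setting has to be applied in every test file that can reach the cached read first.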

@dc3671
Collaborator Author

dc3671 commented Apr 16, 2025

/bot run --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2412 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2412 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1733 (Partly Tested) completed with status: 'FAILURE'

@dc3671
Collaborator Author

dc3671 commented Apr 16, 2025

/bot run --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2420 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2420 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1738 (Partly Tested) completed with status: 'FAILURE'

@dc3671 dc3671 force-pushed the fix-cublas-scaled-mm branch from 865c7b3 to 1f2dc49 Compare April 16, 2025 06:32
@dc3671
Collaborator Author

dc3671 commented Apr 16, 2025

/bot run --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2430 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2430 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1746 (Partly Tested) completed with status: 'FAILURE'

@dc3671 dc3671 force-pushed the fix-cublas-scaled-mm branch from 1f2dc49 to fa2f21c Compare April 16, 2025 08:56
@dc3671
Collaborator Author

dc3671 commented Apr 16, 2025

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2451 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #2451 [ run ] completed with state FAILURE

@yiqingy0
Collaborator

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2453 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6 [ run ] completed with state ABORTED

@dc3671
Collaborator Author

dc3671 commented Apr 16, 2025

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2459 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2453 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #2459 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1763 (Partly Tested) completed with status: 'FAILURE'

@dc3671 dc3671 force-pushed the fix-cublas-scaled-mm branch from fa2f21c to 37b5b67 Compare April 17, 2025 06:29
@dc3671
Collaborator Author

dc3671 commented Apr 17, 2025

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2595 [ run ] triggered by Bot

@dc3671 dc3671 force-pushed the fix-cublas-scaled-mm branch from 37b5b67 to 1da6010 Compare April 17, 2025 08:09
@dc3671
Collaborator Author

dc3671 commented Apr 17, 2025

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2595 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1852 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #2615 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2615 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1862 (Partly Tested) completed with status: 'SUCCESS'

@dc3671 dc3671 force-pushed the fix-cublas-scaled-mm branch from 1da6010 to 567aee8 Compare April 18, 2025 01:39
@dc3671
Collaborator Author

dc3671 commented Apr 18, 2025

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #2703 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2703 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1928 (Partly Tested) completed with status: 'FAILURE'

@dc3671 dc3671 force-pushed the fix-cublas-scaled-mm branch from 567aee8 to e0922aa Compare April 19, 2025 05:10
@dc3671 dc3671 marked this pull request as ready for review April 19, 2025 05:11
@dc3671
Collaborator Author

dc3671 commented Apr 19, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #2821 [ run ] triggered by Bot

@dc3671 dc3671 requested review from hlu1 and QiJune April 19, 2025 05:23
@tensorrt-cicd
Collaborator

PR_Github #2821 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1992 completed with status: 'SUCCESS'

Signed-off-by: Zhenhuan Chen <[email protected]>
@dc3671 dc3671 force-pushed the fix-cublas-scaled-mm branch from e0922aa to 35452ee Compare April 21, 2025 06:37
@dc3671
Collaborator Author

dc3671 commented Apr 21, 2025

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #2910 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2910 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #2821 for commit 35452ee

@dc3671 dc3671 merged commit 2672f13 into NVIDIA:main Apr 21, 2025
3 checks passed
4 participants