fix: fix cublas_scaled_mm #3600

dc3671 · 2025-04-16T04:41:05Z

In PyTorch 2.7, the cuBLAS workspace used on Hopper was changed from 32 MB to 1 MB:

Use _getWorkspaceSize() rather than _getWorkspace() pytorch/pytorch@203a27e#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeL1624-R1599

-   auto stream = c10::cuda::getCurrentCUDAStream();
-   size_t workspaceSize = 0;
-   auto workspace_ptr = _getWorkspace(workspaceSize);
+   size_t workspaceSize = _getWorkspaceSize();
+   auto workspace = at::empty(static_cast<int64_t>(workspaceSize), at::TensorOptions().dtype(at::kByte).device(at::kCUDA));

By contrast, _getWorkspace() uses cuBLAS's own workspace size, which is 32 MB on Hopper: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CublasHandlePool.cpp#L133

const size_t default_size = sm90 ? 4096 * 8 * 1024 : 4096 * 1024 * 2 + 16 * 1024 * 8;
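As a quick check on the expression above, the two branches work out to 32 MiB on sm90 and a little over 8 MiB otherwise. The helper name below is hypothetical and only mirrors the arithmetic; it does not call into PyTorch or cuBLAS.

```python
MIB = 1024 * 1024


def default_cublas_workspace_bytes(sm90: bool) -> int:
    """Mirror the quoted default_size expression (illustrative helper only)."""
    if sm90:
        return 4096 * 8 * 1024
    return 4096 * 1024 * 2 + 16 * 1024 * 8


hopper_bytes = default_cublas_workspace_bytes(sm90=True)
other_bytes = default_cublas_workspace_bytes(sm90=False)

# Print raw byte counts alongside the size in MiB.
print(hopper_bytes, hopper_bytes / MIB)
print(other_bytes, other_bytes / MIB)
```

On Hopper this is 32 times larger than the 1 MB workspace described at the top of this description.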

This leads cublasLt to choose different algorithms on some GEMM shapes (especially when one path chooses split-K and the other does not), so we need to align the workspace size in our tests.

In the E2E test run, other tests also call torch._scaled_mm, and the workspace size is a static variable in the C++ code, so it is fixed once it has been initialized. A modification made only in test_scaled_mm therefore won't take effect, and we need to add it in test_linear_fp8.py as well.
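One way such an alignment could look is sketched below. It assumes that PyTorch reads the CUBLASLT_WORKSPACE_SIZE environment variable (believed to be in KiB) the first time the workspace size is needed and caches it afterwards. The helper name and the 32 MiB default are illustrative, and this is not necessarily the exact change made in this PR.

```python
import os

# Assumption: PyTorch consults this variable once and caches the result in a
# static variable, so it must be set before the first torch._scaled_mm call.
WORKSPACE_ENV = "CUBLASLT_WORKSPACE_SIZE"


def pin_cublaslt_workspace(size_kib: int = 32 * 1024) -> str:
    """Set the workspace size (assumed KiB) and return the string that was set."""
    value = str(size_kib)
    os.environ[WORKSPACE_ENV] = value
    return value


# Hypothetical usage: call at import time in every test module that might be
# the first to reach torch._scaled_mm in the process.
pinned = pin_cublaslt_workspace()
```

Because the value is cached per process, the call would have to appear in each test module that can run first, which matches the reason given above for touching both test_scaled_mm and test_linear_fp8.py.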

dc3671 · 2025-04-16T04:42:09Z

/bot run --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-16T04:47:42Z

PR_Github #2412 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T05:02:08Z

PR_Github #2412 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1733 (Partly Tested) completed with status: 'FAILURE'

dc3671 · 2025-04-16T05:08:40Z

/bot run --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-16T05:14:40Z

PR_Github #2420 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T06:28:01Z

PR_Github #2420 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1738 (Partly Tested) completed with status: 'FAILURE'

dc3671 · 2025-04-16T06:32:57Z

/bot run --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-16T06:38:31Z

PR_Github #2430 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T08:02:29Z

PR_Github #2430 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1746 (Partly Tested) completed with status: 'FAILURE'

dc3671 · 2025-04-16T08:57:00Z

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-16T09:02:41Z

PR_Github #2451 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T09:02:57Z

PR_Github #3 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T09:06:55Z

PR_Github #3 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-04-16T09:07:23Z

PR_Github #2451 [ run ] completed with state FAILURE

yiqingy0 · 2025-04-16T09:12:50Z

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-16T09:18:26Z

PR_Github #2453 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T09:18:36Z

PR_Github #6 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T09:22:36Z

PR_Github #6 [ run ] completed with state ABORTED

dc3671 · 2025-04-16T10:08:26Z

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-16T10:14:00Z

PR_Github #2459 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T10:14:37Z

PR_Github #2453 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-04-16T12:16:37Z

PR_Github #2459 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1763 (Partly Tested) completed with status: 'FAILURE'

dc3671 · 2025-04-17T06:29:31Z

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-17T06:35:28Z

PR_Github #2595 [ run ] triggered by Bot

dc3671 · 2025-04-17T08:10:04Z

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-17T08:14:17Z

PR_Github #2595 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1852 (Partly Tested) completed with status: 'FAILURE'

tensorrt-cicd · 2025-04-17T08:15:41Z

PR_Github #2615 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-17T11:08:33Z

PR_Github #2615 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1862 (Partly Tested) completed with status: 'SUCCESS'

dc3671 · 2025-04-18T01:40:13Z

/bot run --disable-fail-fast --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-04-18T01:45:53Z

PR_Github #2703 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-18T03:59:21Z

PR_Github #2703 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1928 (Partly Tested) completed with status: 'FAILURE'

dc3671 · 2025-04-19T05:11:36Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-04-19T05:17:12Z

PR_Github #2821 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-19T07:19:29Z

PR_Github #2821 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1992 completed with status: 'SUCCESS'

tests/unittest/_torch/thop/test_scaled_mm.py

Signed-off-by: Zhenhuan Chen <[email protected]>

dc3671 · 2025-04-21T06:39:13Z

/bot reuse-pipeline

tensorrt-cicd · 2025-04-21T06:44:54Z

PR_Github #2910 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-04-21T06:50:10Z

PR_Github #2910 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #2821 for commit 35452ee

dc3671 force-pushed the fix-cublas-scaled-mm branch from 865c7b3 to 1f2dc49 on April 16, 2025 06:32

dc3671 force-pushed the fix-cublas-scaled-mm branch from 1f2dc49 to fa2f21c on April 16, 2025 08:56

dc3671 force-pushed the fix-cublas-scaled-mm branch from fa2f21c to 37b5b67 on April 17, 2025 06:29

dc3671 force-pushed the fix-cublas-scaled-mm branch from 37b5b67 to 1da6010 on April 17, 2025 08:09

dc3671 force-pushed the fix-cublas-scaled-mm branch from 1da6010 to 567aee8 on April 18, 2025 01:39

dc3671 force-pushed the fix-cublas-scaled-mm branch from 567aee8 to e0922aa on April 19, 2025 05:10

dc3671 marked this pull request as ready for review April 19, 2025 05:11

dc3671 requested review from hlu1 and QiJune April 19, 2025 05:23

hlu1 reviewed Apr 19, 2025

tests/unittest/_torch/thop/test_scaled_mm.py (outdated, resolved)

hlu1 approved these changes Apr 21, 2025

fix: fix cublas_scaled_mm

35452ee

Signed-off-by: Zhenhuan Chen <[email protected]>

dc3671 force-pushed the fix-cublas-scaled-mm branch from e0922aa to 35452ee on April 21, 2025 06:37

dc3671 merged commit 2672f13 into NVIDIA:main Apr 21, 2025
3 checks passed
