Matmul performance on A100 #6523

Open
QuqqU opened this issue Apr 17, 2025 · 0 comments

QuqqU commented Apr 17, 2025

Describe the issue

Hello, I’m currently working on writing kernels for the A100.

I ran the 03-matrix-multiplication.py tutorial on an A100, but the performance was lower than I expected: the Triton kernel trails cuBLAS at most problem sizes in the fp16 benchmark below (e.g., ~196 vs. ~223 TFLOP/s at M = N = K = 4096).

I'd like to know whether this is the expected performance, and I'd appreciate any suggestions for improving matmul performance specifically on the A100. Thanks!
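For context, the tutorial selects its tiling via @triton.autotune. Below is a minimal sketch of the kind of A100-oriented candidates one could add to the tutorial's config list; the block sizes, num_stages, and num_warps here are guesses to widen the search, not validated winners:

```python
import triton

# Hypothetical extra autotune candidates for A100 (sm80): larger K tiles and
# deeper software pipelining (num_stages) are worth searching over. None of
# these values are validated; they only widen the autotuner's search space.
a100_configs = [
    triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8},
                  num_stages=3, num_warps=8),
    triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8},
                  num_stages=4, num_warps=8),
    triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8},
                  num_stages=4, num_warps=4),
]

# These would extend the list passed to the tutorial's decorator:
# @triton.autotune(configs=[...] + a100_configs, key=['M', 'N', 'K'])
```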

triton_output_with_fp16_inputs=tensor([[-10.9531,  -4.7109,  15.6953,  ..., -28.4062,   4.3320, -26.4219],
        [ 26.8438,  10.0469,  -5.4297,  ..., -11.2969,  -8.5312,  30.7500],
        [-13.2578,  15.8516,  18.0781,  ..., -21.7656,  -8.6406,  10.2031],
        ...,
        [ 40.2812,  18.6094, -25.6094,  ...,  -2.7598,  -3.2441,  41.0000],
        [ -6.1211, -16.8281,   4.4844,  ..., -21.0312,  24.7031,  15.0234],
        [-17.0938, -19.0000,  -0.3831,  ...,  21.5469, -30.2344, -13.2188]],
       device='cuda:0', dtype=torch.float16)
torch_output_with_fp16_inputs=tensor([[-10.9531,  -4.7109,  15.6953,  ..., -28.4062,   4.3320, -26.4219],
        [ 26.8438,  10.0469,  -5.4297,  ..., -11.2969,  -8.5312,  30.7500],
        [-13.2578,  15.8516,  18.0781,  ..., -21.7656,  -8.6406,  10.2031],
        ...,
        [ 40.2812,  18.6094, -25.6094,  ...,  -2.7598,  -3.2441,  41.0000],
        [ -6.1211, -16.8281,   4.4844,  ..., -21.0312,  24.7031,  15.0234],
        [-17.0938, -19.0000,  -0.3831,  ...,  21.5469, -30.2344, -13.2188]],
       device='cuda:0', dtype=torch.float16)
✅ Triton and Torch match
triton_output_with_fp8_inputs=tensor([[-21.4375,  13.1719,   6.0352,  ...,  28.7031,   8.6719, -40.7500],
        [ 10.0000,  37.0000,  -5.5664,  ...,  20.9844,  46.8125,  30.8281],
        [ 19.5625,  -3.0078, -20.0469,  ...,  -2.1309,  -8.0625,  12.5625],
        ...,
        [-18.1562, -34.1562, -27.4219,  ..., -27.3906, -24.0938, -12.3516],
        [ -3.3945,  -8.6250, -23.6562,  ...,  -4.1094,  -3.5332, -16.0781],
        [-23.9688,  -3.2637, -33.6875,  ...,  17.3125, -36.6250,  25.8594]],
       device='cuda:0', dtype=torch.float16)
torch_output_with_fp8_inputs=tensor([[-21.4375,  13.1719,   6.0352,  ...,  28.7031,   8.6719, -40.7500],
        [ 10.0000,  37.0000,  -5.5664,  ...,  20.9844,  46.8125,  30.8281],
        [ 19.5625,  -3.0078, -20.0469,  ...,  -2.1309,  -8.0625,  12.5625],
        ...,
        [-18.1562, -34.1562, -27.4219,  ..., -27.3906, -24.0938, -12.3516],
        [ -3.3945,  -8.6250, -23.6562,  ...,  -4.1094,  -3.5332, -16.0781],
        [-23.9688,  -3.2637, -33.6875,  ...,  17.3125, -36.6250,  25.8594]],
       device='cuda:0', dtype=torch.float16)
✅ Triton and Torch match
matmul-performance-fp16:
         M       N       K      cuBLAS      Triton
0    256.0   256.0   256.0    4.096000    4.096000
1    384.0   384.0   384.0   11.059200   11.059200
2    512.0   512.0   512.0   26.214401   23.831273
3    640.0   640.0   640.0   42.666665   42.666665
4    768.0   768.0   768.0   63.195428   58.982401
5    896.0   896.0   896.0   78.051553   82.642822
6   1024.0  1024.0  1024.0  104.857603   80.659693
7   1152.0  1152.0  1152.0  135.726544  102.964963
8   1280.0  1280.0  1280.0  163.840004  128.000000
9   1408.0  1408.0  1408.0  151.438217  118.516867
10  1536.0  1536.0  1536.0  172.631417  141.557764
11  1664.0  1664.0  1664.0  183.651271  157.875646
12  1792.0  1792.0  1792.0  172.914215  181.281035
13  1920.0  1920.0  1920.0  197.485709  145.515785
14  2048.0  2048.0  2048.0  220.752852  162.885595
15  2176.0  2176.0  2176.0  214.081356  174.988240
16  2304.0  2304.0  2304.0  236.513589  183.752861
17  2432.0  2432.0  2432.0  205.069087  177.813069
18  2560.0  2560.0  2560.0  227.555548  193.893484
19  2688.0  2688.0  2688.0  199.647657  166.373049
20  2816.0  2816.0  2816.0  212.752230  180.224004
21  2944.0  2944.0  2944.0  224.486628  172.443024
22  3072.0  3072.0  3072.0  210.494802  188.116629
23  3200.0  3200.0  3200.0  216.949149  195.718662
24  3328.0  3328.0  3328.0  210.500857  178.196281
25  3456.0  3456.0  3456.0  221.487820  178.366298
26  3584.0  3584.0  3584.0  220.922331  190.498706
27  3712.0  3712.0  3712.0  210.753890  193.975417
28  3840.0  3840.0  3840.0  211.456969  179.824383
29  3968.0  3968.0  3968.0  210.386099  187.153271
30  4096.0  4096.0  4096.0  222.583299  195.652669
matmul-performance-fp8:
         M       N       K      Triton
0    256.0   256.0   256.0    4.096000
1    384.0   384.0   384.0   11.059200
2    512.0   512.0   512.0   26.214401
3    640.0   640.0   640.0   46.545454
4    768.0   768.0   768.0   58.982401
5    896.0   896.0   896.0   87.808000
6   1024.0  1024.0  1024.0   91.180520
7   1152.0  1152.0  1152.0  119.439363
8   1280.0  1280.0  1280.0  141.241376
9   1408.0  1408.0  1408.0  132.970149
10  1536.0  1536.0  1536.0  157.286398
11  1664.0  1664.0  1664.0  160.694855
12  1792.0  1792.0  1792.0  184.252856
13  1920.0  1920.0  1920.0  166.554219
14  2048.0  2048.0  2048.0  192.841562
15  2176.0  2176.0  2176.0  186.330074
16  2304.0  2304.0  2304.0  207.720621
17  2432.0  2432.0  2432.0  199.251522
18  2560.0  2560.0  2560.0  208.713373
19  2688.0  2688.0  2688.0  195.531224
20  2816.0  2816.0  2816.0  207.686706
21  2944.0  2944.0  2944.0  212.974490
22  3072.0  3072.0  3072.0  205.156169
23  3200.0  3200.0  3200.0  213.333323
24  3328.0  3328.0  3328.0  205.103410
25  3456.0  3456.0  3456.0  201.553925
26  3584.0  3584.0  3584.0  213.069643
27  3712.0  3712.0  3712.0  215.295995
28  3840.0  3840.0  3840.0  205.944129
29  3968.0  3968.0  3968.0  216.738793
30  4096.0  4096.0  4096.0  220.029067
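(For reading the tables: the benchmark reports throughput in TFLOP/s, converted from the measured kernel time as in the tutorial's perf helper:)

```python
# Throughput as the tutorial's benchmark computes it: a matmul performs
# 2*M*N*K FLOPs; ms is the measured kernel time in milliseconds, and
# M, N, K come from the benchmark arguments.
perf = lambda ms: 2 * M * N * K * 1e-12 / (ms * 1e-3)
```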

Environment details

Triton: 3.3.0
PyTorch: 2.8.0.dev20250417+cu128
GPU: A100
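For completeness, the details above can be captured with the standard version attributes (nothing exotic here):

```python
import torch
import triton

print("Triton:", triton.__version__)
print("PyTorch:", torch.__version__)
print("GPU:", torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100-SXM4-40GB"
```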
