Describe the issue
Consider the following code:
This does an outer product across the N dimension; namely, a 16x256x16 matrix multiply. Unfortunately, the code generated for this is quite poor: Triton loads x just fine, but then it tries to transpose it element-by-element (this is immediately obvious in the SASS code, which is full of LDS.U16 or whatever). This causes extreme traffic to the shared memory subsystem, so much so that it is actually faster to load x from global memory twice (the second time with transposed strides) so that the inputs to the product are not transposed.
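To make the "transposed strides" workaround concrete, here is a minimal pure-Python sketch of the indexing idea only. It is not the Triton kernel from this issue: the shapes (N = 256, K = 16), the row-major layout, and the helpers `load` and `matmul` are all assumptions made for illustration.

```python
# Assumed layout: x is N x K, stored row-major in one flat buffer.
N, K = 256, 16
buf = [float((i * 7 + 3) % 11) for i in range(N * K)]

def load(buf, rows, cols, stride_r, stride_c):
    """Gather a rows x cols tile from a flat buffer using explicit strides
    (hypothetical helper mimicking base + r * stride_r + c * stride_c)."""
    return [[buf[r * stride_r + c * stride_c] for c in range(cols)]
            for r in range(rows)]

def matmul(a, b):
    """Naive matrix multiply on lists of lists (hypothetical helper)."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# First load: the usual N x K view.
x = load(buf, N, K, K, 1)
# Second load of the same buffer with the strides swapped: a K x N view
# that is already transposed, so no separate transpose step is needed.
xt = load(buf, K, N, 1, K)

# K x K result with the reduction running over N.
out = matmul(xt, x)

# Reference: explicitly transpose the first load, then multiply.
ref = matmul([list(col) for col in zip(*x)], x)
```

Both paths compute the same product; the second load trades extra global-memory reads for not having to transpose the tile after it has been loaded.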
I would expect this code to use the transposed version of the MMA instructions instead or, barring that, to do a shuffle in registers rather than trying to transpose element-by-element in shared memory. Even vectorizing the loads would help.
For reference, here's the code I pulled out of Nsight Compute for this:
Environment details
Triton: main
GPU: GH200