Add reduction example: Long sum #92
Conversation
✅ Results Match ✅ Naive Helion
examples/long_sum.py
Outdated
```python
# Long Sum using Helion's reduction feature
# Config: reduction_loop allows Helion to generate a looped reduction (same as the naive impl above)
# Example Config:
# @helion.kernel(config=helion.Config(block_sizes=[[1]], reduction_loops=[None], num_warps=32, num_stages=4, indexing='block_ptr'))
```
`reduction_loops=[None]` means a reduction loop won't be used, most likely because the autotuner found it was slower. If the goal of the example is to show how Helion can roll reductions, maybe we should pick a config (like `reduction_loops=[1024]`) that generates a loop.
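The reviewer's suggestion might look like the following sketch, reusing the parameter values from the example config above but with a concrete `reduction_loops` value (the value 1024 is the reviewer's illustration, not a tuned choice):

```python
# Hypothetical config forcing a looped reduction, per the review suggestion.
# All other parameters are copied from the example config shown above.
@helion.kernel(
    config=helion.Config(
        block_sizes=[[1]],
        reduction_loops=[1024],  # reduce n in 1024-wide chunks instead of one shot
        num_warps=32,
        num_stages=4,
        indexing="block_ptr",
    )
)
def long_sum(x: torch.Tensor) -> torch.Tensor:
    ...
```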
examples/long_sum.py
Outdated
```python
# Looped Reduction Long Sum
# Example Config:
# @helion.kernel(config=helion.Config(block_sizes=[[32768], [1]], num_warps=16, num_stages=5, indexing='pointer'))
@helion.kernel()
def long_sum(x: torch.Tensor) -> torch.Tensor:
    m, n = x.size()
    out = torch.empty([m], dtype=x.dtype, device=x.device)

    # Call register_block_size to know block_size_n outside of the reduction loop.
    block_size_n = hl.register_block_size(n)

    for tile_m in hl.tile(m):
        acc = hl.zeros([tile_m, block_size_n], dtype=x.dtype)
        for tile_n in hl.tile(n, block_size=block_size_n):  # The reduction loop for n that doesn't fit in a tile.
            acc += x[tile_m, tile_n]
        out[tile_m] = acc.sum(-1)
    return out
```
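The tiling strategy in the kernel above can be sketched in plain NumPy (the helper name and the block size of 8 are illustrative, not from the example):

```python
import numpy as np

def long_sum_reference(x: np.ndarray, block_size_n: int = 8) -> np.ndarray:
    """NumPy sketch of the tiled reduction: accumulate each row's sum in
    block_size_n-wide chunks, then reduce the accumulator once at the end."""
    m, n = x.shape
    out = np.empty(m, dtype=x.dtype)
    for i in range(m):  # stands in for the hl.tile(m) grid loop
        acc = np.zeros(block_size_n, dtype=x.dtype)
        for start in range(0, n, block_size_n):  # the reduction loop over n
            chunk = x[i, start:start + block_size_n]
            # A final partial chunk only touches its own lanes; the rest of
            # acc stays zero (the masked load in the generated Triton code).
            acc[:chunk.size] += chunk
        out[i] = acc.sum()
    return out
```

The two-level structure (wide accumulator, single final reduction) is what lets Triton keep the inner loop free of cross-lane operations.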
In most cases, we want people to write reductions the second way and not this way.
- The second way is less code
- The second way has a larger search space, since it can choose to do a persistent reduction. (Which it looks like the autotuner picked in this case.)

I worry people are going to be copy-and-pasting from our examples, so I don't want a "bad" example to be that prominent. We can keep it, but we should move it after the "good" example and include a clear warning that this restricts the search space and is equivalent to the first one with `reduction_loop != None`.
```python
for tile_m in hl.tile(m):
    out[tile_m] = x[tile_m, :].sum(-1)
```
Helion generated looped reduction:

```python
@triton.jit
def _long_sum_reduction_kernel(x, out, out_stride_0, x_stride_0, x_stride_1, _, _REDUCTION_BLOCK_1: tl.constexpr):
    pid_0 = tl.program_id(0)
    offset_0 = pid_0
    indices_0 = offset_0 + tl.zeros([1], tl.int32)
    sum_1_acc = tl.full([1, _REDUCTION_BLOCK_1], 0, tl.float32)
    for roffset_1 in range(0, _, _REDUCTION_BLOCK_1):
        rindex_1 = roffset_1 + tl.arange(0, _REDUCTION_BLOCK_1).to(tl.int32)
        mask_1 = rindex_1 < _
        load = tl.load(x + (indices_0[:, None] * x_stride_0 + rindex_1[None, :] * x_stride_1), mask_1[None, :], other=0)
        v_0 = tl.where(tl.broadcast_to(mask_1[None, :], [1, _REDUCTION_BLOCK_1]), load, 0)
        v_1 = sum_1_acc + v_0
        sum_1_acc = v_1
    sum_1 = tl.sum(sum_1_acc, 1)
    tl.store(out + indices_0 * out_stride_0, sum_1, None)
```
There is an additional mask step here (the `tl.where` after the load), causing a performance drop of 14%.
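The extra mask is redundant because the masked `tl.load` already substitutes 0 (`other=0`) for out-of-bounds lanes, so a following `tl.where` over the same mask recomputes identical values. A NumPy analogue (the `masked_load` helper simulates Triton's masked-load semantics; names are illustrative):

```python
import numpy as np

def masked_load(row: np.ndarray, start: int, block: int):
    """Simulate tl.load with other=0: out-of-bounds lanes read as 0."""
    idx = start + np.arange(block)
    mask = idx < row.size
    # Clamp indices so the gather is in bounds, then zero the masked lanes.
    loaded = np.where(mask, row[np.minimum(idx, row.size - 1)], 0.0)
    return loaded, mask

row = np.arange(10, dtype=np.float64)
loaded, mask = masked_load(row, 8, 8)   # last, partial block of the row
remasked = np.where(mask, loaded, 0.0)  # the redundant v_0 step
# The second mask changes nothing: zero-filled lanes are already zero.
assert np.array_equal(loaded, remasked)
```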
I'll have a PR to fix this in the next day or two.
Fixed by #109
examples/long_sum.py
Outdated
```python
for tile_n in hl.tile(n, block_size=block_size_n):  # Reduction loop
    acc += x[tile_m, tile_n]
```
The manual implementation translates to:

```python
@triton.jit
def _long_sum_kernel(x, out, out_stride_0, x_stride_0, x_stride_1, n, _BLOCK_SIZE_0: tl.constexpr):
    pid_0 = tl.program_id(0)
    offset_1 = pid_0
    indices_1 = offset_1 + tl.zeros([1], tl.int32)
    acc = tl.full([1, _BLOCK_SIZE_0], 0.0, tl.float32)
    for offset_0 in range(0, n, _BLOCK_SIZE_0):
        indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
        mask_0 = indices_0 < n
        acc_copy = acc
        load = tl.load(x + (indices_1[:, None] * x_stride_0 + indices_0[None, :] * x_stride_1), mask_0[None, :], other=0)
        acc = acc_copy + load
    sum_1 = tl.sum(acc, 1)
    tl.store(out + indices_1 * out_stride_0, sum_1, None)
```
Note how masking only happens during `tl.load`.
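In plain NumPy terms (an illustrative sketch, not the generated code): since out-of-bounds lanes load as 0, they contribute nothing to the accumulator, so no further masking is needed before the final sum.

```python
import numpy as np

def row_sum_masked_loads(row: np.ndarray, block: int = 8) -> float:
    """Accumulate a row in fixed-width blocks, masking only at load time
    (out-of-bounds lanes read as 0), as in the kernel above."""
    acc = np.zeros(block)
    for start in range(0, row.size, block):
        idx = start + np.arange(block)
        mask = idx < row.size
        # tl.load(..., mask, other=0): clamp the gather, zero masked lanes.
        load = np.where(mask, row[np.minimum(idx, row.size - 1)], 0.0)
        acc = acc + load  # no extra tl.where needed on the accumulator
    return float(acc.sum())

row = np.arange(10, dtype=np.float64)
assert row_sum_masked_loads(row) == row.sum()
```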
✅ Results Match ✅ naive reduction

You should be able to add yourself through the fb internal page for the project.