
Add PT compileable support for flash_attn_with_kvcache #1592


Open · wants to merge 1 commit into main

Conversation

@jataylo commented Apr 14, 2025

Continues #1139 adding custom op for flash_attn_with_kvcache.

On a transformers model this improves performance by >2x by avoiding graph breaks. One gotcha: with this implementation, PyTorch 2.6 throws an error in user code when reshaping the FA output:

```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: <weakref at 0x7f10e00494e0; to 'torch.storage.UntypedStorage' at 0x7f10e0049400>
```

This is not an issue on PyTorch 2.7, so I worked around it conditionally: the op returns a clone of the output tensors only on PT versions earlier than 2.7, and only when compile is being used.

@jataylo (Author) commented Apr 16, 2025

@tridao Alternatively, if preferred, instead of conditionalising the clone for < PT 2.7 we could just disable compileable support for this op below 2.7; the additional clone could cause regressions and increase memory usage.

@tridao (Member) commented Apr 22, 2025

We will drop support for PyTorch < 2.4, so you can simplify the code.
I'll need to think more about the clone. Does it slow things down when running in eager?
