
Refactoring of multi-head attention and support for KV caching #2061


Open: mseeger wants to merge 3 commits into main from kvcache4
Conversation

@mseeger (Contributor) commented May 30, 2025

This continues from #1934. I created a new branch because the history of the previous one was messed up by a merge operation.

Adds an abstraction for key-value caches and implements batched inference.

I am also adding two baseline KV caches: the default one from before (all KVs are stored) and a last-recent one.
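To give a rough idea of what the two baselines do (the class names, method names, and tensor shapes below are my own illustration, not the code in this PR): the dense cache keeps every key/value it has seen, while the last-recent cache evicts everything outside a fixed window.

```python
import torch

class DenseCacheSketch:
    """Illustrative only: keep all keys/values seen so far (the current default behaviour)."""

    def __init__(self):
        self.k = None  # (batch, n_heads, T, head_dim)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Append the new positions along the time dimension and return the full cache.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

class LastRecentCacheSketch(DenseCacheSketch):
    """Illustrative only: keep just the most recent `window` positions."""

    def __init__(self, window: int):
        super().__init__()
        self.window = window

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        k, v = super().update(k_new, v_new)
        # Evict everything older than the last `window` positions.
        self.k, self.v = k[:, :, -self.window :], v[:, :, -self.window :]
        return self.k, self.v
```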

OK, this PR contains the following parts:

  • Small things: a start-of-layer hook in GPT.forward and skip_lm_head in GPT.forward. I need these for gradient computation, but also to put proper head models on top of the transformer; they are generally useful.
  • Refactoring of multi-head attention: this is needed to implement the KV cache abstraction in the way @t-vi suggested (in a phone call), but it also genuinely simplifies things. It further removes a major issue: mask_cache requires a lot of memory; the mask is now computed on demand, with particular attention to inference (where the query is much shorter than the key).
  • A proper KV cache abstraction, which slightly changes how GPT.forward is called (namely, input_pos is passed as an int) but simplifies things overall. I also provide a few default implementations; DenseKVCache replicates what is currently in place. A sketch of the on-demand mask and the new call convention follows this list.
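As a rough sketch of the last two points (my own illustration, not code from this PR; the model(...) call in the comments assumes a signature along the lines described above): computing only the (q_len, k_len) slice of the causal mask avoids materializing a full mask_cache, which matters most during generation, where q_len is typically 1 while k_len equals the number of cached positions.

```python
import torch

def causal_mask_on_demand(q_len: int, k_len: int, device=None) -> torch.Tensor:
    """Build only the (q_len, k_len) slice of the causal mask that is needed.

    The queries are assumed to be the *last* q_len positions of a sequence of
    length k_len, which is the situation during cached inference.
    """
    offset = k_len - q_len  # absolute position of the first query token
    q_idx = torch.arange(q_len, device=device).unsqueeze(1)  # (q_len, 1)
    k_idx = torch.arange(k_len, device=device).unsqueeze(0)  # (1, k_len)
    return k_idx <= q_idx + offset  # True where attention is allowed

# Hypothetical call convention with input_pos as a plain int rather than a
# tensor of positions: prefill the prompt, then decode one token per step.
#
#   logits = model(prompt_tokens, input_pos=0)
#   logits = model(next_token, input_pos=prompt_tokens.shape[1])
```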

In the library I am writing, there are a number of additional, more powerful KV caches, such as H2O and quantization-aware H2O. I am also working on fine-tuning in the presence of KV caches. The abstraction I propose here enables all of that.

If these changes are not made, I'd have to copy and change quite a bit of your code. That would be hard to maintain and would run the risk that KV caches are implemented differently at a later point, and then things really diverge.

As I said in the comments above, I found KV caching to be super important for making large-context inference work on a moderate GPU budget, which should be of interest to your customers as well.

@mseeger (Contributor, Author) commented May 30, 2025

Started work to make sure all tests pass.

@mseeger (Contributor, Author) commented May 30, 2025

@t-vi, @Borda, just a heads-up: I am continuing the work from #1934 in this PR.

@mseeger (Contributor, Author) commented May 30, 2025

Tests fail for me that should also fail on mainline. For example, test_against_multimodal_gemma_3 in test_models.py fails in copy_weights_gemma_3, because the skip logic there checks for the prefix "vision_tower" or "language_model", but the keys really start with "model.vision_tower" or "model.language_model".

??
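For reference, the mismatch boils down to a prefix check along these lines (a hypothetical sketch, not the actual copy_weights_gemma_3 code):

```python
def should_skip(key: str) -> bool:
    # Checking only for "vision_tower" / "language_model" misses the keys of
    # the multimodal checkpoint, which actually look like "model.vision_tower.*"
    # and "model.language_model.*"; the check has to account for that prefix.
    return key.startswith(("model.vision_tower", "model.language_model"))
```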

@mseeger (Contributor, Author) commented May 30, 2025

I'll submit a PR with a fix.

@mseeger force-pushed the kvcache4 branch 9 times, most recently from 0546608 to 5442ea0 on June 6, 2025 at 08:37
@Borda (Member) commented Jun 10, 2025

> I'll submit a PR with a fix.

Could you also link the PR here?
