This PR adds Flash Attention for MLA for the CPU back-end. This should be of interest to people running DeepSeek-V3/R1 on the CPU.
Benefits:
* Quantized KV cache can be used. This is also possible with `-mla 2`, but it comes at a significant performance penalty (the transposed view of the cache needs to be computed on each compute graph evaluation).
* The `K*Q` tensor, which is the major contributor to compute buffer size for long contexts, never materializes (see the sketch after this list). One can keep the compute buffer size to a desired maximum using the `-amb` option, but this comes with the inconvenience of having to think about compute buffer sizes, and a small performance penalty for large contexts.
* Better performance than `-mla 1` (but performance for long contexts is still lower than standard attention with FA).
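To illustrate the second point, here is a minimal, self-contained C++ sketch of the online-softmax formulation that flash attention is built on. It is not the kernel added in this PR (that one is vectorized, multi-threaded, and works directly on the quantized MLA cache); the function name `attend_one_query` and the plain-float `K`/`V` layout are just for illustration. What it shows is that attention scores are produced and consumed one block at a time, so the full `K*Q` score tensor never needs to exist in memory.

```cpp
// Minimal sketch of the online-softmax trick behind flash attention.
// Not the kernel in this PR; it only shows why the full K*Q score
// matrix never needs to be materialized.
#include <algorithm>
#include <cmath>
#include <vector>

// One query head attending over n_kv cached keys/values of dimension d.
// K and V are stored row-major: K[t*d + i] is component i of key t.
std::vector<float> attend_one_query(const float* q, const float* K, const float* V,
                                    int n_kv, int d, int block = 64) {
    std::vector<float> out(d, 0.0f);     // running weighted sum of values
    float running_max = -INFINITY;       // max score seen so far
    float running_sum = 0.0f;            // sum of exp(score - running_max)
    const float scale = 1.0f / std::sqrt((float)d);

    std::vector<float> s(block);         // scores for ONE block of keys only
    for (int t0 = 0; t0 < n_kv; t0 += block) {
        const int nb = std::min(block, n_kv - t0);

        // 1) compute scores for this block of keys
        float block_max = -INFINITY;
        for (int t = 0; t < nb; ++t) {
            float dot = 0.0f;
            for (int i = 0; i < d; ++i) dot += q[i] * K[(t0 + t)*d + i];
            s[t] = dot * scale;
            block_max = std::max(block_max, s[t]);
        }

        // 2) rescale previously accumulated results if the running max grew
        const float new_max    = std::max(running_max, block_max);
        const float correction = std::exp(running_max - new_max);
        running_sum *= correction;
        for (int i = 0; i < d; ++i) out[i] *= correction;

        // 3) accumulate this block's contribution
        for (int t = 0; t < nb; ++t) {
            const float w = std::exp(s[t] - new_max);
            running_sum += w;
            for (int i = 0; i < d; ++i) out[i] += w * V[(t0 + t)*d + i];
        }
        running_max = new_max;
    }

    for (int i = 0; i < d; ++i) out[i] /= running_sum;   // final normalization
    return out;
}
```

Only the small per-block score buffer is ever needed, instead of an n_head x n_batch x n_kv score tensor, which is where the compute buffer savings come from.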
Here is what we get for KV cache and compute buffer size for DeepSeek-Lite with just MLA for a context of 65k tokens:
And here is the same with FA enabled:
For DeepSeek-V3/R1 the KV cache will be 61/27 = 2.26X larger (61 layers vs 27 for DeepSeek-Lite). Without FA, the compute buffer would be 8X larger (8X more heads); with FA it would be only marginally larger (due to the larger embedding size).

Just for fun, here is what we need without MLA:
And now without MLA and without FA (i.e., what one has available in mainline `llama.cpp`):

Hahaha - 14.2 GiB. For DeepSeek-V3/R1, scale the KV cache size by 2.26 and the compute buffer size by 8, so 44 GiB.
Anyway, here is a performance comparison between FlashMLA and regular MLA for DeepSeek-Lite on a Ryzen-7950X (`Zen4`) and a Ryzen-5975WX (`AVX2`):
I.e., about the same on `Zen4` and slightly better on vanilla `AVX2`. I think the lower performance at 16k tokens can be improved, but I leave this for another PR.

Here is the same but for TG as a function of the number of tokens in the KV cache:
I.e., very slightly better on `Zen4` and slightly slower on vanilla `AVX2`.

Supported KV cache types are:
* `F16`
* `BF16` (if the CPU has native support for `BF16` instructions)
* `Q8_0`
* `Q8_KV` - the fastest option
* `Q6_0`
I didn't allow lower quantization than `Q6_0` because a) quality loss becomes significant; b) build time becomes too long as one adds additional quantization types; and c) the KV cache is now so much smaller compared to standard attention that it does not make sense to be stingy with KV cache bits.
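To make the storage trade-off concrete, below is a simplified, `Q8_0`-style sketch of how one cache row can be stored as blocks of 32 int8 values with a per-block scale, i.e. roughly 8.5 bits per value instead of 16 for `F16`. This is only an illustration: the struct `BlockQ8` and the two helpers are made up for this example, and the actual `Q8_0`/`Q8_KV`/`Q6_0` implementations live in the ggml quantization code and differ in detail (e.g. the real `Q8_0` block stores its scale as fp16).

```cpp
// Simplified illustration of Q8_0-style storage for one KV cache row.
// The real ggml block_q8_0 stores the scale as fp16 (34 bytes per 32 values,
// ~8.5 bits/value vs 16 for F16); this sketch uses a float scale to stay short.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK = 32;              // values per block, as in ggml's Q8_0

struct BlockQ8 {
    float  d;                       // per-block scale (fp16 in the real format)
    int8_t qs[QK];                  // quantized values
};

// Quantize a row of n floats (n assumed to be a multiple of QK).
std::vector<BlockQ8> quantize_row(const float* x, int n) {
    std::vector<BlockQ8> row(n / QK);
    for (std::size_t b = 0; b < row.size(); ++b) {
        float amax = 0.0f;          // absolute maximum within the block
        for (int i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(x[b*QK + i]));
        const float d  = amax / 127.0f;
        const float id = d ? 1.0f / d : 0.0f;
        row[b].d = d;
        for (int i = 0; i < QK; ++i)
            row[b].qs[i] = (int8_t)std::lround(x[b*QK + i] * id);
    }
    return row;
}

// Reconstruct the row: y[j] = scale * quantized value.
void dequantize_row(const std::vector<BlockQ8>& row, float* y) {
    for (std::size_t b = 0; b < row.size(); ++b)
        for (int i = 0; i < QK; ++i) y[b*QK + i] = row[b].d * row[b].qs[i];
}
```

Lower-bit formats shrink the per-block payload further at the cost of quantization error, which is the quality/size trade-off behind stopping at `Q6_0`.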