MLA: allow Q8_0 K-cache for MLA #206
Merged
After PR #205 we have two KV caches left when using MLA:
- `kv_l` - contiguous, not transposed
- `kvt_l` - a transposed version of `kv_l`
`kv_l` can be quantized, and this PR adds the necessary changes.
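To make this concrete, here is a minimal usage sketch of how a `Q8_0` K-cache could be requested through the public `llama.cpp` API. It assumes the MLA `kv_l` cache follows the regular `type_k` context parameter (the `-ctk` command-line flag); that mapping, the model file name, and the context size are assumptions of the sketch, not something stated in this PR.

```cpp
// Minimal usage sketch (not code from this PR). Assumes the MLA kv_l cache
// honors the regular type_k context parameter; the model path is made up.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("deepseek-lite-iq4_xs.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx  = 8192;
    cparams.type_k = GGML_TYPE_Q8_0; // quantized kv_l (enabled by this PR)
    // kvt_l, the transposed copy, stays in a float type (f16/bf16) - see below.

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    // ... run inference ...
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```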
`kvt_l`, being a transposed version of `kv_l`, cannot be quantized. It can be eliminated by setting `MLA_USE_TRANSPOSED_CACHE` to 0 in `llama.cpp` (but then `kv_l` cannot be quantized either, as making a contiguous transposed tensor out of a quantized one, which is needed during inference in that case, does not work at this point).
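To illustrate why the two caches are treated differently, here is a rough sketch of the cache allocation. Only `MLA_USE_TRANSPOSED_CACHE`, `kv_l` and `kvt_l` come from this PR; the function name, shapes, types and variable names are illustrative, not the actual code in `llama.cpp`.

```cpp
#include "ggml.h"

#define MLA_USE_TRANSPOSED_CACHE 1 // set to 0 in llama.cpp to drop kvt_l

// Illustrative sketch of the MLA cache allocation, not the actual implementation.
static void alloc_mla_cache(struct ggml_context * ctx, int64_t n_embd_k, int64_t kv_size) {
    // kv_l is contiguous and written row-by-row, so it can use a quantized type.
    struct ggml_tensor * kv_l = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, n_embd_k, kv_size);

#if MLA_USE_TRANSPOSED_CACHE
    // kvt_l holds the same data pre-transposed. Quantized blocks cannot be
    // updated column-by-column, so this tensor has to stay f16/bf16.
    struct ggml_tensor * kvt_l = ggml_new_tensor_2d(ctx, GGML_TYPE_BF16, kv_size, n_embd_k);
    (void) kvt_l;
#else
    // Without kvt_l, the transposed view must be materialized at inference time,
    // e.g. ggml_cont(ctx, ggml_transpose(ctx, kv_l)), which does not work for
    // quantized tensors at this point - so kv_l would have to stay f16/bf16 too.
#endif
    (void) kv_l;
}
```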
Apart from reducing the required KV cache memory, a quantized `kv_l` cache can also slightly improve TG performance after a long prompt. Here is a comparison between the main branch and this PR for `tg64@ppN` for different prompt lengths `N`. The model is `IQ4_XS`-quantized DeepSeek-Lite. The main branch results are for `fp16` `kv_l` and `kvt_l` caches; the PR used `Q8_0` for `kv_l` and `bf16` for `kvt_l` (using `bf16` only makes sense on a CPU with native support, such as the Ryzen-7950X used to run the benchmark).