Optimized GEMM/GEMV for IQ1_S #212
Merged
Apparently there are many people who would prefer to just run Unsloth's `IQ1_S` DeepSeek-R1 model as is, instead of quantizing to `IQ1_S_R4` and taking advantage of the better model quality and improved inference speed. So, here is an `iqk_mul_mat.cpp` implementation for `IQ1_S`.
I don't have the ability to run DeepSeek-R1, so I am using DeepSeek-Lite as a surrogate to test performance, as it has the same architecture. The downside is that we don't test "pure" `IQ1_S` performance: various tensors that would have been quantized to `IQ1_S` get quantized to `IQ4_NL` instead, because their row sizes are not divisible by 256 (the `IQ1_S` block size). Performance tests are run on Ryzen-7950X (`Zen4`), Ryzen-5975WX (`AVX2`), and M2-Max CPU (`NEON`).

I think one can do better by interleaving 4 rows on the fly, but I leave this for another day.