
Optimized GEMM/GEMV for IQ1_S #212


Merged 4 commits on Feb 20, 2025

Conversation

ikawrakow (Owner) commented Feb 20, 2025

Apparently there are many people who would prefer to just run Unsloth's IQ1_S DeepSeek-R1 model as is instead of quantizing to IQ1_S_R4 and taking advantage of the better model quality and improved inference speed.

So, here is an iqk_mul_mat.cpp implementation for IQ1_S.
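For reference, the IQ1_S super-block these kernels read is laid out roughly as sketched below (from memory; the authoritative definition is in ggml's `ggml-common.h`, and field details may differ slightly between versions):

```cpp
// Sketch of the IQ1_S super-block layout (from memory; see ggml-common.h for
// the authoritative definition). One block covers 256 weights:
//   2 bytes fp16 super-block scale + 32 bytes low grid-index bits
//   + 16 bytes of packed high index bits, group scales and shift bits
//   = 50 bytes per 256 weights = 1.5625 bpw.
#include <cstdint>

#define QK_K 256
typedef uint16_t ggml_half;          // fp16 stored as a raw 16-bit value here

typedef struct {
    ggml_half d;                     // super-block scale
    uint8_t   qs[QK_K/8];            // low 8 bits of the codebook indices
    uint16_t  qh[QK_K/32];           // high index bits, group scales, shift bits
} block_iq1_s;

static_assert(sizeof(block_iq1_s) == 2 + QK_K/8 + QK_K/16, "expect 50 bytes");
```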

I don't have the ability to run DeepSeek-R1, so I'm using DeepSeek-Lite as a surrogate for performance testing since it has the same architecture. The downside is that we don't measure "pure" IQ1_S performance: various tensors that would have been quantized to IQ1_S end up as IQ4_NL because their row sizes are not divisible by 256, the IQ1_S block size (see the sketch after the table). Performance tests were run on a Ryzen-7950X (Zen4), a Ryzen-5975WX (AVX2), and an M2-Max CPU (NEON).

| model | backend | threads | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B IQ1_S | AVX2 | 32 | pp512 | 209.49 ± 0.61 | 484.99 ± 4.61 | 2.315 |
| deepseek2 16B IQ1_S |  | 2 | tg128 | 12.13 ± 0.01 | 15.74 ± 0.01 | 1.298 |
| deepseek2 16B IQ1_S |  | 4 | tg128 | 21.26 ± 0.01 | 26.29 ± 0.05 | 1.237 |
| deepseek2 16B IQ1_S |  | 8 | tg128 | 30.85 ± 0.07 | 36.24 ± 0.13 | 1.175 |
| deepseek2 16B IQ1_S |  | 16 | tg128 | 40.04 ± 0.01 | 42.00 ± 0.01 | 1.049 |
| deepseek2 16B IQ1_S | Zen4 | 16 | pp512 | 142.33 ± 1.06 | 496.32 ± 1.75 | 3.487 |
| deepseek2 16B IQ1_S |  | 2 | tg128 | 14.15 ± 0.02 | 19.08 ± 0.01 | 1.348 |
| deepseek2 16B IQ1_S |  | 4 | tg128 | 24.34 ± 0.01 | 31.31 ± 0.08 | 1.286 |
| deepseek2 16B IQ1_S |  | 8 | tg128 | 35.64 ± 0.01 | 42.48 ± 0.02 | 1.192 |
| deepseek2 16B IQ1_S |  | 16 | tg128 | 44.37 ± 0.08 | 47.84 ± 0.18 | 1.078 |
| deepseek2 16B IQ1_S | NEON | 8 | pp512 | 88.77 ± 0.30 | 229.23 ± 1.53 | 2.582 |
| deepseek2 16B IQ1_S |  | 2 | tg128 | 17.80 ± 0.01 | 22.72 ± 0.00 | 1.276 |
| deepseek2 16B IQ1_S |  | 4 | tg128 | 29.80 ± 0.13 | 37.27 ± 0.24 | 1.251 |
| deepseek2 16B IQ1_S |  | 8 | tg128 | 49.28 ± 0.07 | 59.28 ± 0.27 | 1.203 |
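Regarding the IQ4_NL fallback mentioned above: the dispatch is not part of this PR, but the gist is simply block-size divisibility. A minimal, hypothetical sketch (the helper name is made up; the real logic lives in llama.cpp's quantization code):

```cpp
// Hypothetical illustration (not the repo's actual dispatch) of why tensors
// whose row size is not a multiple of 256 cannot be stored as IQ1_S and fall
// back to a small-block type such as IQ4_NL (block size 32).
#include <cstdint>

enum class QuantType { IQ1_S, IQ4_NL };

constexpr int64_t kIQ1SBlockSize = 256;   // IQ1_S super-block (QK_K)

// Made-up helper name, for illustration only.
QuantType pick_low_bit_type(int64_t row_size) {
    return row_size % kIQ1SBlockSize == 0 ? QuantType::IQ1_S : QuantType::IQ4_NL;
}
```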

I think one can do better by interleaving 4 rows on the fly, but I leave this for another day.
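For anyone curious, "interleaving 4 rows on the fly" could look roughly like the hypothetical sketch below (not code from this PR): rather than storing the weights pre-interleaved as the _R4 types do, the GEMM kernel would gather the same block from 4 adjacent rows into a small tile just before processing it, so a single pass over the activation block yields 4 partial dot products.

```cpp
// Hypothetical sketch of on-the-fly 4-row interleaving (not this PR's code).
// `Block` stands for a quantized super-block type such as block_iq1_s.
#include <cstring>

template <typename Block>
struct Tile4 {
    Block rows[4];                    // same block position from 4 adjacent rows
};

template <typename Block>
void gather_tile(const Block * const row_ptr[4], int block_index, Tile4<Block> & tile) {
    // Copy one block from each of the 4 rows into contiguous storage so the
    // SIMD kernel can dequantize and multiply all 4 rows in one pass over the
    // activations, amortizing the activation loads across rows.
    for (int r = 0; r < 4; ++r) {
        std::memcpy(&tile.rows[r], &row_ptr[r][block_index], sizeof(Block));
    }
}
```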

@godrosev

ikawrakow, thank you so much. This helped me a lot!
Also, it's not that I'm reluctant to use IQ1_S_R4. Rather, I need a smaller file size and memory footprint (you said it would reduce the size by a few GB); it's just that my current work requires running the ready-made Unsloth DeepSeek-R1.
As soon as I'm done with that job, I'll do my own IQ1_S_R4 quantization following your suggestion; my machine is well suited to testing the 671B R1, and I'll report the results! I am 100% convinced that this new way of quantizing (IQ1_S_R4) will have better quality and speed!!
Thanks again!
