Faster IQ3_KT and IQ4_KT #453


Merged
merged 6 commits into from
May 24, 2025

ikawrakow

The PR improves AVX2 performance for the trellis quants IQ3_KT and IQ4_KT recently added in PR #441.
The results below are for LLaMA-3.1-8B on a Ryzen-5975WX CPU.

IQ3_KT

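| N_KV | S_PP t/s (main) | S_PP t/s (PR) | PP speedup | S_TG t/s (main) | S_TG t/s (PR) | TG speedup |
|-----:|----------------:|--------------:|-----------:|----------------:|--------------:|-----------:|
| 0    | 61.98 | 71.59 | 1.155 | 11.17 | 13.30 | 1.191 |
| 512  | 61.27 | 70.79 | 1.155 | 11.10 | 13.19 | 1.188 |
| 1024 | 60.48 | 69.93 | 1.156 | 11.04 | 13.10 | 1.187 |
| 1536 | 59.94 | 69.15 | 1.154 | 10.95 | 12.96 | 1.184 |
| 2048 | 59.48 | 68.55 | 1.152 | 10.87 | 12.85 | 1.182 |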

IQ4_KT

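| N_KV | S_PP t/s (main) | S_PP t/s (PR) | PP speedup | S_TG t/s (main) | S_TG t/s (PR) | TG speedup |
|-----:|----------------:|--------------:|-----------:|----------------:|--------------:|-----------:|
| 0    | 44.32 | 64.91 | 1.465 | 9.36 | 11.69 | 1.249 |
| 512  | 43.90 | 64.12 | 1.461 | 9.26 | 11.56 | 1.248 |
| 1024 | 43.60 | 63.39 | 1.454 | 9.19 | 11.47 | 1.248 |
| 1536 | 43.32 | 62.86 | 1.451 | 9.12 | 11.37 | 1.247 |
| 2048 | 43.07 | 62.37 | 1.448 | 9.06 | 11.28 | 1.245 |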
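
The speedup columns are the PR throughput divided by the throughput on main. As a quick sanity check, the ratios can be recomputed from the reported t/s values. This is only a sketch: the row data is copied from the two tables, while the `speedups` helper and the tuple layout are made up for illustration and are not part of the PR.

```python
# Rows copied from the tables above:
# (N_KV, S_PP main, S_PP PR, S_TG main, S_TG PR), all speeds in tokens/second.
IQ3_KT = [
    (0,    61.98, 71.59, 11.17, 13.30),
    (512,  61.27, 70.79, 11.10, 13.19),
    (1024, 60.48, 69.93, 11.04, 13.10),
    (1536, 59.94, 69.15, 10.95, 12.96),
    (2048, 59.48, 68.55, 10.87, 12.85),
]
IQ4_KT = [
    (0,    44.32, 64.91, 9.36, 11.69),
    (512,  43.90, 64.12, 9.26, 11.56),
    (1024, 43.60, 63.39, 9.19, 11.47),
    (1536, 43.32, 62.86, 9.12, 11.37),
    (2048, 43.07, 62.37, 9.06, 11.28),
]


def speedups(rows):
    """Return (N_KV, PP speedup, TG speedup) for each row (hypothetical helper)."""
    return [
        (n_kv, pp_pr / pp_main, tg_pr / tg_main)
        for n_kv, pp_main, pp_pr, tg_main, tg_pr in rows
    ]


if __name__ == "__main__":
    for name, rows in (("IQ3_KT", IQ3_KT), ("IQ4_KT", IQ4_KT)):
        print(name)
        for n_kv, pp, tg in speedups(rows):
            print(f"  N_KV={n_kv:5d}  PP x{pp:.3f}  TG x{tg:.3f}")
```

Because the inputs are themselves rounded to two decimals, a recomputed ratio may differ from the table in the last digit.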

CPU performance is still much lower than for other quantization types. However, memory bandwidth is far from saturated, so both PP and TG should improve on a faster CPU with more cores.
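
One rough way to reason about the bandwidth remark: during token generation the weights are read approximately once per token, so the bandwidth actually used is about model size times TG speed. The sketch below applies that rule of thumb. The model size and peak bandwidth are placeholder assumptions rather than measurements from this PR, and the function names are hypothetical; only the 11.69 t/s figure comes from the IQ4_KT table.

```python
def tg_bandwidth_gbs(model_size_gb: float, tokens_per_second: float) -> float:
    """Approximate memory traffic in GB/s, assuming every weight is read once per token."""
    return model_size_gb * tokens_per_second


def utilization(model_size_gb: float, tokens_per_second: float,
                peak_bandwidth_gbs: float) -> float:
    """Fraction of peak memory bandwidth consumed by token generation."""
    return tg_bandwidth_gbs(model_size_gb, tokens_per_second) / peak_bandwidth_gbs


if __name__ == "__main__":
    # ASSUMPTIONS: placeholder values for illustration only. Substitute the real
    # quantized model size and the measured peak bandwidth of your system.
    ASSUMED_MODEL_SIZE_GB = 4.0
    ASSUMED_PEAK_BANDWIDTH_GBS = 100.0

    tg_speed = 11.69  # S_TG t/s (PR) for IQ4_KT at N_KV = 0, from the table

    used = tg_bandwidth_gbs(ASSUMED_MODEL_SIZE_GB, tg_speed)
    frac = utilization(ASSUMED_MODEL_SIZE_GB, tg_speed, ASSUMED_PEAK_BANDWIDTH_GBS)
    print(f"approx. traffic: {used:.1f} GB/s, {frac:.0%} of assumed peak")
```

If the resulting fraction is well below 1, token generation is compute-bound rather than bandwidth-bound, which is the situation in which more or faster cores would be expected to help.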

ikawrakow merged commit a2c42f9 into main on May 24, 2025
ikawrakow mentioned this pull request on Jun 1, 2025