
Faster Q4_K_R4 and Q5_K_R4 on AVX2/Zen4 #182


Merged (7 commits) on Jan 30, 2025
Conversation

ikawrakow (Owner)
TG is about the same. Below is a PP-512 comparison between main and this PR for LLaMA-3.1-8B on a Ryzen-5975WX (AVX2) and a Ryzen-7950X (Zen4):

| model | backend | threads | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K_S | AVX2 | 32 | pp512 | 291.90 ± 0.64 | 327.98 ± 0.51 | 1.124 |
| llama 8B Q5_K_S | AVX2 | 32 | pp512 | 273.59 ± 0.37 | 302.13 ± 0.61 | 1.104 |
| llama 8B Q4_K_S | Zen4 | 16 | pp512 | 258.78 ± 1.05 | 267.69 ± 0.31 | 1.034 |
| llama 8B Q5_K_S | Zen4 | 16 | pp512 | 246.19 ± 0.65 | 249.12 ± 0.42 | 1.012 |

The improvement on Zen4 is very minor. The benefit there is mainly code-bloat reduction, as Zen4 now reuses the same implementation as AVX2.
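For context on what the `_R4` repacking does: it interleaves the quantized data of four consecutive rows block by block, so a single pass over the activations can accumulate four output rows at once. Below is an illustrative pure-Python sketch of that idea using plain floats; the function names, the block size of 32, and the float representation are my assumptions for illustration only, not the actual ik_llama.cpp code (the real Q4_K_R4/Q5_K_R4 layouts pack 4- and 5-bit quants together with their block scales).

```python
def interleave_rows_r4(weights, block=32):
    """Flatten a rows x cols matrix so that groups of 4 consecutive rows
    are interleaved block by block: r0[0:32], r1[0:32], r2[0:32], r3[0:32],
    r0[32:64], ...  Data for 4 output rows then sits contiguously."""
    rows, cols = len(weights), len(weights[0])
    assert rows % 4 == 0 and cols % block == 0
    packed = []
    for g in range(0, rows, 4):          # groups of 4 consecutive rows
        for b in range(0, cols, block):  # walk each row in SIMD-sized blocks
            for r in range(4):           # block b of rows g..g+3, back to back
                packed.extend(weights[g + r][b:b + block])
    return packed

def gemv_r4(packed, x, rows, block=32):
    """Matrix-vector product over the packed layout: each pass over a
    block of x updates 4 output rows from adjacent memory."""
    cols = len(x)
    y = [0.0] * rows
    i = 0
    for g in range(0, rows, 4):
        for b in range(0, cols, block):
            for r in range(4):           # in the real kernel: one SIMD loop
                for k in range(block):   # keeping 4 accumulators live
                    y[g + r] += packed[i] * x[b + k]
                    i += 1
    return y
```

In the real AVX2 kernel the two innermost loops become a single vectorized loop over the activations with four accumulators, which is what makes the repacked layout faster than processing one row at a time.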

Commit messages from this PR:

- We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 291 t/s when I last measured on 3c5f872. With FA and Q8_0 K-cache we get to 339.5 t/s.
- We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 273 t/s.
- After the changes I made to AVX2, it ends up being slightly faster than what I had for Zen4.
@ikawrakow merged commit 2e6b523 into main on Jan 30, 2025