
Metal implementation for the trellis quants. #475


Merged

ikawrakow merged 8 commits into main on Jun 1, 2025
Conversation

ikawrakow
Owner

@ikawrakow ikawrakow commented May 30, 2025

IQ2_KT and IQ3_KT work. IQ2_KT has pretty decent performance.

IQ4_KT is not working, so this is a draft PR for now.

IQ4_KT is disabled for now as there is a bug that I haven't found yet.

Kawrakow added 8 commits May 30, 2025 07:52
Performance is actually quite decent: 52 t/s on my M2-Max for Llama-3.1-8B
Performance is not as good as iq2_kt: 40 t/s on my M2-Max for Llama-3.1-8B.
Flipping signs is a costly affair.
@ikawrakow ikawrakow marked this pull request as ready for review June 1, 2025 12:22
@ikawrakow ikawrakow merged commit 35374bc into main Jun 1, 2025
@ikawrakow ikawrakow mentioned this pull request Jun 1, 2025
Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Jun 2, 2025
* iq2_kt: Metal dequantize

* iq2_kt: Metal GEMV

Performance is actually quite decent: 52 t/s on my M2-Max for Llama-3.1-8B

* iq3_kt: Metal dequantize

* iq3_kt: Metal GEMV

Performance is not as good as iq2_kt: 40 t/s on my M2-Max for Llama-3.1-8B.
Flipping signs is a costly affair.

* iq4_kt: Metal dequantize - getting NaNs

* iq4_kt: Metal GEMV - also not working

* iq4_kt: Metal still not working

* Disable iq4_kt on Metal for now

---------
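The Metal kernels themselves are not shown in this log, but the overall shape of a dequantizing GEMV — which is what the iq2_kt/iq3_kt commits above implement on the GPU — can be sketched in scalar form. Everything below (the row layout of integer codes plus a per-row scale, and the function name) is a hypothetical illustration, not the actual ik_llama.cpp trellis format, which decodes weights through a trellis rather than storing codes directly.

```python
import numpy as np

def dequant_gemv(codes, scales, x):
    """Scalar sketch of a quantized GEMV (y = W @ x).

    Hypothetical layout for illustration only: each weight row is stored
    as small integer codes plus one fp32 scale. The real Metal kernels
    parallelize this per threadgroup and decode trellis-quantized weights
    instead of plain integer codes.
    """
    y = np.empty(codes.shape[0], dtype=np.float32)
    for i in range(codes.shape[0]):
        # Dequantize the row on the fly, then dot with the activations.
        y[i] = scales[i] * np.dot(codes[i].astype(np.float32), x)
    return y
```

The point of fusing dequantization into the GEMV like this is that the full f32 weight matrix never has to be materialized — useful for token generation, where each weight is used only once per token.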

Trellis quants: faster CPU prompt processing (ikawrakow#482)

* Experimenting with dequant + f32 GEMM

For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.

* Experimenting with dequant + f32 GEMM

iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s

* Experimenting with dequant + f16 GEMM on NEON

iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s

Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying with the row scale
before doing the multiply-adds.

* Experimenting with dequant + f16 GEMM on NEON

iq4_kt: PP512 = 86 t/s from 29 t/s

* Minor

---------
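The fp16 overflow described above is easy to reproduce: float16 has a maximum finite value of 65504, so an unscaled product of moderately large operands already overflows to inf, while multiplying by the (small) row scale before the multiply-adds keeps the partial products in range. A minimal demonstration — the value 0.01 is an arbitrary stand-in for a row scale, not a value from the actual quants:

```python
import numpy as np

# float16 tops out at 65504, so this product overflows to inf.
a = np.float16(300.0)
b = np.float16(300.0)
overflowed = a * b            # 300 * 300 = 90000 -> inf in float16

# Applying the row scale before the multiply-adds keeps the partial
# products representable. 'scale' is a hypothetical row scale.
scale = np.float16(0.01)
safe = (a * scale) * b        # 3 * 300 = 900, well within fp16 range
```

The same reordering is what makes the f16 GEMM path viable for iq4_kt in the commit above: scale first, then accumulate.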

Minor (~2%) iq2_ks TG performance improvement on CUDA (ikawrakow#468)

Direct conversion from fp16 to Q6_0