Trellis quants with CPU inference #441
Conversation
Using 12 bits per 8 weights I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By AVX2 SIMDifying the search for the best code, LLaMA-3.1-8B gets quantized in 130 seconds on the Ryzen-7950X CPU - sluggish but still acceptable.
rmse increases by just 3%, so this is beating iq2_xxs in terms of rmse at the same 2.0625 bpw.
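For readers following along, here is a minimal scalar sketch of the kind of brute-force code search described a couple of comments up. The decoder below is a generic LCG stand-in, and the names gen8/best_code_for_group and the constants are illustrative, not from this PR; the actual implementation vectorizes the inner loop with AVX2.

```cpp
#include <cstdint>
#include <limits>

// Stand-in for the trellis decoder: expands a 12-bit code into 8 values via a
// generic LCG (placeholder constants, not the actual 3INST parameters).
static void gen8(uint32_t code, float out[8]) {
    uint32_t state = code;
    for (int i = 0; i < 8; ++i) {
        state = state*1664525u + 1013904223u;
        out[i] = (int32_t)state * (1.0f/2147483648.0f);  // map to roughly [-1, 1)
    }
}

// Exhaustive search: try all 2^12 codes for one group of 8 weights and keep the
// code with the smallest squared error at the given scale.
static uint32_t best_code_for_group(const float x[8], float scale) {
    uint32_t best = 0;
    float best_err = std::numeric_limits<float>::max();
    float vals[8];
    for (uint32_t code = 0; code < (1u << 12); ++code) {
        gen8(code, vals);
        float err = 0;
        for (int i = 0; i < 8; ++i) {
            float diff = x[i] - scale*vals[i];
            err += diff*diff;
        }
        if (err < best_err) { best_err = err; best = code; }
    }
    return best;
}
```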
I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).
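A small sketch of the importance weighting being compared here, assuming sigma^2 is the mean squared weight over the block (the exact normalization in the quantization code may differ); with weight = 1 the loop reduces to plain squared error, which is the apples-to-oranges comparison mentioned above.

```cpp
// Weighted squared error for one block: w[i] = sigma^2/4 + x[i]^2, with sigma^2
// taken as the mean of x^2 over the block (illustrative choice of normalization).
float weighted_err(const float * x, const float * q, float scale, int n) {
    float sigma2 = 0;
    for (int i = 0; i < n; ++i) sigma2 += x[i]*x[i];
    sigma2 /= n;
    float err = 0;
    for (int i = 0; i < n; ++i) {
        float w    = 0.25f*sigma2 + x[i]*x[i];
        float diff = x[i] - scale*q[i];
        err += w*diff*diff;
    }
    return err;
}
```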
so we can run perplexity calcs. As already indicated by rmse, the 2-bit trellis approach is quite a bit worse than iq2_xxs.
With blocks of 32 and 16 bits per group of 8, the brute force search becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA, even after SIMDifying with AVX2). The trick is to group the points into clusters, find the nearest cluster, and only search within that cluster.
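A rough sketch of that clustering idea (the data structures and names are mine, not the PR's): the 2^16 candidate codes are bucketed by the 8-point vectors they decode to, the nearest centroid is found first, and the exhaustive search runs only over that bucket.

```cpp
#include <array>
#include <cstdint>
#include <limits>
#include <vector>

// One cluster: its centroid plus the candidate codes assigned to it, each stored
// together with its decoded 8-point vector.
struct Cluster {
    std::array<float, 8> centroid;
    std::vector<std::pair<uint32_t, std::array<float, 8>>> members;  // (code, decoded values)
};

// Find the nearest centroid first, then run the expensive search only over the
// codes assigned to that cluster instead of all candidates.
static uint32_t clustered_search(const std::vector<Cluster> & clusters, const float x[8]) {
    auto dist2 = [&](const std::array<float, 8> & p) {
        float d = 0;
        for (int i = 0; i < 8; ++i) { float t = x[i] - p[i]; d += t*t; }
        return d;
    };
    // 1) cheap pass: pick the nearest cluster centroid
    int best_c = 0;
    float best_d = std::numeric_limits<float>::max();
    for (int c = 0; c < (int)clusters.size(); ++c) {
        float d = dist2(clusters[c].centroid);
        if (d < best_d) { best_d = d; best_c = c; }
    }
    // 2) exhaustive pass, restricted to that cluster's members
    uint32_t best_code = 0;
    float best_err = std::numeric_limits<float>::max();
    for (const auto & m : clusters[best_c].members) {
        float err = dist2(m.second);
        if (err < best_err) { best_err = err; best_code = m.first; }
    }
    return best_code;
}
```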
Using blocks of 32 and 16 bits per group of 8 weights, it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.
Re-quantize after determining block scales (at the expense of much longer quantization time).
Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU.
We arrive at 112 t/s.
We arrive at 139 t/s (no FA) and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we would be at ~180 t/s on their GPU, almost matching their performance.
We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16.
3.125 bpw. So far does not look good on the PPL vs bpw plot.
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
Nearly 60% improvement in quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.
Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
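A sketch of the layout change these two comments describe, under the same illustrative naming as above: the decoded points of each cluster are packed back to back into one flat buffer at initialization, so the inner search loop becomes a purely sequential, cache-friendly scan.

```cpp
#include <cstdint>
#include <vector>

// Flattened clusters: all member points stored contiguously, with per-cluster
// offsets into the shared buffers (structure names are illustrative).
struct FlatClusters {
    std::vector<float>    points;  // member points of all clusters, back to back (8 floats each)
    std::vector<uint32_t> codes;   // code id of each stored point, in the same order
    std::vector<int>      offset;  // offset[c]..offset[c+1] = point range of cluster c
};

// Search within cluster c: a sequential scan over its contiguous decoded points.
static uint32_t search_in_cluster(const FlatClusters & fc, int c, const float x[8]) {
    uint32_t best_code = 0;
    float best_err = 3.4e38f;
    for (int j = fc.offset[c]; j < fc.offset[c+1]; ++j) {
        const float * p = fc.points.data() + 8*j;   // sequential access
        float err = 0;
        for (int i = 0; i < 8; ++i) { float t = x[i] - p[i]; err += t*t; }
        if (err < best_err) { best_err = err; best_code = fc.codes[j]; }
    }
    return best_code;
}
```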
Is this in debug mode? I'm getting 10.4 t/s for
I'm compiling with Here's how I'm testing:
Should I be using llama-bench or some other tool?
I also tried
We get PP and TG performance as a function of the number of tokens in the KV cache
Ok, well it's great to know the CPU inference performance is not totally unusable and that it's probably just my setup! I will try to figure this out on my own. Might email you some more questions to not pollute this PR discussion. Thanks also for the pointer on benchmarking.
I purged my build directory + recompiled and performance is a lot better, and I no longer see the weird Now F16 gets almost 4x faster at 4.59 generation t/s, and IQ2_KT now beats F16 at 4.83 generation t/s for me.
I did speed up
Overall it looks good to me, so we can think about merging. But there is also PR #435, where I have completely refactored
Terrific, this gets my test machine to 5.59 t/s. I saw the LCG ops in next8 taking up lots of time but wasn't sure what to do about it, this is a cool trick - I assume having the constants as locals keeps them in registers or otherwise ensures they remain hot in cache?

Re: #435 - it looks not too difficult to me to reconcile my new kernels with the refactor. If you're done with your refactor already, you could merge your PR and then I can fix the resulting conflicts on this PR - maybe that's the cleanest way to do this? Since this branch is already conflicting with a file on main anyway. Otherwise happy to merge this first, then work on your branch.
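To illustrate the register-hoisting idea being discussed (with placeholder LCG constants, not the actual 3INST parameters, and a made-up decode_row shape): loading the multiplier, increment and scaling factor into local __m256i/__m256 variables outside the hot loop encourages the compiler to keep them in ymm registers instead of re-materializing them on every next8-style step.

```cpp
#include <immintrin.h>
#include <cstdint>

// Expand one code per group into 8 floats, AVX2 style. The LCG constants and the
// mapping to floats are placeholders for the real trellis generator.
void decode_row(const uint32_t * codes, float * dst, int ngroups) {
    // constants hoisted into locals so they stay in registers across iterations
    const __m256i mul   = _mm256_set1_epi32(1664525);      // placeholder multiplier
    const __m256i add   = _mm256_set1_epi32(1013904223);   // placeholder increment
    const __m256i lane  = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    const __m256  scale = _mm256_set1_ps(1.0f/2147483648.0f);
    for (int g = 0; g < ngroups; ++g) {
        // derive 8 parallel LCG states from one code and advance them once
        __m256i state = _mm256_mullo_epi32(_mm256_set1_epi32((int)codes[g]), lane);
        state = _mm256_add_epi32(_mm256_mullo_epi32(state, mul), add);
        // map the 32-bit states to floats in roughly [-1, 1)
        _mm256_storeu_ps(dst + 8*g, _mm256_mul_ps(_mm256_cvtepi32_ps(state), scale));
    }
}
```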
As requested a while ago, this takes #113 and adds CPU implementations of the quantized matmuls (via iqk_mul_mat) for inference. AVX2 and F16C support are required.
As predicted, the CPU ops are very slow. For Llama-3.1-8B-Instruct, I get
4.83 t/s (originally 0.3 t/s) with IQ2_KT compared to 4.59 t/s (originally >1.0 t/s) with F16 on AMD EPYC 7R32 (32 cores). Note I am not a SIMD expert and have only spent moderate time on optimizations (e.g. basic use of AVX2/F16C, flattening of the trellis iterations), so it may be possible to speed things up. I also have not added implementations for HAVE_FANCY_SIMD. Additionally, there are only mulmats for F32 activations, as that is what the 3INST algorithm returns (as pointed out in the original PR description).

I am not sure of the PR practices - if you'd like me to merge into #113 rather than the main branch, happy to change. I also tried to clean up some of the comments / dead code in the WIP branch, but can revert those changes as well.
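For context on what "mulmats for F32 activations" means structurally, here is a much-simplified sketch (placeholder decoder and layout, not the actual iqk_mul_mat kernels): each block of 32 quantized weights is expanded group by group with the trellis generator, scaled by its block scale, and accumulated against F32 activations.

```cpp
#include <cstdint>

// Placeholder trellis decoder: expands one code into 8 floats via a generic LCG
// (not the real 3INST constants).
static void decode_group8(uint32_t code, float out[8]) {
    uint32_t state = code;
    for (int i = 0; i < 8; ++i) {
        state = state*1664525u + 1013904223u;
        out[i] = (int32_t)state * (1.0f/2147483648.0f);  // roughly [-1, 1)
    }
}

// One quantized row dotted with an F32 activation vector: blocks of 32 weights,
// 4 groups of 8 per block, one float scale per block (layout is illustrative).
float vec_dot_trellis_row(const uint32_t * codes, const float * block_scales,
                          const float * activations, int n /* multiple of 32 */) {
    float sum = 0, vals[8];
    for (int ib = 0; ib < n/32; ++ib) {
        for (int g = 0; g < 4; ++g) {
            decode_group8(codes[4*ib + g], vals);
            for (int i = 0; i < 8; ++i)
                sum += block_scales[ib]*vals[i]*activations[32*ib + 8*g + i];
        }
    }
    return sum;
}
```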