
Trellis quants with CPU inference #441


Merged: 73 commits into ikawrakow:main on May 23, 2025

Conversation

@andrewkchan (Contributor) commented May 20, 2025

As requested a while ago, this takes #113 and adds CPU implementations of the quantized matmuls (via iqk_mul_mat) for inference. AVX2 and F16C support are required.

As predicted, the CPU ops are very slow. For Llama-3.1-8B-Instruct, I get ~~0.3~~ 4.83 t/s with IQ2_KT compared to ~~>1.0~~ 4.59 t/s with F16 on AMD EPYC 7R32 (32 cores). Note I am not a SIMD expert and have only spent moderate time on optimizations (e.g. basic use of AVX2/F16C, flattening of the trellis iterations), so it may be possible to speed things up. I also have not added implementations for HAVE_FANCY_SIMD. Additionally, there are only mulmats for F32 activations, as that is what the 3INST algorithm returns (as pointed out in the original PR description).

I am not sure of the PR practices - if you'd like me to merge into #113 rather than the main branch, happy to change. I also tried to clean up some of the comments / dead code in the WIP branch, but can revert those changes as well.
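For context on what the CPU path has to do per weight, here is a rough scalar sketch of a 3INST-style decode. This is not the code from this PR: the constants and the exact mask/bias trick are placeholders based on my reading of the QTIP scheme, not necessarily what the iq2_kt kernels use.

```cpp
#include <cstdint>
#include <immintrin.h>  // F16C: _cvtsh_ss (compile with -mf16c)

// Placeholder constants: the real multiplier/increment/mask/bias live in the
// iq2_kt implementation; these only illustrate the structure of the decode.
static constexpr uint32_t kMul  = 89226354u;   // LCG multiplier (illustrative)
static constexpr uint32_t kAdd  = 64248484u;   // LCG increment  (illustrative)
static constexpr uint32_t kMask = 0x8fff8fffu; // keep sign+mantissa of both fp16 halves
static constexpr uint32_t kBias = 0x3b603b60u; // fixed exponent bits for both halves

// One trellis step: advance the LCG, reinterpret the state as two fp16 values,
// and sum them to get one f32 weight. Every weight has to be regenerated this
// way at inference time, which is why the CPU mulmats are compute-bound.
static inline float trellis_next(uint32_t &state) {
    state = state * kMul + kAdd;
    uint32_t v = (state & kMask) ^ kBias;
    float lo = _cvtsh_ss((uint16_t)(v & 0xffff));
    float hi = _cvtsh_ss((uint16_t)(v >> 16));
    return lo + hi;
}
```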

Notes from the commit history:

Using 12 bits per 8 weights I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By AVX2-SIMDifying the search for the best code, LLaMA-3.1-8B gets quantized in 130 seconds on the Ryzen-7950X CPU - sluggish but still acceptable. rmse increases by just 3%, so this is beating iq2_xxs in terms of rmse at the same 2.0625 bpw.
I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so it does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).
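Spelling that out: the quantity being compared is a weighted rmse, roughly $\sqrt{\sum_i w_i (x_i - q_i)^2 / \sum_i w_i}$, with $w_i = \sigma^2/4 + x_i^2$ for iq2_xxs and $w_i = 1$ for the initial trellis search (my reading of the note above, not a quote from the code).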
… so we can run perplexity calcs. As already indicated by rmse, the 2-bit trellis approach is quite a bit worse than iq2_xxs.
With blocks of 32 and 16 bits per group of 8, the brute force search becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA after SIMDifying with AVX2). The trick is to group the points in clusters, find the nearest cluster, and only search within the cluster.
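A rough sketch of that clustered search, with a data layout and names that are mine (illustrative only, not the repository's structures):

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative only: cluster the codebook points once, then for each group of 8
// weights pick the nearest cluster centroid and brute-force only its members.
struct ClusteredCodebook {
    int ngroup = 8;                          // values per codebook point (group of 8)
    std::vector<float> centroids;            // nclusters x ngroup
    std::vector<std::vector<int>> members;   // point indices per cluster (assumed non-empty)
    std::vector<float> points;               // npoints x ngroup, decoded trellis values

    static float dist2(const float *a, const float *b, int n) {
        float d = 0;
        for (int i = 0; i < n; ++i) { float t = a[i] - b[i]; d += t * t; }
        return d;
    }

    // Best codebook point for one group of 8 values x.
    int find_best(const float *x) const {
        // 1) nearest cluster centroid
        size_t best_c = 0;
        float best_d = std::numeric_limits<float>::max();
        for (size_t c = 0; c < members.size(); ++c) {
            float d = dist2(x, centroids.data() + c * ngroup, ngroup);
            if (d < best_d) { best_d = d; best_c = c; }
        }
        // 2) exhaustive search, but only inside that cluster
        int best_p = members[best_c][0];
        best_d = std::numeric_limits<float>::max();
        for (int idx : members[best_c]) {
            float d = dist2(x, points.data() + (size_t)idx * ngroup, ngroup);
            if (d < best_d) { best_d = d; best_p = idx; }
        }
        return best_p;
    }
};
```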
Using blocks of 32 and 16 bits per group of 8 weights, it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.
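For the bit accounting (the 4-bit block scale here is my assumption, not something stated above): 16 bits per group of 8 is 2.0 bpw, and a 4-bit scale per block of 32 adds 4/32 = 0.125 bpw, giving 2.125 bpw, i.e. 0.0625 bpw more than iq2_xxs at 2.0625 bpw; with 15 bits per group it is 1.875 + 0.125 = 2.0 bpw, i.e. 0.0625 bpw less.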
Re-quantize after determining block scales (at the expense of much longer quantization time).

Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU.

We arrive at 112 t/s.

We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we are at around ~180 t/s on their GPU, so almost matching their performance.

We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16.

3.125 bpw. So far it does not look good on the PPL vs bpw plot.
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892.

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
Nearly 60% improvement of quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.

Same trick as the last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
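A sketch of that locality change (again with made-up names, purely illustrative): each cluster keeps its own contiguous copy of the decoded points, so the inner search streams through memory sequentially instead of gathering through an index table.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative only: per-cluster contiguous storage, filled once at init.
struct ClusterLocal {
    int ngroup = 8;
    std::vector<float> pts;  // nmembers x ngroup, copied at initialization
    std::vector<int>   ids;  // original codebook index of each member
};

// Inner loop over one cluster: sequential reads, no indirection per candidate.
static int scan_cluster(const ClusterLocal &c, const float *x) {
    int best = 0;
    float best_d = std::numeric_limits<float>::max();
    const size_t nmembers = c.ids.size();
    for (size_t m = 0; m < nmembers; ++m) {
        const float *p = c.pts.data() + m * (size_t)c.ngroup;
        float d = 0;
        for (int i = 0; i < c.ngroup; ++i) { float t = x[i] - p[i]; d += t * t; }
        if (d < best_d) { best_d = d; best = c.ids[m]; }
    }
    return best;
}
```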
@ikawrakow (Owner) commented:

> For Llama-3.1-8B-Instruct, I get 0.3 t/s with IQ2_KT compared to >1.0 t/s with F16 on AMD EPYC 7R32 (32 cores)

Is this in debug mode? I'm getting 10.4 t/s for IQ2_KT on my 16-core Ryzen-7950X CPU, which (as expected) is slow for a 2-bit quantized 8B model, but still in the acceptable range.

@andrewkchan (Contributor, Author) commented:

I'm compiling with `cmake --build ./build --config Release -j $(nproc)`. I might need to tweak the number of threads; I've found in the past that this greatly impacts llama.cpp performance on my test machine.

Here's how I'm testing:

```
alias ik-build='cmake --build ./build --config Release -j $(nproc)'
ik-build && ./build/bin/llama-cli -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-IQ2_KT-2.gguf -cnv -p "You are a helpful assistant" -ngl 0 -c 4096
```

<prompt with something like "1+1=" then CTRL+C after several tokens are generated to get the numbers>

Should I be using llama-bench or some other tool?

@ikawrakow (Owner) commented:

I also tried llama-cli to make sure the output is coherent, and I also get in the range of 10 t/s. To measure performance I now tend to use llama-sweep-bench. For instance, the table below was generated using

```
./bin/llama-sweep-bench -m iq2kt.bin -c 2560 -t 16 -fa -ctk q8_0 -ctv q8_0
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 11.436 | 44.77 | 12.278 | 10.42 |
| 512 | 128 | 512 | 10.743 | 47.66 | 12.782 | 10.01 |
| 512 | 128 | 1024 | 10.639 | 48.13 | 13.189 | 9.70 |
| 512 | 128 | 1536 | 11.668 | 43.88 | 13.185 | 9.71 |
| 512 | 128 | 2048 | 10.462 | 48.94 | 13.310 | 9.62 |

We get PP and TG performance as a function of the number of tokens in the KV cache N_KV.

@andrewkchan (Contributor, Author) commented:

Ok, well it's great to know the CPU inference performance is not totally unusable and that it's probably just my setup! I will try to figure this out on my own. Might email you some more questions to not pollute this PR discussion. Thanks also for the pointer on benchmarking.

@andrewkchan (Contributor, Author) commented:

I purged my build directory and recompiled, and performance is a lot better; I also no longer see the weird `ggml_backend_sched_alloc_splits: failed to allocate graph` messages (ggml-org/llama.cpp#8088). Possibly the build cache was using some artifacts from a previous debug build.

F16 is now almost 4x faster at 4.59 generation t/s, and IQ2_KT now beats F16 at 4.83 generation t/s for me.

@ikawrakow (Owner) commented:

I did speed up IQ2_KT slightly; see this branch. Here is what I get now on the Ryzen-7950X:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 8.176 | 62.62 | 10.268 | 12.47 |
| 512 | 128 | 512 | 8.312 | 61.60 | 10.476 | 12.22 |
| 512 | 128 | 1024 | 8.826 | 58.01 | 10.625 | 12.05 |
| 512 | 128 | 1536 | 8.453 | 60.57 | 10.704 | 11.96 |
| 512 | 128 | 2048 | 8.488 | 60.32 | 10.798 | 11.85 |

Overall it looks good to me, so we can think about merging. But there is also PR #435, where I have completely refactored iqk_mul_mat.cpp. Do you want to look into adapting this PR to the changes on that branch?

@andrewkchan (Contributor, Author) commented May 22, 2025

Terrific, this gets my test machine to 5.59 t/s. I saw the LCG ops in next8 taking up lots of time but wasn't sure what to do about it; this is a cool trick - I assume having the constants as locals keeps them in registers, or otherwise ensures they remain hot in cache?
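For reference, a rough AVX2 sketch of what "constants as locals" might look like (this is not the actual next8 from the branch; the constants, lane layout, and fp16 trick are placeholders of my own):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch only: mul/add/mask/bias are plain local __m256i values, so the
// compiler can keep them in ymm registers across the hot loop instead of
// re-materializing them for every group of 8 weights.
static inline __m256 next8_sketch(__m256i &state, __m256i mul, __m256i add,
                                  __m256i mask, __m256i bias) {
    // 8 LCG steps in parallel
    state = _mm256_add_epi32(_mm256_mullo_epi32(state, mul), add);
    __m256i v = _mm256_xor_si256(_mm256_and_si256(state, mask), bias);
    // split each 32-bit lane into its two fp16 halves
    __m256i lo = _mm256_and_si256(v, _mm256_set1_epi32(0xffff));
    __m256i hi = _mm256_srli_epi32(v, 16);
    // narrow 32-bit lanes to 16-bit, then use F16C to convert 8 halves -> 8 floats
    __m128i lo16 = _mm_packus_epi32(_mm256_castsi256_si128(lo), _mm256_extracti128_si256(lo, 1));
    __m128i hi16 = _mm_packus_epi32(_mm256_castsi256_si128(hi), _mm256_extracti128_si256(hi, 1));
    return _mm256_add_ps(_mm256_cvtph_ps(lo16), _mm256_cvtph_ps(hi16));
}

// Usage pattern: hoist the constants once before the tight loop (values below
// are placeholders, not the real LCG parameters):
// __m256i mul = _mm256_set1_epi32(0x1234567), add = _mm256_set1_epi32(0x89ABCD);
// for (...) { __m256 w = next8_sketch(state, mul, add, mask, bias); /* ... */ }
```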

Re: #435 - it doesn't look too difficult to reconcile my new kernels with the refactor. If you're done with your refactor already, you could merge your PR and then I can fix the resulting conflicts on this PR - maybe that's the cleanest way to do this, since this branch already conflicts with a file on main anyway? Otherwise I'm happy to merge this first, then work on your branch.

@andrewkchan marked this pull request as ready for review May 22, 2025
@andrewkchan changed the title from "Trellis quants with CPU implementations" to "Trellis quants with CPU inference" May 23, 2025
@ikawrakow merged commit a1c931c into ikawrakow:main May 23, 2025
@andrewkchan mentioned this pull request May 26, 2025
@ikawrakow mentioned this pull request Jun 1, 2025