Minor (~2%) iq2_ks TG performance improvement on CUDA #468

ikawrakow · 2025-05-28T10:18:37Z

No description provided.

* iq2_kt: Metal dequantize
* iq2_kt: Metal GEMV
  Performance is actually quite decent: 52 t/s on my M2-Max for LlaMA-3.1-8B
* iq3_kt: Metal dequantize
* iq3_kt: Metal GEMV
  Performance is not as good as iq2_kt: 40 t/s on my M2-Max for LlaMA-3.1-8B. Flipping signs is a costly affair.
* iq4_kt: Metal dequantize - getting NaNs
* iq4_kt: Metal GEMV - also not working
* iq4_kt: Metal still not working
* Disable iq4_kt on Metal for now

---------

Trellis quants: faster CPU prompt processing (ikawrakow#482)

* Experimenting with dequant + f32 GEMM
  For iq4_kt this results in a massive PP improvement from PP512 = ~42 t/s to PP512 = 128 t/s.
* Experimenting with dequant + f32 GEMM
  iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
  iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s
* Experimenting with dequant + f16 GEMM on NEON
  iq2_kt: PP512 = 79 t/s from 42 t/s
  iq3_kt: PP512 = 81 t/s from 35 t/s
  Also, found the reason why the f16 implementation for iq4_kt was not working: it overflows. It works after multiplying with the row scale before doing the multiply-adds.
* Experimenting with dequant + f16 GEMM on NEON
  iq4_kt: PP512 = 86 t/s from 29 t/s
* Minor

---------

Minor (~2%) iq2_ks TG performance improvement on CUDA (ikawrakow#468)

Direct conversion from fp16 to Q6_0
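
The f16 overflow noted for iq4_kt above can be illustrated with a small emulation. This is only a sketch, not the NEON code from the commit: it uses Python's `struct` format `e` to round values to IEEE half precision, and the helper names (`to_f16`, `dot_f16`), the weights, and the row scale are all hypothetical values chosen so that the unscaled partial sums exceed the f16 maximum of 65504.

```python
import struct

def to_f16(x: float) -> float:
    """Round x to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack finite values beyond the f16 range
        return float("inf") if x > 0 else float("-inf")

def dot_f16(a, b):
    """Dot product with every product and partial sum rounded to f16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f16(acc + to_f16(x * y))
    return acc

# Hypothetical data, NOT actual iq4_kt values: large unscaled weights
# and a small per-row scale. The exact result is 32 * 4096 / 1024 = 128.
row_scale = 1.0 / 1024.0
raw_weights = [4096.0] * 32
activations = [1.0] * 32

# Scale applied after the multiply-adds: the f16 accumulator overflows first.
late = to_f16(dot_f16(raw_weights, activations) * row_scale)

# Scale applied before the multiply-adds: every intermediate stays in range.
early = dot_f16([to_f16(w * row_scale) for w in raw_weights], activations)

print(late, early)
```

With these made-up numbers the late-scaled accumulator reaches infinity once the running sum passes the f16 range, while the early-scaled version stays finite. That matches the behaviour described in the commit note, although the real kernel's value ranges may differ.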

Minor (~2%) iq2_ks TG performance improvement on CUDA

9b97acd

ikawrakow merged commit 7a8abe2 into main Jun 1, 2025
