Legacy quants conversion schemes in convert_hf_to_gguf.py #449
Conversation
Why do we need the change in `convert_hf_to_gguf.py`?
Well, when I test a new finetune or merge of a big model that I can't run in 16 or even 8 bits, I like to make a simple q5_0 or even q4_0 conversion so I can test it in chat with full or quasi-full offload on my 64 GB of VRAM.
I think some other folks could use that too, especially the ability to convert and test a finetune or merge of a supported foundation model in a single shot, without bothering with the usual two-step approach (source GGUF first, then quant).
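For clarity, here is a minimal sketch of the two workflows being compared, assuming the stock `convert_hf_to_gguf.py` interface (model directory, `--outfile`, `--outtype`) and the `llama-quantize` tool; the `q5_0` outtype in the last call is the capability this PR adds, and the paths, file names and exact binary name are placeholders, not project documentation:

```python
# Hedged sketch: paths and the quantize binary name are placeholders;
# "--outtype q5_0" assumes the new legacy types are exposed through the
# existing --outtype flag, as this PR proposes.
import subprocess

model_dir = "models/my-finetune-70b"  # placeholder HF checkpoint directory

# Usual two-step route: make a full-precision GGUF first, then quantize it.
subprocess.run(["python", "convert_hf_to_gguf.py", model_dir,
                "--outfile", "my-finetune-f16.gguf", "--outtype", "f16"], check=True)
subprocess.run(["./llama-quantize", "my-finetune-f16.gguf",
                "my-finetune-q5_0.gguf", "q5_0"], check=True)

# Single-shot route this PR enables: convert straight to a legacy quant.
subprocess.run(["python", "convert_hf_to_gguf.py", model_dir,
                "--outfile", "my-finetune-q5_0.gguf", "--outtype", "q5_0"], check=True)
```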
Did you test that the conversion is working? I'm in the middle of something and don't feel like downloading a few models from HF to test. The described new model testing procedure saves one conversion to …
The Llama 3 class models are working, that's certain. Yes, the conversion is working in q4_0, q4_1, q5_0 and q5_1. I use q4_0 and q5_0 very often; I'm just sharing the code edits I made months ago. If you were inclined to implement q6_0 in the .py conversion during a day of schedule haze and benevolence, that would be even better, of course, because q8_0 is a bit overkill for such tests or iMatrix creation! ^^
I ran perplexity tests on my conversions (more than a hundred of them over the last quarter), and they are as expected. For example, for an L3 70b with a perplexity of 3.83 in FP16, I get around 3.91 on a converted q5_0 mix such as the one proposed here. The quality is fine on the fine models. A q4_0 with embeddings, output, K and V in q5_0 is still acceptable without an iMatrix as well, for testing and iMatrix purposes. Unless I'm thrilled with a model, I keep using some of those conversions as they are when I have enough VRAM, not even bothering with the whole imat/f16/quant process. When I come home tonight, I'll run some tests beyond the Llama 3 70b I've been converting extensively with this method.
I just checked the 4 conversion types on Llama 3 1B, and they are all coherent, giving me an average recipe for French fries when asked. The feature seems to work with the IK Llama GGUF conversion script as it is, for the models it can convert normally, without needing to update it with the subsequent mainline PRs.
This adds legacy-quant conversion schemes (`Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`) to `convert_hf_to_gguf.py`, notably in order to make smaller conversions to generate an iMatrix file. `Q4_0` and `Q4_1` here use embeddings, output, attn_k and attn_v in q5_0; `Q5_0` and `Q5_1` here use embeddings, output, attn_k and attn_v in q8_0. Adapted from the following llama.cpp mainline PR: ggml-org/llama.cpp#9022, original author @chentyjpm.
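To make the scheme concrete, here is a small standalone sketch of the per-tensor mapping described above (illustrative only: `LEGACY_SCHEMES` and `tensor_quant_type` are my own names, not the actual hooks in `convert_hf_to_gguf.py`):

```python
# Illustrative sketch of the mixed legacy schemes described above; tensor
# names follow the usual GGUF conventions (token_embd, output, attn_k, attn_v).

# base conversion type -> type used for token_embd, output, attn_k and attn_v
LEGACY_SCHEMES = {
    "q4_0": "q5_0",
    "q4_1": "q5_0",
    "q5_0": "q8_0",
    "q5_1": "q8_0",
}

def tensor_quant_type(ftype: str, name: str) -> str:
    """Return the quant type for one tensor under the mixed legacy scheme."""
    bumped = LEGACY_SCHEMES.get(ftype)
    if bumped is None:
        return ftype
    # Exact match for output.weight so blk.*.attn_output.* is not caught.
    if name in ("token_embd.weight", "output.weight"):
        return bumped
    if ".attn_k." in name or ".attn_v." in name:
        return bumped
    return ftype

if __name__ == "__main__":
    for name in ("token_embd.weight", "blk.0.attn_k.weight", "blk.0.ffn_down.weight"):
        print(f"{name:24s} -> {tensor_quant_type('q4_0', name)}")
    # token_embd.weight        -> q5_0
    # blk.0.attn_k.weight      -> q5_0
    # blk.0.ffn_down.weight    -> q4_0
```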
Reason: even in pure q4_0, an iMatrix is viable (much less than 0.01 ppl difference, compared to a q8_0 one, on the final quantization made with the created iMatrix).
These schemes are thus pertinent imho.
I personally use the q5_0 scheme to make my iMatrices for the L3 70b models, and the ppl difference on the final quantized model with iMatrix is less than 0.005 compared to an f16 iMatrix made by Bartowski or Mradermacher.
Also, 2 forgotten mentions of FTYPE IQ3_KL are added in the llama.cpp file, as well as one IQ5_KS mention in the mmvq_type_supported switch.