perplexity mismatch with GPTQ #1877
Slightly different algorithms, and probably also slightly different preparation of the dataset.
A while ago I made a perplexity calculation that 100% replicated llama.cpp's method, but for use with GPTQ and pytorch models. You can find that code here: AutoGPTQ/AutoGPTQ#70. I planned to do a perplexity comparison project with it, comparing permutations of GPTQ with llama.cpp quant formats, but I still haven't finished it. The code is still available, though.
One thing I found was that with wikitext, I had to slightly manipulate the dataset in order to get results that matched llama.cpp's. That's because llama.cpp loads the text straight into memory, with no processing, whereas in Python, loading the dataset through the Hugging Face datasets library doesn't give you exactly the same raw text. In these lines of code https://github.com/PanQiWei/AutoGPTQ/pull/70/files#diff-9724e9bf653714443b2205985e5412245d4472dc7e5367ce1531568104b663adR21-R23 I adjust the output of the wikitext dataset so that the text exactly matches the text that llama.cpp loads in its perplexity tool.
I didn't check if I needed to do the same with C4, as I primarily tested with wikitext. But if you run the above code with wikitext it will give you an apples-to-apples comparison with llama.cpp.
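For anyone who wants to see the shape of that adjustment, here is a minimal sketch (not the exact code from the linked PR): load wikitext-2 with the Hugging Face datasets library and rejoin the rows into a single string so the tokenized input matches the raw file llama.cpp reads. The join/whitespace handling below is an assumption and may need tweaking to byte-match wiki.test.raw.

```python
# Hedged sketch (not the code from the linked PR): rebuild the raw wikitext-2
# test text from the Hugging Face dataset so it matches what llama.cpp reads
# straight from wiki.test.raw.
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# The HF dataset splits the file into one row per line, while llama.cpp sees
# the whole file as a single blob. Rejoin the rows here. Each row in the raw
# config usually keeps its trailing newline, so plain concatenation is
# assumed; adjust if your dataset version handles whitespace differently.
raw_text = "".join(test["text"])

# raw_text can now be tokenized and scored the same way llama.cpp's
# perplexity tool scores the raw file.
print(len(raw_text), "characters")
```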
It's a bit unfortunate that our perplexity isn't computed the same way as the one everyone else reports.
It would make sense to switch to a "standard", whatever we call it, as people are kinda using that one. There is also some discussion around the MMLU benchmark here: https://twitter.com/Tim_Dettmers/status/1666913630523367429
Thanks for the insights.
Yeah:
- Llama 7B 4bit 128g no act-order: 6.3850
- Llama 7B 4bit 128g with act-order: 6.0653
- Llama 13B 4bit 128g no act-order: 5.3370
- Llama 13B 4bit 128g with act-order: 5.3319
Note that although the group_size + act-order case gives an improved perplexity, this config is not actually used by most people. The reason is that with most current GPTQ implementations, using group_size and act-order together will significantly lower performance. So 128g without act-order is what most users are actually using when they use a 7B or 13B GPTQ.
Spreadsheet of everything I analysed is here: https://docs.google.com/spreadsheets/d/1ugN8EGlT-7rSYMBAD4dcq6TCtuL_XS1gSuOhkNA7abs/edit?usp=sharing
It's not finished. I also did quantisations for 30B and 65B, and many other GPTQ permutations like 3bit, different damp_percents (an advanced GPTQ parameter), and more. Then I got busy with other things and never finished it off! I really should.
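For reference, here is a rough sketch of how those permutations map onto AutoGPTQ's quantization parameters, based on its documented API. The model path, output directory, and calibration text are placeholders, not values from this thread, and a real run would use a proper calibration set.

```python
# Hedged sketch: the AutoGPTQ knobs behind the configurations compared above.
# Model id, output path and calibration text are illustrative placeholders.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit GPTQ
    group_size=128,     # the "128g" in the results above
    desc_act=True,      # act-order; set False for the "no act-order" rows
    damp_percent=0.01,  # the advanced GPTQ parameter mentioned above
)

# A real run would pass a few hundred calibration samples (e.g. from C4).
calibration = [tokenizer("Example calibration text for GPTQ.", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(calibration)
model.save_quantized("llama-7b-4bit-128g-actorder")
```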
This is great info. I didn't manage to run your code successfully. But from your results, it shows the ggml_q4_0 model PPL is on par with the GPTQ 128g act_order version.
Plus, with the latest CUDA optimization work in llama.cpp, the speed is also on par with GPTQ or even faster on my RTX A6000 (18 ms/token ggml vs 21 ms/token GPTQ).
@TheBloke or anyone else: Do you know what the reordering (act-order) in GPTQ actually does?
The reordering in GPTQ refers to the fact that they reorder the weights in order of least quantization error, to avoid performance degradation. It seems to be mostly a heuristic found by experiment. I also implemented llama.cpp's perplexity calculation in a PR in the AutoGPTQ repository so that we can compare more officially, but it seems that the maintainer has gone inactive.
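For anyone who wants to reproduce that kind of comparison without the PR, below is a simplified sketch of a llama.cpp-style perplexity loop run against a Hugging Face model: fixed-size windows, scoring only the second half of each window so every scored token has context. The model id, context length, and file path are placeholders, and the details are simplified relative to both actual implementations.

```python
# Simplified sketch of a llama.cpp-style perplexity loop over a Hugging Face
# model. Model id, context length and file path are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Raw text, loaded the same way llama.cpp does (or rejoined as in the earlier sketch).
raw_text = open("wiki.test.raw", encoding="utf-8").read()
tokens = tokenizer(raw_text, return_tensors="pt").input_ids[0]

n_ctx = 512        # llama.cpp's default perplexity window
half = n_ctx // 2  # only score the second half of each window
nll, count = 0.0, 0

for start in range(0, tokens.numel() - n_ctx, n_ctx):
    window = tokens[start : start + n_ctx].unsqueeze(0).to(model.device)
    with torch.no_grad():
        logits = model(window).logits.float()
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts window[1:]
    targets = window[0, 1:]
    # Keep only the predictions for tokens in the second half of the window.
    picked = logprobs[half - 1 :].gather(1, targets[half - 1 :, None])
    nll -= picked.sum().item()
    count += picked.numel()

print(f"perplexity: {math.exp(nll / count):.4f}")
```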
An alternative way to compare llama.cpp/AutoGPTQ perplexities would be to create a "llamacpp_HF" wrapper that would turn llama.cpp into a transformers model, allowing e.g. this code to be used for the evaluation. A similar wrapper was done by @Larryvrh for ExLlama here. I briefly tried the same for llama.cpp but had no luck.
The data that I could find was: in the first table, we see the following for llama-13b:
In the second table, we see that the +ppl for q4_1 is 0.1065, and 0.0459 for q4_K_M. Based on this, the table would expand to:
So llama.cpp would perform better than AutoGPTQ. It would be interesting to have this data for all possible quantizations and sizes, including in particular llama-65b with q3_K_M and q4_K_M quantizations, since those seem to be the state of the art for llama-65b inference on consumer GPUs.
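On the wrapper idea, here is a very rough sketch of what a "llamacpp_HF"-style shim could look like, using the llama-cpp-python bindings and assuming their logits_all / eval() / scores interface. The wrapper class and its method names are illustrative, not an existing API.

```python
# Rough sketch of a "llamacpp_HF"-style shim: expose per-token logits from a
# llama.cpp model so the same evaluation loop can score GGML and GPTQ models.
# Assumes llama-cpp-python's logits_all / eval() / scores interface; the
# wrapper class and its method names are illustrative, not an existing API.
import numpy as np
import torch
from llama_cpp import Llama


class LlamaCppLogits:
    """Minimal logits-only wrapper around a llama.cpp (GGML) model."""

    def __init__(self, model_path: str, n_ctx: int = 512):
        self.llm = Llama(model_path=model_path, n_ctx=n_ctx, logits_all=True)

    def tokenize(self, text: str) -> list[int]:
        # The evaluated logits are indexed by llama.cpp's own token ids,
        # so tokenize with it rather than with an HF tokenizer.
        return self.llm.tokenize(text.encode("utf-8"))

    def __call__(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: [1, seq_len] tensor of llama.cpp token ids.
        self.llm.reset()
        self.llm.eval(input_ids[0].tolist())
        # scores holds one row of vocab logits per evaluated position.
        logits = np.asarray(self.llm.scores)[: input_ids.shape[1]]
        return torch.from_numpy(logits.copy()).unsqueeze(0)  # [1, seq_len, n_vocab]
```

A shim like this could then be dropped into the same window-based perplexity loop sketched earlier, so both backends would be scored by identical evaluation code.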
This issue was closed because it has been inactive for 14 days since being marked as stale.
GPTQ is reporting 5.68 PPL on wikitext2 for FP16 baseline. Yet llama.cpp reports 5.9.
What's the mismatch?