perplexity mismatch with GPTQ #1877
Slightly different algorithms, and probably also slightly different preparation of the dataset.
A while ago I made a perplexity calculation that 100% replicated llama.cpp's method, but for use with GPTQ and pytorch models. You can find that code here: AutoGPTQ/AutoGPTQ#70. I planned to do a perplexity comparison project with it, comparing permutations of GPTQ with llama.cpp quant formats, but I still haven't finished it. The code is still available, though.
One thing I found was that with wikitext, I had to slightly manipulate the dataset in order to get results that matched llama.cpp's. That's because llama.cpp loads the text straight into memory, with no processing, whereas in Python, loading the dataset through the Hugging Face datasets library doesn't give you exactly the same raw text. In these lines of code https://github.com/PanQiWei/AutoGPTQ/pull/70/files#diff-9724e9bf653714443b2205985e5412245d4472dc7e5367ce1531568104b663adR21-R23 I adjust the output of the wikitext dataset so that the text exactly matches the text that llama.cpp loads in its perplexity tool.
I didn't check if I needed to do the same with C4, as I primarily tested with wikitext. But if you run the above code with wikitext it will give you an apples-to-apples comparison with llama.cpp.
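For anyone who wants to see the shape of that adjustment, here is a minimal sketch (not the exact code from the linked PR): load wikitext-2 with the Hugging Face datasets library and rejoin the rows into a single string so the tokenized input matches the raw file llama.cpp reads. The join/whitespace handling below is an assumption and may need tweaking to byte-match wiki.test.raw.

```python
# Hedged sketch (not the code from the linked PR): rebuild the raw wikitext-2
# test text from the Hugging Face dataset so it matches what llama.cpp reads
# straight from wiki.test.raw.
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# The HF dataset splits the file into one row per line, while llama.cpp sees
# the whole file as a single blob. Rejoin the rows here. Each row in the raw
# config usually keeps its trailing newline, so plain concatenation is
# assumed; adjust if your dataset version handles whitespace differently.
raw_text = "".join(test["text"])

# raw_text can now be tokenized and scored the same way llama.cpp's
# perplexity tool scores the raw file.
print(len(raw_text), "characters")
```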
It's a bit unfortunate that our perplexity isn't computed the same way as the one everyone else reports.
It would make sense to switch to a "standard", whatever we call it, as people are kinda using that one. There is also some discussion around the MMLU benchmark here: https://twitter.com/Tim_Dettmers/status/1666913630523367429
Thanks for the insights.
Yeah:
- Llama 7B 4bit 128g no act-order: 6.3850
- Llama 7B 4bit 128g with act-order: 6.0653
- Llama 13B 4bit 128g no act-order: 5.3370
- Llama 13B 4bit 128g with act-order: 5.3319
Note that although the group_size + act-order case gives an improved perplexity, this config is not actually used by most people. The reason is that with most current GPTQ implementations, using group_size and act-order together will significantly lower performance. So 128g without act-order is what most users are actually using when they use a 7B or 13B GPTQ.
Spreadsheet of everything I analysed is here: https://docs.google.com/spreadsheets/d/1ugN8EGlT-7rSYMBAD4dcq6TCtuL_XS1gSuOhkNA7abs/edit?usp=sharing
It's not finished. I also did quantisations for 30B and 65B, and many other GPTQ permutations like 3bit, different damp_percents (an advanced GPTQ parameter), and more. Then I got busy with other things and never finished it off! I really should.
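For reference, here is a rough sketch of how those permutations map onto AutoGPTQ's quantization parameters, based on its documented API. The model path, output directory, and calibration text are placeholders, not values from this thread, and a real run would use a proper calibration set.

```python
# Hedged sketch: the AutoGPTQ knobs behind the configurations compared above.
# Model id, output path and calibration text are illustrative placeholders.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit GPTQ
    group_size=128,     # the "128g" in the results above
    desc_act=True,      # act-order; set False for the "no act-order" rows
    damp_percent=0.01,  # the advanced GPTQ parameter mentioned above
)

# A real run would pass a few hundred calibration samples (e.g. from C4).
calibration = [tokenizer("Example calibration text for GPTQ.", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(calibration)
model.save_quantized("llama-7b-4bit-128g-actorder")
```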
This is great info. I didn't manage to run your code successfully. But from your results, it shows the ggml_q4_0 model PPL is on par with the GPTQ 128g act_order version.
Plus, with the latest CUDA optimization work in llama.cpp, the speed is also on par with GPTQ or even faster on my RTX A6000 (18 ms/token ggml vs 21 ms/token GPTQ).
@TheBloke or anyone else: Do you know what the reordering (act-order) in GPTQ actually does?
The reordering in GPTQ refers to the fact that they reorder the weights in order of least quantization error, to avoid performance degradation. It seems to be mostly a heuristic found by experiment. I also implemented llama.cpp's perplexity calculation in a PR in the AutoGPTQ repository so that we can compare more officially, but it seems that the maintainer has gone inactive.
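For anyone who wants to reproduce that kind of comparison without the PR, below is a simplified sketch of a llama.cpp-style perplexity loop run against a Hugging Face model: fixed-size windows, scoring only the second half of each window so every scored token has context. The model id, context length, and file path are placeholders, and the details are simplified relative to both actual implementations.

```python
# Simplified sketch of a llama.cpp-style perplexity loop over a Hugging Face
# model. Model id, context length and file path are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Raw text, loaded the same way llama.cpp does (or rejoined as in the earlier sketch).
raw_text = open("wiki.test.raw", encoding="utf-8").read()
tokens = tokenizer(raw_text, return_tensors="pt").input_ids[0]

n_ctx = 512        # llama.cpp's default perplexity window
half = n_ctx // 2  # only score the second half of each window
nll, count = 0.0, 0

for start in range(0, tokens.numel() - n_ctx, n_ctx):
    window = tokens[start : start + n_ctx].unsqueeze(0).to(model.device)
    with torch.no_grad():
        logits = model(window).logits.float()
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts window[1:]
    targets = window[0, 1:]
    # Keep only the predictions for tokens in the second half of the window.
    picked = logprobs[half - 1 :].gather(1, targets[half - 1 :, None])
    nll -= picked.sum().item()
    count += picked.numel()

print(f"perplexity: {math.exp(nll / count):.4f}")
```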
An alternative way to compare llama.cpp/AutoGPTQ perplexities would be to create a "llamacpp_HF" wrapper that would turn llama.cpp into a transformers model, allowing e.g. this code to be used for the evaluation. A similar wrapper was done by @Larryvrh for ExLlama here. I briefly tried the same for llama.cpp but had no luck.
The data that I could find was: in the first table, we see the following for llama-13b:
In the second table, we see that the +ppl for q4_1 is 0.1065, and 0.0459 for q4_K_M. Based on this, the table would expand to:
So llama.cpp would perform better than AutoGPTQ. It would be interesting to have this data for all possible quantizations and sizes, including in particular llama-65b with q3_K_M and q4_K_M quantizations, since those seem to be the state of the art for llama-65b inference on consumer GPUs.
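On the wrapper idea, here is a very rough sketch of what a "llamacpp_HF"-style shim could look like, using the llama-cpp-python bindings and assuming their logits_all / eval() / scores interface. The wrapper class and its method names are illustrative, not an existing API.

```python
# Rough sketch of a "llamacpp_HF"-style shim: expose per-token logits from a
# llama.cpp model so the same evaluation loop can score GGML and GPTQ models.
# Assumes llama-cpp-python's logits_all / eval() / scores interface; the
# wrapper class and its method names are illustrative, not an existing API.
import numpy as np
import torch
from llama_cpp import Llama


class LlamaCppLogits:
    """Minimal logits-only wrapper around a llama.cpp (GGML) model."""

    def __init__(self, model_path: str, n_ctx: int = 512):
        self.llm = Llama(model_path=model_path, n_ctx=n_ctx, logits_all=True)

    def tokenize(self, text: str) -> list[int]:
        # The evaluated logits are indexed by llama.cpp's own token ids,
        # so tokenize with it rather than with an HF tokenizer.
        return self.llm.tokenize(text.encode("utf-8"))

    def __call__(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: [1, seq_len] tensor of llama.cpp token ids.
        self.llm.reset()
        self.llm.eval(input_ids[0].tolist())
        # scores holds one row of vocab logits per evaluated position.
        logits = np.asarray(self.llm.scores)[: input_ids.shape[1]]
        return torch.from_numpy(logits.copy()).unsqueeze(0)  # [1, seq_len, n_vocab]
```

A shim like this could then be dropped into the same window-based perplexity loop sketched earlier, so both backends would be scored by identical evaluation code.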
This issue was closed because it has been inactive for 14 days since being marked as stale.
GPTQ is reporting 5.68 PPL on wikitext2 for FP16 baseline. Yet llama.cpp reports 5.9.
What's the mismatch?