Upload GPTQ Quantized models in 4-bit precision format for different bin-sizes to Huggingface #2

Ayushk4 · 2023-03-19T15:56:49Z

MarkSchmidty · 2023-03-20T05:32:15Z

Per "Int-4 LLaMa is not enough - Int-3 and beyond", binning with a bin size of 128 appears to remove most of the remaining output quality loss of GPTQ for models larger than ~10B, while only negligibly affecting the memory requirement.
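To make the memory claim concrete, here is a rough sketch of the storage cost of binning. It assumes each bin stores one 16-bit scale and one 16-bit zero point alongside its weights; those widths and the function name are assumptions for illustration, not something specified in this thread.

```python
def effective_bits_per_weight(weight_bits: int, bin_size: int,
                              scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Average storage per weight when every bin of `bin_size` weights
    shares one scale and one zero point (the widths are assumptions)."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    # Per-bin metadata is amortized over all weights in the bin.
    return weight_bits + (scale_bits + zero_bits) / bin_size


for bits in (4, 3):
    total = effective_bits_per_weight(bits, 128)
    print(f"{bits}-bit weights, bin size 128: {total} bits per weight")
```

Under these assumptions a bin size of 128 adds 0.25 bits per weight, which is consistent with the overhead being described as negligible.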

GPTQ-for-LLaMa, one of the first GPTQ projects, is already moving towards 3-bit, with binning as the new default.

Given that memory bandwidth is the major bottleneck on CPU, fewer bits means faster inference. For sufficiently large models (~10B+), 3-bit GPTQ with binning may be the way to go.
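A back-of-the-envelope sketch of the bandwidth argument: if every generated token has to stream all weights from memory once, throughput is bounded by bandwidth divided by model size. The parameter count and bandwidth below are hypothetical, the 4.25 and 3.25 bits-per-weight figures assume 32 bits of per-bin metadata at bin size 128, and the estimate ignores compute, caches, and the cost of unpacking 3-bit values.

```python
def max_tokens_per_second(n_params: float, bits_per_weight: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s if each token reads every weight once."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / model_bytes


N_PARAMS = 13e9    # hypothetical model size
BANDWIDTH = 50e9   # hypothetical memory bandwidth, 50 GB/s

four_bit = max_tokens_per_second(N_PARAMS, 4.25, BANDWIDTH)
three_bit = max_tokens_per_second(N_PARAMS, 3.25, BANDWIDTH)
print(f"4-bit bound: {four_bit:.2f} tok/s, 3-bit bound: {three_bit:.2f} tok/s")
print(f"speedup bound: {three_bit / four_bit:.2f}x")
```

The ratio depends only on the bits per weight, so under this model the best case for moving from 4-bit to 3-bit is roughly a 1.3x speedup, whatever the hardware.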

Ayushk4 · 2023-03-20T15:10:10Z

Thanks for the suggestion, @MarkSchmidty. I am opening a separate issue (#12), as this will also require new C/C++ kernels to be added.

In short, the "Int-4 LLaMa is not enough" study assumed that only the weights were quantized, not the intermediate representations. We need to either add new kernels or run another study of the performance drop.
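For reference, a minimal sketch of what "weight-only" quantization means in a dot product: weights are stored as per-bin integer codes, while activations stay in full precision. It uses plain round-to-nearest, not the GPTQ algorithm, and all function names are hypothetical; it does not describe this project's kernels.

```python
def quantize_bin(weights, bits):
    """Round-to-nearest asymmetric quantization of one bin (not GPTQ)."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp so floating-point error can never push a code out of range.
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def quantize_row(row, bits, bin_size):
    """Split a weight row into bins and quantize each bin independently."""
    return [quantize_bin(row[i:i + bin_size], bits)
            for i in range(0, len(row), bin_size)]


def dot_weight_only(qrow, activations):
    """Dot product with quantized weights and full-precision activations."""
    total, offset = 0.0, 0
    for codes, scale, lo in qrow:
        for k, code in enumerate(codes):
            # Dequantize the weight on the fly; the activation is untouched.
            total += (code * scale + lo) * activations[offset + k]
        offset += len(codes)
    return total


row = [((i * 37) % 101) / 50 - 1 for i in range(256)]   # toy weights
acts = [((i * 53) % 97) / 48 - 1 for i in range(256)]   # toy activations
qrow = quantize_row(row, bits=4, bin_size=128)
exact = sum(w * a for w, a in zip(row, acts))
print(f"exact: {exact:.4f}, weight-only 4-bit: {dot_weight_only(qrow, acts):.4f}")
```

Quantizing the intermediate representations as well would mean replacing `activations` with integer codes, which introduces an additional error source that a weight-only study does not measure.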

Ayushk4 changed the title ~~GPTQ Quantize models and Upload to Huggingface~~ Quantize the models using GPTQ and Upload in 4-bit precision format to Huggingface Mar 19, 2023

Ayushk4 changed the title ~~Quantize the models using GPTQ and Upload in 4-bit precision format to Huggingface~~ Upload GPTQ Quantized models in 4-bit precision format for different bin-sizes to Huggingface Mar 20, 2023

Ayushk4 mentioned this issue Mar 20, 2023

Add C functions for MatMul over Int-3 Quant and Int-4 with different bin-sizes #12

Open
