Ayushk4 changed the title from "GPTQ Quantize models and Upload to Huggingface" to "Quantize the models using GPTQ and Upload in 4-bit precision format to Huggingface" on Mar 19, 2023
Per "Int-4 LLaMa is not enough - Int-3 and beyond", binning with a bin size of 128 appears to remove most of the remaining output-quality loss of GPTQ for models larger than ~10B, while only negligibly affecting the memory requirement.
GPTQ-for-LLaMa, one of the first GPTQ projects, is already moving towards 3-bit, with binning as the new default.
Given that memory bandwidth is the major bottleneck on CPU, fewer bits mean faster inference. For models that are large enough (~10B+), 3-bit GPTQ with binning may be the way to go. A rough sketch of what grouped quantization looks like is included below.
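For anyone unfamiliar with the binning idea, here is a minimal sketch of per-group round-to-nearest quantization with a bin size of 128. This is not the full GPTQ algorithm (which additionally uses Hessian-based error correction during rounding); it only illustrates how a per-bin scale plus low-bit integers shrink the memory footprint. Function and parameter names are hypothetical.

```c
#include <math.h>
#include <stdint.h>

/* Sketch: symmetric per-group quantization of one weight row.
 * nbits = 3 or 4, bin_size = 128 as discussed above.
 * Real GPTQ adds second-order error compensation, omitted here. */
void quantize_row_grouped(const float *w, int n, int nbits, int bin_size,
                          int8_t *q, float *scales) {
    const int qmax = (1 << (nbits - 1)) - 1;   /* 7 for 4-bit, 3 for 3-bit */
    for (int g = 0; g < n / bin_size; g++) {
        const float *wg = w + g * bin_size;
        /* per-bin scale from the largest magnitude in the bin */
        float amax = 0.0f;
        for (int i = 0; i < bin_size; i++) {
            float a = fabsf(wg[i]);
            if (a > amax) amax = a;
        }
        float scale = amax / qmax;
        scales[g] = scale;
        /* round each weight to the nearest representable level */
        for (int i = 0; i < bin_size; i++) {
            int v = (int)lroundf(scale != 0.0f ? wg[i] / scale : 0.0f);
            if (v >  qmax) v =  qmax;
            if (v < -qmax) v = -qmax;
            q[g * bin_size + i] = (int8_t)v;
        }
    }
}
```

With a bin size of 128 the per-bin scale adds only one float per 128 weights, which is why the memory overhead of binning is negligible compared to the savings from dropping to 3 or 4 bits.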
Ayushk4 changed the title from "Quantize the models using GPTQ and Upload in 4-bit precision format to Huggingface" to "Upload GPTQ Quantized models in 4-bit precision format for different bin-sizes to Huggingface" on Mar 20, 2023
Thanks for the suggestion @MarkSchmidty. I am opening a separate issue (#12), as this will also require new C/CPP kernels to be added.
In short, the "Int-4 LLaMa is not enough" study assumed that only the weights were quantized, not the intermediate representations. We need to either add new kernels or run another study of the performance drop.
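To make concrete what those new kernels would have to do, here is a hypothetical sketch of a dot product over grouped 4-bit weights packed two per byte, with one float scale per bin of 128. A 3-bit variant needs a different packing and unpacking scheme, which is why it cannot reuse the existing 4-bit path. All names here are illustrative, not part of the existing codebase.

```c
#include <stdint.h>

/* Sketch: dot product between a row of grouped 4-bit weights and a
 * float activation vector. qw holds two 4-bit weights per byte,
 * scales holds one float per bin of bin_size weights. */
float dot_q4_grouped(const uint8_t *qw, const float *scales,
                     const float *x, int n, int bin_size) {
    float acc = 0.0f;
    for (int g = 0; g < n / bin_size; g++) {
        float gacc = 0.0f;
        for (int i = 0; i < bin_size; i += 2) {
            uint8_t byte = qw[(g * bin_size + i) / 2];
            /* unpack two 4-bit weights stored with an offset of 8 (-8..7) */
            int w0 = (int)(byte & 0x0F) - 8;
            int w1 = (int)(byte >> 4)   - 8;
            gacc += w0 * x[g * bin_size + i] + w1 * x[g * bin_size + i + 1];
        }
        acc += scales[g] * gacc;   /* apply the per-bin scale once per group */
    }
    return acc;
}
```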
...