Hi,
Thanks to the great work of the AWQ authors, the TGI maintainers, and the open-source community, AWQ is now supported in TGI (link).
@TheBloke has released many AWQ-quantized models on Hugging Face; all of them can be run with TGI as follows:
```shell
text-generation-launcher \
    --model-id TheBloke/Llama-2-7b-Chat-AWQ \
    --trust-remote-code --port 8080 \
    --max-input-length 3072 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
    --quantize awq
```
Note that this PR uses the older GEMM kernels from AWQ (commit f084f40).
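Once the launcher above is serving, the model can be queried over TGI's HTTP API. A minimal sketch, assuming the server is on port 8080 and exposes TGI's standard `/generate` endpoint (the prompt and generation parameters here are made-up examples):

```shell
# Hypothetical request body for TGI's /generate endpoint; the prompt
# and parameter values are illustrative, not from the original issue.
PAYLOAD='{"inputs": "What is AWQ quantization?", "parameters": {"max_new_tokens": 64}}'
echo "$PAYLOAD"

# Send it to the running server with, for example:
#   curl http://localhost:8080/generate \
#        -X POST -H "Content-Type: application/json" -d "$PAYLOAD"
```

The response is a JSON object whose `generated_text` field holds the completion.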
Thanks!