Description
Are there any runnable demos that combine 2:4 structured sparsity with quantization (sparse QAT or PTQ) to accelerate inference, for example applying PTQ to a 2:4 sparse LLaMA model? I am also curious what speedup ratio this could achieve.
The overall pipeline might be: prune the weight matrix to 2:4 sparsity and quantize it to INT8 through PTQ/QAT, and quantize the activation matrix to INT8 through PTQ/QAT as well. After this processing, the main computation would be INT8 × INT8 matrix multiplication.
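To make the pipeline I have in mind concrete, here is a minimal NumPy sketch of it. Everything in it is my own assumption for illustration: the helper names (`prune_2_4`, `quantize_int8`, `int8_linear`) are hypothetical, pruning is plain magnitude-based, and quantization is symmetric per-tensor with no calibration. It only emulates the numerics on CPU and gives no speedup by itself; an actual speedup would need kernels that use sparse tensor cores.

```python
import numpy as np


def prune_2_4(w):
    """Zero the 2 smallest-magnitude values in every group of 4 along each row."""
    rows, cols = w.shape
    assert cols % 4 == 0, "the input dimension must be a multiple of 4"
    groups = w.reshape(rows, cols // 4, 4)
    # Indices of the two smallest-magnitude entries in each group of four.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)


def quantize_int8(x):
    """Symmetric per-tensor quantization (an assumption, not a tuned PTQ scheme)."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def int8_linear(x_q, x_scale, w_q, w_scale):
    """INT8 x INT8 matmul with INT32 accumulation, then rescale to float."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return acc.astype(np.float32) * np.float32(x_scale * w_scale)


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)  # weights: (out, in)
x = rng.standard_normal((4, 16)).astype(np.float32)  # activations: (batch, in)

w_sparse = prune_2_4(w)                 # step 1: 2:4 sparsity on the weights
w_q, w_scale = quantize_int8(w_sparse)  # step 2: INT8 weights
x_q, x_scale = quantize_int8(x)         # step 3: INT8 activations
y_int8 = int8_linear(x_q, x_scale, w_q, w_scale)

y_ref = x @ w_sparse.T                  # float reference on the pruned weights
max_err = float(np.abs(y_int8 - y_ref).max())
print("max abs error vs. float reference:", max_err)
```

The sketch quantizes after pruning so that the zeros stay exactly zero in INT8 and the 2:4 pattern is preserved for a sparse kernel.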
I would like to know if there is a tutorial document available, as I am a beginner in the field of quantization.
Thx!