2:4 sparsity + PTQ(int8) model's inference

Are there any runnable demos that use Sparse-QAT/PTQ (2:4) to accelerate inference, for example applying PTQ to a 2:4 sparse LLaMA? I am curious what speedup ratio this could achieve.
The overall pipeline might be: compress the weight matrix with 2:4 sparsity and quantize it to INT8 through PTQ/QAT, and quantize the activation matrix to INT8 through PTQ/QAT as well. After this processing, the main computation would be INT8 × INT8 matrix multiplication.
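
To make the pipeline above concrete, here is a minimal pure-Python sketch of its numerics on a single weight row. It assumes magnitude-based 2:4 pruning and symmetric per-tensor INT8 quantization, which are only one possible choice, and the helper names (`prune_2_4`, `quantize_int8`, `int8_dot`) are hypothetical, not taken from any library. It does not demonstrate any speedup: that would depend on hardware kernels that exploit the 2:4 pattern.

```python
def prune_2_4(weights):
    """Zero the 2 smallest-magnitude values in every group of 4 consecutive weights."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def quantize_int8(values):
    """Symmetric per-tensor quantization (an assumption): returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(act_codes, act_scale, w_codes, w_scale):
    """Integer multiply-accumulate, followed by a single rescale back to float."""
    acc = sum(a * w for a, w in zip(act_codes, w_codes))
    return acc * act_scale * w_scale


# Illustrative values only, not taken from any real model.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
activations = [1.0, 2.0, -1.5, -0.5, 0.25, 0.75, -1.25, 2.0]

sparse_w = prune_2_4(weights)                   # 2:4 sparse weights
w_codes, w_scale = quantize_int8(sparse_w)      # INT8 weights
a_codes, a_scale = quantize_int8(activations)   # INT8 activations

approx = int8_dot(a_codes, a_scale, w_codes, w_scale)
exact = sum(a * w for a, w in zip(activations, sparse_w))
print("sparse weights:", sparse_w)
print("INT8 result:", approx, "float result:", exact)
```

Comparing `approx` with `exact` shows the quantization error for the already-pruned weights. The accuracy lost to pruning itself would have to be measured separately against the dense weights.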
I would also like to know whether a tutorial is available, as I am a beginner in the field of quantization.
Thanks!

2:4 sparsity + PTQ(int8) model's inference #134

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

2:4 sparsity + PTQ(int8) model's inference #134

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions