This repository provides a Sparse Attention API based on SageAttention V1, which can compute attention with any block-sparse pattern very fast.
Requirements:

- `python>=3.9`, `torch>=2.3.0`, `triton==3.0 / 3.1 / 3.2`
- For RTX 5090, please use `torch_nightly + triton 3.4.0+git5389ed79`.
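To confirm the environment matches these versions, a quick check (a convenience snippet of ours, not part of the official instructions) is:

```python
import torch, triton

# Print the versions the Triton kernels depend on, plus the detected GPU.
print("torch:", torch.__version__)
print("triton:", triton.__version__)
print("GPU:", torch.cuda.get_device_name(0))
```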
To use our Sparse_SageAttention, please install it first:

```bash
git clone https://github.com/jt-zhang/Sparse_SageAttention_API.git
cd Sparse_SageAttention_API
python setup.py install  # or pip install -e .
```
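As an optional sanity check of our own (not part of the official instructions), the install can be verified by importing the kernel entry point:

```python
# If installation succeeded, this import works and prints the function object.
from sparse_sageattn import sparse_sageattn
print(sparse_sageattn)
```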
Then call `sparse_sageattn` in your code:

```python
from sparse_sageattn import sparse_sageattn

sparse_attn_output = sparse_sageattn(
    q, k, v,
    mask_id=None,
    is_causal=False,
    tensor_layout="HND")
```
- `q, k, v` are in FP16/BF16, with the shape `(batch_size, head_num, seq_len, head_dim)` when using the default `tensor_layout`.
- `mask_id` is a block mask specifying whether the corresponding block of the attention map should be computed (`1` to compute, `0` to skip). The default is `None`, which performs full SageAttention V1. Currently, we only support a block size of (128, 64). For example, for a 512x512 attention map, we can set `mask_id` as:
```python
mask_id = torch.tensor([
    [0, 1, 0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 1, 0, 0]
], device='cuda')
```
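The shape of `mask_id` follows from the (128, 64) block size: one mask row per 128 query positions and one mask column per 64 key positions. A minimal sketch of this arithmetic (the helper name `block_mask_shape` is ours, not part of the API):

```python
import math

def block_mask_shape(q_len, kv_len, block_q=128, block_kv=64):
    # One mask row per 128 query positions, one mask column per 64 key positions.
    return math.ceil(q_len / block_q), math.ceil(kv_len / block_kv)

print(block_mask_shape(512, 512))  # (4, 8) -> the 4x8 mask_id above
```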
- `is_causal` specifies whether the attention is causal; the default is `False`.
- `tensor_layout` specifies the layout of `q, k, v` (an end-to-end sketch follows this list):
  - for `q, k, v` with shape `(batch_size, head_num, seq_len, head_dim)`, it should be `"HND"`;
  - for `q, k, v` with shape `(batch_size, seq_len, head_num, head_dim)`, it should be `"NHD"`.
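Putting the arguments together, here is a minimal end-to-end sketch. The batch size, head count, sequence length of 512, and head dimension of 64 are illustrative assumptions rather than API requirements, and the 2D `mask_id` reuses the example above; if the kernel expects a per-batch/per-head mask, expand it accordingly.

```python
import torch
from sparse_sageattn import sparse_sageattn

# Illustrative sizes (assumptions): 512 tokens -> a 4x8 block mask with (128, 64) blocks.
batch_size, head_num, seq_len, head_dim = 1, 8, 512, 64
q = torch.randn(batch_size, head_num, seq_len, head_dim, dtype=torch.float16, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)

mask_id = torch.tensor([
    [0, 1, 0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 1, 0, 0]
], device='cuda')

# "HND" matches the (batch_size, head_num, seq_len, head_dim) tensors above.
out = sparse_sageattn(q, k, v, mask_id=mask_id, is_causal=False, tensor_layout="HND")
print(out.shape)  # torch.Size([1, 8, 512, 64])

# For tensors stored as (batch_size, seq_len, head_num, head_dim), pass tensor_layout="NHD",
# e.g. q_nhd = q.transpose(1, 2).contiguous().
```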
We compare the TOPS of Sparse_SageAttention, FlexAttention, and FlashAttention2 under different sparsity levels on RTX 4090 and RTX 5090.
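As a rough way to observe the effect of sparsity locally, one can time `sparse_sageattn` with random block masks of varying density. This is only an illustrative sketch, not the benchmark setup behind the comparison above; the sizes, densities, and iteration counts are arbitrary choices.

```python
import torch
from sparse_sageattn import sparse_sageattn

batch, heads, seq_len, head_dim = 1, 16, 4096, 64  # arbitrary illustrative sizes
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)

rows, cols = seq_len // 128, seq_len // 64  # block-mask grid for (128, 64) blocks
for density in (1.0, 0.5, 0.25, 0.125):
    # Random block mask keeping roughly `density` of the blocks.
    mask_id = (torch.rand(rows, cols, device='cuda') < density).long()
    for _ in range(3):  # warm-up
        sparse_sageattn(q, k, v, mask_id=mask_id, is_causal=False, tensor_layout="HND")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(20):
        sparse_sageattn(q, k, v, mask_id=mask_id, is_causal=False, tensor_layout="HND")
    end.record()
    torch.cuda.synchronize()
    print(f"density={density:.3f}: {start.elapsed_time(end) / 20:.3f} ms/call")
```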
If you use this code or find our work valuable, please cite:
```bibtex
@inproceedings{zhang2025spargeattn,
  title={SpargeAttn: Accurate sparse attention accelerating any model inference},
  author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}

@inproceedings{zhang2025sageattention,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

@inproceedings{zhang2024sageattention2,
  title={SageAttention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},
  author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}

@article{zhang2025sageattention3,
  title={SageAttention3: Microscaling FP4 attention for inference and an exploration of 8-bit training},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},
  journal={arXiv preprint arXiv:2505.11594},
  year={2025}
}
```