This repository provides a Sparse Attention API based on SageAttention V1, which can compute attention with any block-sparse pattern very fast.
Requirements:

- `python>=3.9`, `torch>=2.3.0`, `triton==3.0 / 3.1 / 3.2`
- For RTX 5090, please use `torch_nightly + triton 3.4.0+git5389ed79`.
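To confirm the environment matches these versions, a quick check (a convenience snippet of ours, not part of the official instructions) is:

```python
import torch, triton

# Print the versions the Triton kernels depend on, plus the detected GPU.
print("torch:", torch.__version__)
print("triton:", triton.__version__)
print("GPU:", torch.cuda.get_device_name(0))
```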
To use our Sparse_SageAttention, please install it first:

```bash
git clone https://github.com/jt-zhang/Sparse_SageAttention_API.git
cd Sparse_SageAttention_API
python setup.py install  # or pip install -e .
```
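As an optional sanity check of our own (not part of the official instructions), the install can be verified by importing the kernel entry point:

```python
# If installation succeeded, this import works and prints the function object.
from sparse_sageattn import sparse_sageattn
print(sparse_sageattn)
```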
Then call `sparse_sageattn` in your code:

```python
from sparse_sageattn import sparse_sageattn

sparse_attn_output = sparse_sageattn(
    q, k, v,
    mask_id=None,
    is_causal=False,
    tensor_layout="HND")
```
- `q, k, v` are in FP16/BF16, with the shape `(batch_size, head_num, seq_len, head_dim)` when using the default `tensor_layout`.
- `mask_id` is a block mask specifying whether the corresponding block of the attention map should be computed (`1` to compute, `0` to skip). The default is `None`, which performs full SageAttention V1. Currently, we only support a block size of (128, 64). For example, for a 512x512 attention map, we can set `mask_id` as:
```python
mask_id = torch.tensor([
    [0, 1, 0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 1, 0, 0]
], device='cuda')
```
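The shape of `mask_id` follows from the (128, 64) block size: one mask row per 128 query positions and one mask column per 64 key positions. A minimal sketch of this arithmetic (the helper name `block_mask_shape` is ours, not part of the API):

```python
import math

def block_mask_shape(q_len, kv_len, block_q=128, block_kv=64):
    # One mask row per 128 query positions, one mask column per 64 key positions.
    return math.ceil(q_len / block_q), math.ceil(kv_len / block_kv)

print(block_mask_shape(512, 512))  # (4, 8) -> the 4x8 mask_id above
```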
- `is_causal` specifies whether the attention is causal; the default is `False`.
- `tensor_layout` specifies the layout of `q, k, v` (an end-to-end sketch follows this list):
  - for `q, k, v` with shape `(batch_size, head_num, seq_len, head_dim)`, it should be `"HND"`;
  - for `q, k, v` with shape `(batch_size, seq_len, head_num, head_dim)`, it should be `"NHD"`.
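Putting the arguments together, here is a minimal end-to-end sketch. The batch size, head count, sequence length of 512, and head dimension of 64 are illustrative assumptions rather than API requirements, and the 2D `mask_id` reuses the example above; if the kernel expects a per-batch/per-head mask, expand it accordingly.

```python
import torch
from sparse_sageattn import sparse_sageattn

# Illustrative sizes (assumptions): 512 tokens -> a 4x8 block mask with (128, 64) blocks.
batch_size, head_num, seq_len, head_dim = 1, 8, 512, 64
q = torch.randn(batch_size, head_num, seq_len, head_dim, dtype=torch.float16, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)

mask_id = torch.tensor([
    [0, 1, 0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 1, 0, 0]
], device='cuda')

# "HND" matches the (batch_size, head_num, seq_len, head_dim) tensors above.
out = sparse_sageattn(q, k, v, mask_id=mask_id, is_causal=False, tensor_layout="HND")
print(out.shape)  # torch.Size([1, 8, 512, 64])

# For tensors stored as (batch_size, seq_len, head_num, head_dim), pass tensor_layout="NHD",
# e.g. q_nhd = q.transpose(1, 2).contiguous().
```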
We compare the TOPS of Sparse_SageAttention, FlexAttention, and FlashAttention2 under different sparsity levels on RTX 4090 and RTX 5090.
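As a rough way to observe the effect of sparsity locally, one can time `sparse_sageattn` with random block masks of varying density. This is only an illustrative sketch, not the benchmark setup behind the comparison above; the sizes, densities, and iteration counts are arbitrary choices.

```python
import torch
from sparse_sageattn import sparse_sageattn

batch, heads, seq_len, head_dim = 1, 16, 4096, 64  # arbitrary illustrative sizes
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)

rows, cols = seq_len // 128, seq_len // 64  # block-mask grid for (128, 64) blocks
for density in (1.0, 0.5, 0.25, 0.125):
    # Random block mask keeping roughly `density` of the blocks.
    mask_id = (torch.rand(rows, cols, device='cuda') < density).long()
    for _ in range(3):  # warm-up
        sparse_sageattn(q, k, v, mask_id=mask_id, is_causal=False, tensor_layout="HND")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(20):
        sparse_sageattn(q, k, v, mask_id=mask_id, is_causal=False, tensor_layout="HND")
    end.record()
    torch.cuda.synchronize()
    print(f"density={density:.3f}: {start.elapsed_time(end) / 20:.3f} ms/call")
```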
If you use this code or find our work valuable, please cite:
```bibtex
@inproceedings{zhang2025spargeattn,
  title={SpargeAttn: Accurate sparse attention accelerating any model inference},
  author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}

@inproceedings{zhang2025sageattention,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

@inproceedings{zhang2024sageattention2,
  title={SageAttention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},
  author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}

@article{zhang2025sageattention3,
  title={SageAttention3: Microscaling FP4 attention for inference and an exploration of 8-bit training},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},
  journal={arXiv preprint arXiv:2505.11594},
  year={2025}
}
```