
feat: flash attention support for hexagon-npu #45


Merged: 91 commits merged into dev-refactoring on Jun 18, 2025

Conversation

@chraac chraac (Owner) commented on Jun 14, 2025

Related to #34

This PR introduces Flash Attention operator support for the Hexagon NPU backend with significant performance improvements for large-scale transformer attention workloads.

Key Features

Flash Attention Implementation

  • New Operator: flash_attn_f32 with custom HVX optimizations (see the sketch after this list)
  • Target Use Case: Transformer attention mechanisms on Snapdragon 8 Gen 2
  • Performance Focus: Optimized for long-context scenarios
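
For reference, below is a minimal scalar sketch of the streaming ("online") softmax that a flash-attention kernel such as flash_attn_f32 is built around. Names, layouts, and types are illustrative only, not the actual kernel code; the real implementation operates on F32 q / F16 kv tensors and presumably vectorizes these loops with HVX intrinsics and VTCM tiling.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative scalar reference of flash attention for a single query row:
// scores for the whole row are never materialized; instead a running max and
// running sum are carried while the output accumulator is rescaled.
void flash_attn_row(const float * q,        // [head_dim] single query row
                    const float * k,        // [kv_len * head_dim]
                    const float * v,        // [kv_len * head_dim]
                    float *       out,      // [head_dim]
                    size_t        head_dim,
                    size_t        kv_len,
                    float         scale) {  // typically 1/sqrt(head_dim)
    float              running_max = -INFINITY;
    float              running_sum = 0.0f;
    std::vector<float> acc(head_dim, 0.0f);

    for (size_t j = 0; j < kv_len; ++j) {
        // score = scale * dot(q, k_j)
        float score = 0.0f;
        for (size_t d = 0; d < head_dim; ++d) {
            score += q[d] * k[j * head_dim + d];
        }
        score *= scale;

        // update the running max, rescale the previous sum and accumulator
        const float new_max    = std::max(running_max, score);
        const float correction = std::exp(running_max - new_max);
        const float p          = std::exp(score - new_max);

        running_sum = running_sum * correction + p;
        for (size_t d = 0; d < head_dim; ++d) {
            acc[d] = acc[d] * correction + p * v[j * head_dim + d];
        }
        running_max = new_max;
    }

    // final normalization
    for (size_t d = 0; d < head_dim; ++d) {
        out[d] = acc[d] / running_sum;
    }
}
```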

API Refactoring

  • Unified Threading: Standardized all compute APIs on default_thread_pool::thread_params (see the sketch after this list)
  • Consistent Interface: Streamlined threading interface across all operations
  • Backwards Compatibility: Maintained existing functionality while improving structure
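
As a rough illustration of the unified threading interface (the struct and field names below are assumptions, not the actual contents of thread_pool.hpp), each compute entry point now receives a per-thread parameter block from the thread pool and derives its own slice of the work from it:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical shape of the per-thread parameters handed out by
// default_thread_pool; the real struct lives in thread_pool.hpp and differs.
struct thread_params_sketch {
    size_t thread_idx;    // index of this worker thread
    size_t thread_count;  // total number of worker threads in the pool
    void * vtcm_scratch;  // per-thread scratch buffer (e.g. a VTCM cache slice)
};

// A compute entry point standardized on the thread-params style: the operator
// no longer manages threading itself, it just derives the row range owned by
// the calling worker and processes it.
void compute_rows(float * dst, const float * src, size_t n_rows, size_t row_size,
                  const thread_params_sketch & params) {
    const size_t rows_per_thread = (n_rows + params.thread_count - 1) / params.thread_count;
    const size_t row_begin       = params.thread_idx * rows_per_thread;
    const size_t row_end         = std::min(row_begin + rows_per_thread, n_rows);

    for (size_t r = row_begin; r < row_end; ++r) {
        for (size_t c = 0; c < row_size; ++c) {
            dst[r * row_size + c] = src[r * row_size + c] * 2.0f;  // placeholder element-wise op
        }
    }
}
```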

Unit Tests

Platform: 8 Gen 2
Test Suite: test-backend-ops

[hexagon-npu][OPT_STEP_ADAMW]unsupported, dst: f32[10x5x4x3], supported/unsupported: 612/5027
unload rpcmem lib successfully
  OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3]): not supported [hexagon-npu] 
  5639/5639 tests passed
  Backend hexagon-npu: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops_all.debug.hexagon.2265fe441.7z

Performance Results

NPU vs CPU Performance Comparison

Platform: 8 Gen 2
Test Setup: test-backend-ops, q: F32, kv: F16

output dims      CPU (µs)   NPU (µs)
64x4x1x1              900       2011
64x4x35x1           13454      10506
64x16x1x1            1024       2055
64x16x35x1          36414      37516
80x4x1x1              619       1425
80x16x35x1          44002      40956
192x16x35x1         71008      49349
256x16x35x1         95015      51239
576x16x35x1        181905      60435

Performance Analysis

  • Small Workloads: NPU initialization overhead outweighs the compute savings, so the CPU is faster
  • Medium Workloads: NPU shows 7-31% performance gains
  • Large Workloads: NPU excels with up to a 3× speedup, i.e. roughly a 67% reduction in compute time (worked numbers below)
  • Scaling: NPU performance scales significantly better as the attention matrices grow
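
For example, taking the largest case in the table above (576x16x35x1): 181905 µs / 60435 µs ≈ 3.0× speedup, which corresponds to a (181905 - 60435) / 181905 ≈ 67% reduction in wall-clock time.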

Key Insight: The NPU backend becomes increasingly advantageous for large flash attention workloads, where the computational intensity of attention grows significantly.

Visualization

Total time consumption for each output size is shown in the diagram below:
[Figure: flash_attn performance chart, 8c6e29883]

@chraac chraac self-assigned this Jun 14, 2025
@chraac chraac requested a review from Copilot June 14, 2025 14:59

@chraac chraac changed the title feat: add flash attn feat: flash attention support for hexagon-npu Jun 14, 2025
@chraac chraac added the enhancement New feature or request label Jun 14, 2025
@chraac chraac requested a review from Copilot June 17, 2025 02:14

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds flash attention operator support to the Hexagon NPU backend, with significant performance improvements for transformer-based workloads. It also refactors the compute APIs around unified threading and adds performance tracking via updated compute parameters and thread-pool enhancements.

  • Flash Attention Implementation: Introduces a new operator ("flash_attn_f32") with custom HVX optimizations.
  • API Refactoring: Updates compute parameters to integrate thread parameters from default_thread_pool and enhances logging/debug performance tracking.
  • General Maintenance: Removes unused quant-related code and adjusts several components (e.g. tensor, op_impl, graph) for consistency and efficiency.

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

Summary per file:
  • ggml/src/ggml-qnn/npu/device/op_flash_attn.{cpp,hpp}: adds the flash attention operator and its implementation details.
  • ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp: updates the mul_mat operator to use the new thread parameters and support interfaces.
  • ggml/src/ggml-qnn/npu/device/thread_pool.hpp: refactors the thread pool around the VTCM cache and thread_params.
  • ggml/src/ggml-qnn/npu/device/tensor.hpp: adjusts logging and cache invalidation; marks writes with a flag.
  • ggml/src/ggml-qnn/npu/device/graph.cpp: updates the graph compute flow to use the new thread_params in compute_impl.
  • Others: various minor updates to logging, API signatures, and support functions.

@chraac chraac merged commit af620a1 into dev-refactoring Jun 18, 2025