
feat: flash attention support for hexagon-npu #45


Merged: 91 commits merged into dev-refactoring on Jun 18, 2025

Conversation

@chraac chraac (Owner) commented on Jun 14, 2025

Related to #34

This PR introduces Flash Attention operator support for the Hexagon NPU backend with significant performance improvements for large-scale transformer attention workloads.

Key Features

Flash Attention Implementation

  • New Operator: flash_attn_f32 with custom HVX optimizations (see the sketch after this list)
  • Target Use Case: Transformer attention mechanisms on Snapdragon 8 Gen 2
  • Performance Focus: Optimized for long-context scenarios
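
For reference, below is a minimal scalar sketch of the streaming ("online") softmax that a flash-attention kernel such as flash_attn_f32 is built around. Names, layouts, and types are illustrative only, not the actual kernel code; the real implementation operates on F32 q / F16 kv tensors and presumably vectorizes these loops with HVX intrinsics and VTCM tiling.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative scalar reference of flash attention for a single query row:
// scores for the whole row are never materialized; instead a running max and
// running sum are carried while the output accumulator is rescaled.
void flash_attn_row(const float * q,        // [head_dim] single query row
                    const float * k,        // [kv_len * head_dim]
                    const float * v,        // [kv_len * head_dim]
                    float *       out,      // [head_dim]
                    size_t        head_dim,
                    size_t        kv_len,
                    float         scale) {  // typically 1/sqrt(head_dim)
    float              running_max = -INFINITY;
    float              running_sum = 0.0f;
    std::vector<float> acc(head_dim, 0.0f);

    for (size_t j = 0; j < kv_len; ++j) {
        // score = scale * dot(q, k_j)
        float score = 0.0f;
        for (size_t d = 0; d < head_dim; ++d) {
            score += q[d] * k[j * head_dim + d];
        }
        score *= scale;

        // update the running max, rescale the previous sum and accumulator
        const float new_max    = std::max(running_max, score);
        const float correction = std::exp(running_max - new_max);
        const float p          = std::exp(score - new_max);

        running_sum = running_sum * correction + p;
        for (size_t d = 0; d < head_dim; ++d) {
            acc[d] = acc[d] * correction + p * v[j * head_dim + d];
        }
        running_max = new_max;
    }

    // final normalization
    for (size_t d = 0; d < head_dim; ++d) {
        out[d] = acc[d] / running_sum;
    }
}
```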

API Refactoring

  • Unified Threading: Standardized all compute APIs on default_thread_pool::thread_params (see the sketch after this list)
  • Consistent Interface: Streamlined threading interface across all operations
  • Backwards Compatibility: Maintained existing functionality while improving structure
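
As a rough illustration of the unified threading interface (the struct and field names below are assumptions, not the actual contents of thread_pool.hpp), each compute entry point now receives a per-thread parameter block from the thread pool and derives its own slice of the work from it:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical shape of the per-thread parameters handed out by
// default_thread_pool; the real struct lives in thread_pool.hpp and differs.
struct thread_params_sketch {
    size_t thread_idx;    // index of this worker thread
    size_t thread_count;  // total number of worker threads in the pool
    void * vtcm_scratch;  // per-thread scratch buffer (e.g. a VTCM cache slice)
};

// A compute entry point standardized on the thread-params style: the operator
// no longer manages threading itself, it just derives the row range owned by
// the calling worker and processes it.
void compute_rows(float * dst, const float * src, size_t n_rows, size_t row_size,
                  const thread_params_sketch & params) {
    const size_t rows_per_thread = (n_rows + params.thread_count - 1) / params.thread_count;
    const size_t row_begin       = params.thread_idx * rows_per_thread;
    const size_t row_end         = std::min(row_begin + rows_per_thread, n_rows);

    for (size_t r = row_begin; r < row_end; ++r) {
        for (size_t c = 0; c < row_size; ++c) {
            dst[r * row_size + c] = src[r * row_size + c] * 2.0f;  // placeholder element-wise op
        }
    }
}
```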

Unit Tests

Platform: 8 Gen 2
Test Suite: test-backend-ops

[hexagon-npu][OPT_STEP_ADAMW]unsupported, dst: f32[10x5x4x3], supported/unsupported: 612/5027
unload rpcmem lib successfully
  OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3]): not supported [hexagon-npu] 
  5639/5639 tests passed
  Backend hexagon-npu: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops_all.debug.hexagon.2265fe441.7z

Performance Results

NPU vs CPU Performance Comparison

Platform: 8 Gen 2
Test Setup: test-backend-ops, q: F32, kv: F16

output dims      CPU (µs)   NPU (µs)
64x4x1x1              900       2011
64x4x35x1           13454      10506
64x16x1x1            1024       2055
64x16x35x1          36414      37516
80x4x1x1              619       1425
80x16x35x1          44002      40956
192x16x35x1         71008      49349
256x16x35x1         95015      51239
576x16x35x1        181905      60435

Performance Analysis

  • Small Workloads: NPU initialization overhead outweighs the compute savings, so the CPU is faster
  • Medium Workloads: NPU shows 7-31% performance gains
  • Large Workloads: NPU excels with up to a 3× speedup, i.e. roughly a 67% reduction in compute time (worked numbers below)
  • Scaling: NPU performance scales significantly better as the attention matrices grow
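
For example, taking the largest case in the table above (576x16x35x1): 181905 µs / 60435 µs ≈ 3.0× speedup, which corresponds to a (181905 - 60435) / 181905 ≈ 67% reduction in wall-clock time.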

Key Insight: The NPU backend becomes increasingly advantageous for large flash attention workloads, where the computational intensity of attention grows significantly.

Visualization

Total time consumption for each output size is shown in the diagram below:
[Figure: flash_attn performance chart, 8c6e29883]

@chraac chraac self-assigned this Jun 14, 2025
@chraac chraac requested a review from Copilot June 14, 2025 14:59

@chraac chraac changed the title feat: add flash attn feat: flash attention support for hexagon-npu Jun 14, 2025
@chraac chraac added the enhancement New feature or request label Jun 14, 2025
@chraac chraac requested a review from Copilot June 17, 2025 02:14

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds flash attention operator support to the Hexagon NPU backend, with significant performance improvements for transformer-based workloads. It also refactors the compute APIs around unified threading and adds performance tracking via updated compute parameters and thread-pool enhancements.

  • Flash Attention Implementation: Introduces a new operator ("flash_attn_f32") with custom HVX optimizations.
  • API Refactoring: Updates compute parameters to integrate thread parameters from default_thread_pool and enhances logging/debug performance tracking.
  • General Maintenance: Removes unused quant-related code and adjusts several components (e.g. tensor, op_impl, graph) for consistency and efficiency.

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

Summary per file:
  • ggml/src/ggml-qnn/npu/device/op_flash_attn.{cpp,hpp}: adds the flash attention operator and its implementation details.
  • ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp: updates the mul_mat operator to use the new thread parameters and support interfaces.
  • ggml/src/ggml-qnn/npu/device/thread_pool.hpp: refactors the thread pool around the VTCM cache and thread_params.
  • ggml/src/ggml-qnn/npu/device/tensor.hpp: adjusts logging and cache invalidation; marks writes with a flag.
  • ggml/src/ggml-qnn/npu/device/graph.cpp: updates the graph compute flow to use the new thread_params in compute_impl.
  • Others: various minor updates to logging, API signatures, and support functions.

@chraac chraac merged commit af620a1 into dev-refactoring Jun 18, 2025