feat: flash attention support for hexagon-npu #45
Pull Request Overview
This PR adds flash attention operator support to the Hexagon NPU backend, delivering significant performance improvements for transformer-based workloads. It also refactors the compute API to use unified threading and performance tracking, via updated compute parameters and thread pool enhancements.
- Flash Attention Implementation: Introduces a new operator ("flash_attn_f32") with custom HVX optimizations (see the sketch after this list).
- API Refactoring: Updates compute parameters to integrate thread parameters from default_thread_pool and enhances logging/debug performance tracking.
- General Maintenance: Removes unused quant-related code and adjusts several components (e.g. tensor, op_impl, graph) for consistency and efficiency.
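The HVX-optimized kernel lives in op_flash_attn.{cpp,hpp}; the sketch below (referenced in the first bullet) only illustrates the scalar online-softmax recurrence that flash-attention kernels of this kind are built on. The function name, signature, and plain-F32 K/V layout are assumptions for illustration, not the PR's actual device interface.

```cpp
// Scalar sketch of the online-softmax recurrence behind a flash-attention
// kernel. NOT the HVX code from op_flash_attn.cpp; names and the F32-only
// layout are illustrative assumptions.
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One query row q[0..d) against n_kv key/value rows (k and v are row-major,
// d floats per row). Writes o[0..d) = softmax(scale * q.K^T) * V in a single
// pass, without materializing the full score row.
void flash_attn_row_f32(const float * q, const float * k, const float * v,
                        float * o, size_t d, size_t n_kv, float scale) {
    float m = -std::numeric_limits<float>::infinity(); // running max score
    float l = 0.0f;                                    // running softmax denominator
    std::vector<float> acc(d, 0.0f);                   // running weighted sum of V rows

    for (size_t j = 0; j < n_kv; ++j) {
        float s = 0.0f;                                // score_j = scale * dot(q, k_j)
        for (size_t i = 0; i < d; ++i) {
            s += q[i] * k[j * d + i];
        }
        s *= scale;

        // When a new maximum appears, rescale the accumulator so that the
        // exponentials stay bounded (the core flash-attention trick).
        const float m_new = s > m ? s : m;
        const float corr  = std::exp(m - m_new);
        const float p     = std::exp(s - m_new);

        l = l * corr + p;
        for (size_t i = 0; i < d; ++i) {
            acc[i] = acc[i] * corr + p * v[j * d + i];
        }
        m = m_new;
    }

    const float inv_l = l > 0.0f ? 1.0f / l : 0.0f;    // final normalization
    for (size_t i = 0; i < d; ++i) {
        o[i] = acc[i] * inv_l;
    }
}
```

The point of the recurrence is that the full attention score matrix is never materialized; the running maximum m and denominator l keep the exponentials numerically stable in a single pass over the KV data.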
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ggml/src/ggml-qnn/npu/device/op_flash_attn.{cpp,hpp} | Adds the flash attention operator and its implementation. |
| ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Updates the mul_mat operator to use the new thread parameters and support interfaces. |
| ggml/src/ggml-qnn/npu/device/thread_pool.hpp | Refactors the thread pool to use the VTCM cache and thread_params. |
| ggml/src/ggml-qnn/npu/device/tensor.hpp | Adjusts logging and cache invalidation; marks writes with a flag (see the sketch after this table). |
| ggml/src/ggml-qnn/npu/device/graph.cpp | Updates the graph compute flow to pass the new thread_params into compute_impl. |
| Others | Various minor updates to logging, API signatures, and support functions. |
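As referenced in the tensor.hpp row above, here is a hypothetical illustration of a write flag deferring cache maintenance until the data is actually consumed on a target with non-coherent caches. The class, its fields, and invalidate_cache() are illustrative assumptions, not the actual hexagon-npu tensor interface.

```cpp
// Hypothetical sketch of the "marks writes with a flag" idea: remember that
// the buffer was written, and only invalidate the cached range right before
// the other side reads it.
#include <cstddef>
#include <cstdint>

class tensor_sketch {
  public:
    tensor_sketch(uint8_t * data, size_t size) : _data(data), _size(size) {}

    // Writers call this instead of triggering cache maintenance immediately.
    void mark_modified() { _modified = true; }

    // Called right before the device consumes the tensor: invalidate the
    // cached range only if something was actually written since the last sync.
    const uint8_t * data_for_device() {
        if (_modified) {
            invalidate_cache(_data, _size);
            _modified = false;
        }
        return _data;
    }

  private:
    // Placeholder: real code would call the platform cache-maintenance API.
    static void invalidate_cache(const uint8_t * /*ptr*/, size_t /*size*/) {}

    uint8_t * _data     = nullptr;
    size_t    _size     = 0;
    bool      _modified = false;
};
```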
Related to #34
This PR introduces Flash Attention operator support for the Hexagon NPU backend with significant performance improvements for large-scale transformer attention workloads.
Key Features
- Flash Attention Implementation: adds a new flash_attn_f32 operator with custom HVX optimizations.
- API Refactoring: compute functions now receive thread parameters via default_thread_pool::thread_params (see the threading sketch after this list).
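A minimal sketch of how an operator might split its rows once compute functions receive per-thread parameters. Only the default_thread_pool::thread_params name comes from this PR; the struct fields and the helper below are assumptions and differ from the real thread_pool.hpp.

```cpp
// Hypothetical per-thread work splitting, assuming each thread knows its
// index and the pool size.
#include <cstddef>

struct thread_params_sketch {
    size_t tidx;  // index of this thread within the pool (assumed field)
    size_t tcnt;  // total number of pool threads         (assumed field)
};

// Evenly partition [0, n_rows) so each thread works on a contiguous slice.
inline void get_row_range(const thread_params_sketch & tp, size_t n_rows,
                          size_t & row_start, size_t & row_end) {
    const size_t tcnt       = tp.tcnt ? tp.tcnt : 1;
    const size_t per_thread = (n_rows + tcnt - 1) / tcnt;
    row_start = tp.tidx * per_thread;
    row_end   = row_start + per_thread;
    if (row_start > n_rows) row_start = n_rows;
    if (row_end   > n_rows) row_end   = n_rows;
}

// An operator body would then loop only over its own slice, e.g.:
//   size_t r0, r1;
//   get_row_range(params, total_rows, r0, r1);
//   for (size_t r = r0; r < r1; ++r) { /* compute one output row */ }
```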
Unit Tests
Platform: 8 Gen 2
Test Suite: test-backend-ops
Full log: test-backend-ops_all.debug.hexagon.2265fe441.7z
Performance Results
NPU vs CPU Performance Comparison
Platform: 8 Gen 2
Test Setup: test-backend-ops, q:F32, kv:F16
Performance Analysis
Key Insight: The NPU backend becomes increasingly advantageous for larger flash attention workloads, where attention computation intensity grows significantly.
Visualization
The full time consumption for each output size is shown in the following diagram:
