
Seeking Guidance: Addressing Performance-Related Warning Messages to Optimize Execution Speed #329


Open
eanzero opened this issue Sep 25, 2024 · 13 comments


@eanzero

eanzero commented Sep 25, 2024

Thank you for taking the time to review my question.

Before I proceed, I would like to mention that I am a beginner, and I would appreciate your consideration of this fact.

I am seeking assistance in resolving the following warnings in order to improve execution speed. The code runs and produces results, but it emits the warning messages listed below. From my research, I understand that these warnings can affect execution speed, but I have not been able to find a solution, hence this question.

C:\Users\USER\ddd\segment-anything-2\sam2\modeling\backbones\hieradet.py:68: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
x = F.scaled_dot_product_attention(
C:\Users\USER\ddd\segment-anything-2\sam2\modeling\sam\transformer.py:270: UserWarning: Memory efficient kernel not used because: (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:723.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
C:\Users\USER\ddd\segment-anything-2\sam2\modeling\sam\transformer.py:270: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen/native/transformers/sdp_utils_cpp.h:495.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
C:\Users\USER\ddd\segment-anything-2\sam2\modeling\sam\transformer.py:270: UserWarning: Flash attention kernel not used because: (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:725.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
C:\Users\USER\ddd\segment-anything-2\sam2\modeling\sam\transformer.py:270: UserWarning: CuDNN attention kernel not used because: (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:727.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
C:\Users\USER\ddd\segment-anything-2\sam2\modeling\sam\transformer.py:270: UserWarning: The CuDNN backend needs to be enabled by setting the enviornment variableTORCH_CUDNN_SDPA_ENABLED=1 (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:497.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
C:\Users\USER\anaconda3\envs\ddd\Lib\site-packages\torch\nn\modules\module.py:1562: UserWarning: Flash Attention kernel failed due to: No available kernel. Aborting execution.
Falling back to all available kernels for scaled_dot_product_attention (which may have a slower speed).
return forward_call(*args, **kwargs)

My execution environment is as follows:

  • Docker
  • PyTorch 2.4.0
  • CUDA 12.4
  • GPU: RTX 3070 (Memory: 8.0G)

The CUDA environment on the host machine is:
Cuda compilation tools, release 12.5, V12.5.82 Build cuda_12.5.r12.5/compiler.34385749_0

I would greatly appreciate any guidance on how to address these warnings. Thank you in advance for your help.

@dario-spagnolo

I am getting the same warnings.

My environment:

  • Python 3.10.12
  • PyTorch 2.4.1
  • CUDA 12.4
  • GPU : NVIDIA L4 (24 GB)

@renhaa

renhaa commented Oct 2, 2024

Same for me on

  • python 3.11
  • pytorch 2.4.0
  • cuda 12.4.1
  • Nvidia A40 GPU (48GB)

@ronghanghu
Contributor

Hi @eanzero @dario-spagnolo @renhaa, you can turn off these warnings by changing the line

OLD_GPU, USE_FLASH_ATTN, MATH_KERNEL_ON = get_sdpa_settings()
to

OLD_GPU, USE_FLASH_ATTN, MATH_KERNEL_ON = True, True, True

This would directly try out all the available kernels (instead of trying Flash Attention first and then falling back to other kernels upon errors).
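For anyone unsure what those three flags actually control, here is a minimal, self-contained sketch of the effect, assuming the flags feed a torch.backends.cuda.sdp_kernel context around the attention call (the exact expressions in the repository may differ slightly):

```python
import torch
import torch.nn.functional as F

# Sketch only: setting all three flags to True enables every SDPA backend,
# so PyTorch picks whichever kernel can actually run on the current GPU/dtype.
OLD_GPU, USE_FLASH_ATTN, MATH_KERNEL_ON = True, True, True  # instead of get_sdpa_settings()

q = k = v = torch.randn(1, 8, 256, 64, device="cuda", dtype=torch.bfloat16)
with torch.backends.cuda.sdp_kernel(
    enable_flash=USE_FLASH_ATTN,
    enable_math=MATH_KERNEL_ON,
    enable_mem_efficient=OLD_GPU,
):
    out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0)
print(out.shape)
```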


@eanzero The error message above shows that the Flash Attention kernel failed

C:\Users\USER\ddd\segment-anything-2\sam2\modeling\sam\transformer.py:270: UserWarning: Flash attention kernel not used because: (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:725.)

but PyTorch didn't print a further line explaining why it failed. Meanwhile, the GPU you're using (RTX 3070) has a CUDA compute capability of 8.6 according to https://developer.nvidia.com/cuda-gpus, so it should support Flash Attention in principle.

A possible cause is that there could be some mismatch between your CUDA driver, CUDA runtime, and PyTorch versions, causing Flash Attention kernels to fail, especially given that you're using Windows. Previously people have reported issues with Flash Attention on Windows (e.g. in pytorch/pytorch#108175 and Dao-AILab/flash-attention#553), and it could be the same issue in your case. To avoid these issues, it's recommended to use Windows Subsystem for Linux if you're running on Windows.
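If you want to rule out an obvious driver/runtime mismatch before moving to WSL, a quick check with standard PyTorch APIs (nothing SAM 2 specific) looks like this:

```python
import torch

print(torch.__version__, torch.version.cuda)             # PyTorch build and the CUDA runtime it was built against
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))               # RTX 3070 should report (8, 6)
print(torch.backends.cuda.flash_sdp_enabled())           # Flash SDPA backend enabled?
print(torch.backends.cuda.mem_efficient_sdp_enabled())   # memory-efficient backend enabled?
print(torch.backends.cuda.math_sdp_enabled())            # math (fallback) backend enabled?
```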

@kleinzcy

I ran into the same problem. My environment:

  • torch 2.4.1+cu124
  • A100
  • cuda 12.3
  • flash-attn 2.6.3
  • python 3.10

In my own test, flash attention works, but it does not work inside sam2. The full message is as follows:

sam2/sam2/modeling/sam/transformer.py:269: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:723.)
  out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
sam2/sam2/modeling/sam/transformer.py:269: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:495.)
  out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
sam2/sam2/modeling/sam/transformer.py:269: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:725.)
  out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
**sam2/sam2/modeling/sam/transformer.py:269: UserWarning: Expected query, key and value to all be of dtype: {Half, BFloat16}. Got Query dtype: float, Key dtype: float, and Value dtype: float instead. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:98.)**
  out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
sam2/sam2/modeling/sam/transformer.py:269: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:727.)
  out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
anaconda3/envs/env_sam/lib/python3.10/site-packages/torch/nn/modules/module.py:1562: UserWarning: Flash Attention kernel failed due to: No available kernel. Aborting execution.
Falling back to all available kernels for scaled_dot_product_attention (which may have a slower speed).

@kleinzcy

Luckily, there is a PR that solves this problem.

#322

It works for me.

@bumi001

bumi001 commented Nov 9, 2024

I had this warning:
sam2/modeling/sam/transformer.py:270: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:773.)

The above-mentioned PR (#322) fixed that issue for me.

@catwell

catwell commented Nov 20, 2024

The mentioned PR (#322) "works" in the sense that it silences the warnings because it falls back to the next kernel. It will not make you use FlashAttention.

I suspect the reason you have that issue is that you are not using autocast as described here. The way the code deals with dtypes when autocast is not used is a bit weird.
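For readers unsure what "autocast as described here" refers to: SAM 2's example usage runs inference under bfloat16 autocast, roughly like the sketch below, so that q/k/v reach scaled_dot_product_attention in half precision. The checkpoint/config names and the prompt values are illustrative placeholders, not taken from anyone's setup in this thread.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"   # illustrative paths
model_cfg = "sam2_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.zeros((720, 1280, 3), dtype=np.uint8)          # placeholder HxWx3 image
point_coords = np.array([[640, 360]], dtype=np.float32)   # one positive click
point_labels = np.array([1], dtype=np.int32)

# bf16 autocast keeps the attention inputs in half precision, which the Flash kernel requires
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, logits = predictor.predict(point_coords=point_coords,
                                              point_labels=point_labels)
```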

@lzbgt

lzbgt commented Dec 3, 2024

> The mentioned PR (#322) "works" in the sense that it silences the warnings because it falls back to the next kernel. It will not make you use FlashAttention.
>
> I suspect the reason you have that issue is that you are not using autocast as described here. The way the code deals with dtypes when autocast is not used is a bit weird.

hey dude, it's not "silencing the warnings". If FlashAttention is set up correctly, it will work.
And if it falls back, it will warn...

@lzbgt

lzbgt commented Dec 3, 2024

> Luckily, there is a PR that solves this problem.
>
> #322
>
> It works for me.

glad it helps.

@catwell

catwell commented Dec 3, 2024

> hey dude, it's not "silencing the warnings". If FlashAttention is set up correctly, it will work. And if it falls back, it will warn...

torch.nn.attention.sdpa_kernel does not warn when it falls back. So if FlashAttention cannot be selected, it will use Efficient Attention or Math, and it will not warn. I do not think there is a warning anywhere else.

The true root cause of the error in the logs above is this:

Expected query, key and value to all be of dtype: {Half, BFloat16}. Got Query dtype: float, Key dtype: float, and Value dtype: float instead.

This means that somewhere the code runs in 32-bit, while FlashAttention requires 16-bit (fp16 or bf16). You can fix this by using autocast, or by patching the code base to add .to casts in the few places they are missing.
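A minimal way to see that dtype requirement in isolation (assumes a CUDA GPU and PyTorch 2.3+ for torch.nn.attention.sdpa_kernel):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = k = v = torch.randn(1, 8, 256, 64, device="cuda")        # float32 inputs

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    try:
        F.scaled_dot_product_attention(q, k, v)               # Flash refuses fp32
    except RuntimeError as e:
        print("fp32 failed:", e)

q16, k16, v16 = (t.to(torch.bfloat16) for t in (q, k, v))     # cast (or use autocast)
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q16, k16, v16)       # works on Ampere+ GPUs
    print("bf16 ok:", out.shape, out.dtype)
```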

@bumi001

bumi001 commented Dec 5, 2024

> The mentioned PR (#322) "works" in the sense that it silences the warnings because it falls back to the next kernel. It will not make you use FlashAttention.
>
> I suspect the reason you have that issue is that you are not using autocast as described here. The way the code deals with dtypes when autocast is not used is a bit weird.

Well, before using (#322), I installed flash-attention with pip install flash-attn --no-build-isolation as per https://github.com/Dao-AILab/flash-attention. Do you still think it might have only silenced the warnings in my case?

@catwell

catwell commented Dec 5, 2024

Here is a way to know: instead of passing fallbacks like #322, only pass SDPBackend.FLASH_ATTENTION to torch.nn.attention.sdpa_kernel (remove the other two).

If it still works, it is fine; if it breaks, you were not using Flash Attention.
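Concretely, the check is a one-backend context around the existing call site (q, k, v, dropout_p and F are the names already used in sam2/modeling/sam/transformer.py; this is a sketch of the edit, not a drop-in patch):

```python
from torch.nn.attention import sdpa_kernel, SDPBackend

# Diagnostic only: restrict SDPA to the Flash kernel. If inference now fails with
# "No available kernel", Flash Attention was never actually being used.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
```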

@sevenshr

> Here is a way to know: instead of passing fallbacks like #322, only pass SDPBackend.FLASH_ATTENTION to torch.nn.attention.sdpa_kernel (remove the other two).
>
> If it still works, it is fine; if it breaks, you were not using Flash Attention.

I agree with you: after the changes in the mentioned PR, the training speed is still the same.
