[Bettertransformer] `Transformers 4.41.0 (torch.SDPA-Bert)` breaks bettertransformers Bert, but works in `Transformers 4.40.2`

### System Info

```shell
python 3.11 
Windows / WSL2
poetry venv
```


### Who can help?

-

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction (minimal, reproducible, runnable)

Install `torch==2.3.1` and `transformers==4.41.0` (or `transformers==4.40.2`, with which the script works).

```python
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModel, AutoTokenizer
import torch
model_name = "michaelfeil/bge-small-en-v1.5"

model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


model = BetterTransformer.transform(model) # COMMENT OUT TO MAKE IT WORK.

model = model.cuda()

for num_sentences in [1, 10, 100]:
    sentences = [f"This is sentence number {i * 10}" for i in range(num_sentences)]
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(model.device)
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)
```

Output:

```python
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
torch.Size([1, 7, 384])
torch.Size([10, 7, 384])
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/michael/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/michael/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/michael/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/michael/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/michael/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/michael/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/michael/infinity/bert.py", line 16, in <module>
    outputs = model(**inputs)
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 1137, in forward
    encoder_outputs = self.encoder(
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 690, in forward
    layer_outputs = layer_module(
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/infinity/libs/infinity_emb/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py", line 300, in forward
    attention_mask = torch.reshape(attention_mask, (attention_mask.shape[0], attention_mask.shape[-1]))
RuntimeError: shape '[100, 8]' is invalid for input of size 6400
```
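The numbers in the error message are consistent with the attention mask arriving as a 4D tensor of shape `[batch, 1, seq, seq]` instead of `[batch, 1, 1, seq]`: 100 * 1 * 8 * 8 = 6400, whereas the reshape in `encoder_models.py` line 300 needs 100 * 8 = 800 elements. This is a reading of the traceback, not something verified against the Transformers source, so both 4D layouts below are assumptions. A minimal sketch of the arithmetic in plain Python (the helper `can_reshape` is hypothetical and only compares element counts):

```python
from math import prod


def can_reshape(src_shape, dst_shape):
    """Return True if both shapes hold the same number of elements."""
    return prod(src_shape) == prod(dst_shape)


batch, seq = 100, 8
# Target requested by the failing line: (mask.shape[0], mask.shape[-1])
target = (batch, seq)

# Assumed mask layouts (not verified against the Transformers source):
mask_broadcast = (batch, 1, 1, seq)    # assumed layout under 4.40.2
mask_expanded = (batch, 1, seq, seq)   # assumed layout under 4.41.0 with SDPA

print(prod(mask_broadcast), can_reshape(mask_broadcast, target))  # 800 True
print(prod(mask_expanded), can_reshape(mask_expanded, target))    # 6400 False
```

The first two batches presumably contain no padding (every sentence has 7 tokens), which may explain why only the third batch fails.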

### Solution

```
Installing dependencies from lock file
Package operations: 0 installs, 1 update, 0 removals
  • Downgrading transformers (4.41.2 -> 4.40.2)
```

After the downgrade, the same script runs through:

```python
torch.Size([1, 7, 384])
torch.Size([10, 7, 384])
torch.Size([100, 8, 384])
```


### Expected behavior

BetterTransformer is still about 1.5x faster than torch SDPA for me, so for now I am stuck pinning Hugging Face `transformers` to `<=4.40.2`.
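Until this is resolved, a script can check the installed version at startup and warn before `BetterTransformer.transform` is called. A minimal sketch, assuming 4.40.2 is the last working release (the only one confirmed in this report); `parse_version` and `is_known_good` are hypothetical helpers, and the parsing is deliberately naive (it stops at the first non-numeric component):

```python
from importlib import metadata

# Assumption: last release confirmed to work in this report.
LAST_KNOWN_GOOD = (4, 40, 2)


def parse_version(version):
    """Keep the leading numeric components, e.g. '4.41.0.dev0' -> (4, 41, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_known_good(version):
    """Hypothetical helper: True if `version` is at or below LAST_KNOWN_GOOD."""
    return parse_version(version) <= LAST_KNOWN_GOOD


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    installed = None  # transformers is not installed; nothing to check

if installed is not None and not is_known_good(installed):
    print(
        f"transformers {installed} may break BetterTransformer for BERT; "
        "consider pinning transformers<=4.40.2"
    )
```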

