
gemma 3 architecture #2880

Open
Alireza3242 opened this issue Mar 12, 2025 · 6 comments

Comments

@Alireza3242

Can you add the Gemma 3 architecture?

@zhaocc1106

+1

@artur-pf

Would be epic; ollama and llama.cpp have already implemented it.

@zhaocc1106

zhaocc1106 commented Mar 14, 2025

I found that the following attention mask type is not supported yet:
https://github.com/huggingface/transformers/blob/42ebb6c23e61119f769d7c7c067d5b4ae10e4a7f/src/transformers/models/gemma3/modeling_gemma3.py#L1147

# Apply bidirectional mask on images if token type ids are provided
if token_type_ids is not None and sequence_length != 1:
    token_type_mask = token_type_ids.unsqueeze(1) == token_type_ids.unsqueeze(2)
    token_type_mask[token_type_ids == 0] = False  # if text token do not change anything
    token_type_mask = token_type_mask.unsqueeze(1).to(causal_mask.device, dtype=torch.bool)
    causal_mask = causal_mask.clone()
    causal_mask[:, :, :, :sequence_length] = causal_mask[:, :, :, :sequence_length].masked_fill(
        token_type_mask, 0.0
    )

Could I set the attention_mask of the prefill stage with the executor API? Thanks.
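
For concreteness, here is a minimal standalone sketch (the toy shapes and token_type_ids are illustrative assumptions, not the executor API) of what the snippet above does to a small prefill mask, where 0 marks text tokens and 1 marks image tokens:

import torch

sequence_length = 5
min_dtype = torch.finfo(torch.float32).min

# Toy causal mask of shape [batch, 1, seq, seq]: 0 where attention is allowed, -inf above the diagonal.
causal_mask = torch.full((1, 1, sequence_length, sequence_length), min_dtype)
causal_mask = torch.triu(causal_mask, diagonal=1)

# One text token, three image tokens, one text token.
token_type_ids = torch.tensor([[0, 1, 1, 1, 0]])

token_type_mask = token_type_ids.unsqueeze(1) == token_type_ids.unsqueeze(2)
token_type_mask[token_type_ids == 0] = False  # text tokens keep the causal pattern
token_type_mask = token_type_mask.unsqueeze(1).to(causal_mask.device, dtype=torch.bool)
causal_mask = causal_mask.clone()
causal_mask[:, :, :, :sequence_length] = causal_mask[:, :, :, :sequence_length].masked_fill(
    token_type_mask, 0.0
)
print(causal_mask[0, 0])  # image rows attend bidirectionally within the image block

Rows for image tokens end up attending bidirectionally within the image block, while text rows keep the usual causal pattern.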

@zhaocc1106

zhaocc1106 commented Mar 15, 2025

I'm blocked here. I have added support for the Gemma 3 text LLM by following the Gemma 2 implementation, and it works correctly when the input is text only, without image tokens. But when the input contains an image (injected via p-tuning embeddings), the output is wrong. I finally found that the attention mask differs between text tokens and image tokens in the prefill phase. For example:

Text tokens only (a pure causal mask):

         token1  token2  token3  token4  token5
token1   0       -inf    -inf    -inf    -inf
token2   0       0       -inf    -inf    -inf
token3   0       0       0       -inf    -inf
token4   0       0       0       0       -inf
token5   0       0       0       0       0

With image tokens (not a pure causal mask):

             txt_token1  img_token2  img_token3  img_token4  txt_token5
txt_token1   0           -inf        -inf        -inf        -inf
img_token2   0           0           0           0           -inf
img_token3   0           0           0           0           -inf
img_token4   0           0           0           0           -inf
txt_token5   0           0           0           0           0

Is there a good way to support this? Thanks very much!
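
For illustration only (plain PyTorch, not the TensorRT-LLM executor API), a small sketch showing that the mask above is just an additive bias built from token_type_ids (0 = text, 1 = image), so any attention path that accepts an arbitrary custom mask for the prefill step could consume it; build_hybrid_mask and the shapes below are assumptions for the example:

import torch
import torch.nn.functional as F

def build_hybrid_mask(token_type_ids: torch.Tensor) -> torch.Tensor:
    """token_type_ids: [seq], 0 = text, 1 = image. Returns an additive mask [seq, seq]."""
    seq = token_type_ids.shape[0]
    neg_inf = torch.finfo(torch.float32).min
    mask = torch.triu(torch.full((seq, seq), neg_inf), diagonal=1)  # causal base
    same_image = (token_type_ids[:, None] == 1) & (token_type_ids[None, :] == 1)
    return mask.masked_fill(same_image, 0.0)  # image tokens attend to each other bidirectionally

token_type_ids = torch.tensor([0, 1, 1, 1, 0])  # txt, img, img, img, txt as in the table above
mask = build_hybrid_mask(token_type_ids)
print(mask)

# Any attention call that takes a custom additive mask can use it for the prefill:
q = k = v = torch.randn(1, 1, 5, 8)  # [batch, heads, seq, head_dim]
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)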

@zhaocc1106

zhaocc1106 commented Mar 17, 2025

I also found that Gemma 3 uses a sliding-window causal attention mask, not a plain causal one, which results in incorrect output for long inputs. Could we support a sliding-window causal mask? If it is already supported, is there a description of it? Thanks very much.

Additionally, I found that the current gpt attention does not support the AttentionMaskType.sliding_window_causal type:

assert self.attention_mask_type in [

sliding_window_causal is an important feature for some new LLMs.
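
For reference, a minimal PyTorch sketch of the sliding-window causal pattern (the window size and sequence length here are illustrative, not Gemma 3's actual configuration): each query may attend only to itself and the previous window - 1 tokens instead of the full causal prefix.

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Additive mask [seq_len, seq_len]: 0 where attention is allowed, -inf otherwise."""
    neg_inf = torch.finfo(torch.float32).min
    i = torch.arange(seq_len)[:, None]  # query positions
    j = torch.arange(seq_len)[None, :]  # key positions
    allowed = (j <= i) & (j > i - window)  # causal, and within the last `window` tokens
    return torch.full((seq_len, seq_len), neg_inf).masked_fill(allowed, 0.0)

print(sliding_window_causal_mask(seq_len=6, window=3))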

@zhaocc1106

I have tried to support the Gemma 3 text LLM (https://github.com/NetEase-Media/grps_trtllm/tree/master/tools/gemma3/tensorrt_llm_mod). But there is one issue: KV cache reuse is not supported, see #2912.
Additionally, it cannot process image tokens because the image-token attention mask is not supported, as noted in #2880 (comment).
