[Bugfix] Fix Gemma3 multimodal placeholder replacement #17074
`google/gemma-3-4b-it` fails with an error when run via the OpenAI-compatible server with multimodal input. I observed the issue with engine V0; I did not try engine V1.
The issue can be observed for messages of this form:

`[{'role': 'system', 'content': [{'text': 'Describe the image in a short sentence.', 'type': 'text'}]}, {'role': 'user', 'content': [{'text': 'random text', 'type': 'text'}, {'text': 'image: ', 'type': 'text'}, {'type': 'image_url', 'image_url': my_image_url}]}]`
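For reference, a minimal reproduction sketch against a locally running vLLM OpenAI-compatible server; the base URL, API key, and image URL are placeholders, not from the original report:

```python
from openai import OpenAI

# Assumed local vLLM server; adjust base_url/api_key for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

my_image_url = "https://example.com/some_image.png"  # placeholder

response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[
        {"role": "system",
         "content": [{"type": "text",
                      "text": "Describe the image in a short sentence."}]},
        {"role": "user",
         "content": [
             {"type": "text", "text": "random text"},
             {"type": "text", "text": "image: "},
             # Text parts plus an image part in the same user message is
             # what triggers the placeholder-prepend path described below.
             {"type": "image_url", "image_url": {"url": my_image_url}},
         ]},
    ],
)
print(response.choices[0].message.content)
```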
The issue comes from the fact that if the input message containing the image also has text parts (the `user` message in the example above), then the `chat_utils._get_full_multimodal_text_prompt` method will put the image placeholder at the front, like so: `'<start_of_image>\nrandom text'`.
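A minimal sketch of that prepend behaviour (a paraphrase, not vLLM's actual implementation):

```python
def get_full_multimodal_text_prompt(placeholder: str, text: str) -> str:
    """Paraphrase: a placeholder missing from the text is prepended,
    joined to the existing text with a newline."""
    if placeholder in text:
        return text
    return f"{placeholder}\n{text}"

assert (get_full_multimodal_text_prompt("<start_of_image>", "random text")
        == "<start_of_image>\nrandom text")
```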
Then `Gemma3MultiModalProcessor._get_prompt_updates => get_replacement_gemma3 => get_image_repl` will replace the placeholder with `Gemma3Processor.full_image_sequence = f"\n\n{tokenizer.boi_token}{image_tokens_expanded}{tokenizer.eoi_token}\n\n"`, which results in 3 consecutive `\n` characters after the `eoi_token`.
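At the string level, the effect can be reproduced like this (the image-token run is shortened for illustration):

```python
boi_token, eoi_token = "<start_of_image>", "<end_of_image>"
image_tokens_expanded = "<image_soft_token>" * 4  # shortened for illustration

full_image_sequence = f"\n\n{boi_token}{image_tokens_expanded}{eoi_token}\n\n"

# Prompt produced by the placeholder-prepend step above.
prompt = "<start_of_image>\nrandom text"

# The trailing "\n\n" of full_image_sequence meets the "\n" that was already
# inserted after the placeholder, producing "\n\n\n" after eoi_token.
expanded = prompt.replace("<start_of_image>", full_image_sequence, 1)
assert f"{eoi_token}\n\n\nrandom text" in expanded
```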
The `\n\n\n` will be tokenized as the 109 token, and when passed to `_find_mm_placeholders`, the 109 token will be replaced by `[107, 108]` instead of `[108, 107]`, which prevents the matching condition in `_iter_placeholders` from ever evaluating to `True`.
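Schematically, the failure is exact token-subsequence matching breaking on a merged newline token. In the sketch below, ids 107/108/109 are taken from the report, the other ids are made up, and the decomposition detail is simplified to "the merged token never equals the expected pair":

```python
def find_subsequence(haystack: list[int], needle: list[int]) -> int:
    """Return the start index of needle in haystack, or -1 if absent."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

EOI = 256000                      # made-up id for eoi_token
expected_tail = [EOI, 108, 107]   # replacement tokens computed in isolation
prompt_tokens = [2, 1000, EOI, 109, 2000]  # "\n\n\n" merged into token 109

# The merged token never matches the expected newline pair, so the
# placeholder is never located.
assert find_subsequence(prompt_tokens, expected_tail) == -1
```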