gemma 3 architecture #2880
Comments
+1
Would be epic; ollama and llama.cpp have implemented it already.
I found that the following attention mask type is not supported now:

Could I set the attention_mask of the prefill stage with the Executor API? Thanks.
I'm blocked here. I added support for the gemma3 text LLM by referring to gemma2, and it works fine when the input is text only, with no image tokens. But when the input contains an image (injected via ptuning embeddings), the output is wrong. I finally found that in the prefill phase the attention mask differs between text tokens and image tokens. For example:

Text tokens only (a causal mask):

With image tokens (not a pure causal mask):

Is there any good method to support this? Thanks very much!
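For what it's worth, here is a minimal PyTorch sketch of that mixed mask: causal everywhere, with each contiguous run of image tokens additionally attending to itself bidirectionally. The helper name and the `is_image` input layout are illustrative assumptions, not an existing TensorRT-LLM API:

```python
import torch

def build_gemma3_prefill_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Illustrative prefill mask (True = may attend): a standard causal
    mask, plus full bidirectional attention within each contiguous run
    of image tokens. `is_image` is a bool tensor of shape [seq_len]."""
    seq_len = is_image.shape[0]
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Label each contiguous image run: the cumulative count of text
    # tokens is constant inside a run and differs between runs.
    run_id = torch.cumsum((~is_image).long(), dim=0)
    same_run = run_id.unsqueeze(0) == run_id.unsqueeze(1)
    both_image = is_image.unsqueeze(0) & is_image.unsqueeze(1)
    mask |= same_run & both_image
    return mask

# Example: 2 text tokens, a 3-token image, then 1 text token.
# build_gemma3_prefill_mask(
#     torch.tensor([False, False, True, True, True, False]))
```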
I also found that gemma3 uses this mask type. Additionally, I found that the current gpt attention does not support it: TensorRT-LLM/tensorrt_llm/layers/attention.py, line 1001 in 9b931c0.
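As background (my understanding from the Gemma 3 report, not from this codebase): Gemma 3 interleaves local sliding-window attention layers (window size 1024) with global full-attention layers, so the local layers also need a window-limited causal mask. A small illustrative sketch; the helper name and default are assumptions:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 1024) -> torch.Tensor:
    """Causal mask limited to the last `window` key positions
    (True = may attend). Gemma 3's local layers use this pattern;
    its global layers keep the full causal mask."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions
    k = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (k <= q) & (k > q - window)
```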
I have tried to support the gemma3 text LLM (https://github.com/NetEase-Media/grps_trtllm/tree/master/tools/gemma3/tensorrt_llm_mod), but there is one issue: it does not support KV cache reuse, ref #2912.
Can you add the gemma 3 architecture?