llama: Attempt to add ModernBert #14014
base: master
Conversation
The embedding results seem random and very low. There is something wrong with this.
Delete the files you added in models; we don't need them. Just make sure test-tokenizer-0 succeeds with the GGUF.
src/llama-model.cpp (outdated)

    inpL = build_norm(inpL, model.tok_norm, nullptr, LLM_NORM, -1);
    cb(inpL, "inp_norm", -1);

    auto * inp_attn = build_attn_inp_kv_unified_iswa();
This should probably become:

-    auto * inp_attn = build_attn_inp_kv_unified_iswa();
+    auto * inp_attn = build_attn_inp_no_cache_iswa();
And add the corresponding mask logic in llama-graph. Special attention should be paid to how the SWA works for this model, i.e. whether it is symmetric or not:

# non-symmetric:
token i attends to [i - n_swa, i]

# symmetric:
token i attends to [i - n_swa/2, i + n_swa/2]
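As a purely illustrative sketch (not code from this PR), the two definitions translate into different mask predicates; here p1 is the query position, p0 the key position, and n_swa the window size:

#include <cstdint>
#include <cstdlib>

// Sketch: may a query at position p1 attend to a key at position p0?
static bool swa_allows(int32_t p0, int32_t p1, int32_t n_swa, bool symmetric) {
    if (symmetric) {
        // symmetric: token i attends to [i - n_swa/2, i + n_swa/2]
        return std::abs(p1 - p0) <= n_swa/2;
    }
    // non-symmetric: token i attends to [i - n_swa, i]
    return p0 <= p1 && p1 - p0 <= n_swa;
}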
You have to add the new arch here (Lines 13195 to 13203 in 5a8ae30):
switch (arch) {
    case LLM_ARCH_BERT:
    case LLM_ARCH_JINA_BERT_V2:
    case LLM_ARCH_NOMIC_BERT:
    case LLM_ARCH_NOMIC_BERT_MOE:
    case LLM_ARCH_WAVTOKENIZER_DEC:
        {
            res = nullptr;
        } break;
To avoid creating a memory module (a.k.a. KV cache) for these models.
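For illustration only, the same switch with the new case added; LLM_ARCH_MODERN_BERT is an assumed enum name, not taken from this PR:

switch (arch) {
    case LLM_ARCH_BERT:
    case LLM_ARCH_JINA_BERT_V2:
    case LLM_ARCH_NOMIC_BERT:
    case LLM_ARCH_NOMIC_BERT_MOE:
    case LLM_ARCH_MODERN_BERT:        // assumed name for the new arch
    case LLM_ARCH_WAVTOKENIZER_DEC:
        {
            res = nullptr; // no KV cache / memory module for these architectures
        } break;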
So, since the vocab is BPE, you need to add this: Line 1557 in 9f47fa5.
Set the correct attribute on the [MASK] token, similarly to this: Lines 2097 to 2105 in 9f47fa5.
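As a rough sketch of what the referenced block does (the container and field names are assumed from llama-vocab, not quoted from this PR), the idea is to find the [MASK] token and give it the LSTRIP attribute so that a preceding space is absorbed during tokenization:

// Sketch only: mark [MASK] the way other BERT-style vocabs mark their mask token.
for (auto & td : id_to_token) {                  // td: { text, score, attr } (assumed)
    if (td.text == "[MASK]") {
        td.attr = (llama_token_attr) (td.attr | LLAMA_TOKEN_ATTR_LSTRIP);
    }
}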
Yep, I also noticed the same with ...

@huydt84 Don't forget this ^ - it's important.

Will dig into this tonight/this weekend...

Thank you! I have just added it.
Need to add a new llama_swa_type enum value, e.g.:

LLAMA_SWA_TYPE_SYMMETRIC = 3,
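For context, a sketch of the extended enum; the existing values are assumed from the current headers, not quoted from this PR:

enum llama_swa_type {
    LLAMA_SWA_TYPE_NONE      = 0,
    LLAMA_SWA_TYPE_STANDARD  = 1,
    LLAMA_SWA_TYPE_CHUNKED   = 2,
    LLAMA_SWA_TYPE_SYMMETRIC = 3, // new: window extends both before and after the token
};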
src/llama-model.cpp (outdated)

    inpL = build_norm(inpL, model.tok_norm, nullptr, LLM_NORM, -1);
    cb(inpL, "inp_norm", -1);

    auto * inp_attn = build_attn_inp_no_cache_iswa();
Since this is not an actual iSWA (interleaved SWA) model, we should simply use build_attn_inp_no_cache().
src/llama-graph.h (outdated)

@@ -241,6 +249,7 @@ class llm_graph_input_attn_no_cache : public llm_graph_input_i {

    const llama_hparams & hparams;
    const llama_cparams & cparams;
    const int n_swa; // Sliding window attention size (0 = disabled)
This is already available from the hparams - no need to duplicate it here.
src/llama-graph.cpp (outdated)

    // Check if we're using sliding window attention
    if (n_swa > 0) {
        const int64_t n_tokens     = ubatch->n_tokens;
        const int64_t n_seq_tokens = ubatch->n_seq_tokens;
This branch is actually non-causal attention + sliding window. So merge it with the existing implementation below.
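A sketch of what the merged check could look like inside the existing non-causal mask fill (seq_eq, p0, p1, idx, and data are assumed local names; the symmetric window is an assumption for this model):

// Sketch: one cell of the non-causal mask, extended with an optional symmetric window.
const bool out_of_window = n_swa > 0 && std::abs(p1 - p0) > n_swa/2;
data[idx] = (!seq_eq || out_of_window) ? -INFINITY : 0.0f;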
Ok, the issue with ... That doesn't explain the issue with modernbert, unfortunately (though I did try it for fun with Alibaba-NLP/gte-reranker-modernbert-base ... it seems to give reverse scores).
@CISC cc: @ggerganov I tried computing embeddings with various models, but the outputs barely change between attempts. Maybe the parameter loading or the inference graph has a problem somewhere. Can you check that part?
So, I just noticed at least part of the problem: Lines 1567 to 1571 in 3ac6753
We have cls, but not cls_b, so this has to be modified to handle that...
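A minimal sketch of the kind of guard needed (tensor and context names are assumed, not quoted from the referenced lines): apply the classification head and only add the bias when it exists.

// Sketch: ModernBERT provides cls but no cls_b, so the bias add must be optional.
cur = ggml_mul_mat(ctx0, model.cls, cur);
if (model.cls_b != nullptr) {
    cur = ggml_add(ctx0, cur, model.cls_b);
}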
// feed-forward network
ggml_tensor * ffn_up = build_lora_mm(model.layers[il].ffn_up, cur);
cb(ffn_up, "ffn_up", il);

int64_t split_point = ffn_up->ne[0] / 2;
ggml_tensor * output_ffn_up = ggml_cont(ctx0, ggml_view_2d(
    ctx0, ffn_up, split_point,
    ffn_up->ne[1], ffn_up->nb[1], 0
));
ggml_tensor * output_ffn_gate = ggml_cont(ctx0, ggml_view_2d(
    ctx0, ffn_up, split_point,
    ffn_up->ne[1], ffn_up->nb[1],
    split_point * ggml_element_size(ffn_up)
));

// Apply activation function
output_ffn_up = ggml_gelu(ctx0, output_ffn_up);

// Element-wise multiplication
ggml_tensor * gated = ggml_mul(ctx0, output_ffn_up, output_ffn_gate);
cb(gated, "ffn_gated", il);

// Final projection
cur = build_lora_mm(model.layers[il].ffn_down, gated);
This should be merged into build_ffn as LLM_FFN_GEGLU.
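For illustration, roughly what the merged call might look like, assuming an LLM_FFN_GEGLU op type that splits the fused ffn_up tensor internally; the argument layout follows the current build_ffn signature as I understand it and is not code from this PR:

// Sketch: replace the manual split/gelu/mul/down sequence with one build_ffn call.
cur = build_ffn(cur,
        model.layers[il].ffn_up,   NULL, NULL,   // fused up+gate projection
        NULL,                      NULL, NULL,   // no separate gate tensor
        model.layers[il].ffn_down, NULL, NULL,
        NULL,
        LLM_FFN_GEGLU, LLM_FFN_SEQ, il);
cb(cur, "ffn_out", il);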
Probably worth making a separate PR for visibility.
I don't know whether my implementation is correct or not.