Support for Llama-3-Nemotron models #377
Conversation
I downloaded the source model and was able to convert it with:
Well, you see what
It's quantizing now. Edit: I guessed the value for the big one based on the difference in the number of layers between them.
Apart from the 253B version, which is beyond my reach, will this add support for this model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct? What about https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1, which seems more recent?
Support for that is not added yet; it is missing ggml-org/llama.cpp#12843.
That is the one I am testing with right now.
That one should work (though the Python convert script might not), but you may need to add the n_attention_wv value if it is different.
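A minimal sketch of one way to check that value from an already-converted GGUF, assuming (this is an assumption, not something stated above) that n_attention_wv should equal the number of layers that actually carry an attn_v.weight tensor; it uses the gguf Python package that ships with llama.cpp and a placeholder file path:

```python
# Hedged sketch: count attn_v.weight tensors in a converted GGUF.
# Requires the gguf Python package (pip install gguf).
import sys

from gguf import GGUFReader


def count_attn_v(path: str) -> int:
    reader = GGUFReader(path)  # memory-maps the file; metadata access is cheap
    return sum(1 for t in reader.tensors if t.name.endswith("attn_v.weight"))


if __name__ == "__main__":
    # Usage: python count_attn_v.py path/to/model.gguf (placeholder path)
    print(count_attn_v(sys.argv[1]))
```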
It is coherent in the CLI. I will sweep-bench it later.
I get this error when I try to run the 49B model (after adjusting the
It works if I convert with mainline, so something is missing in the conversion script.
Thanks for testing that, I'll look into the script. |
Can you try Llama-3_1-Nemotron-Ultra-253B now?
Oh, I forgot to comment on that one. I solved it for the 49B model by simply accepting
Can you test the conversion again? This is good to review again; I'm done pushing changes.
I'm running something on the computer where I downloaded the model. I'll test in a bit when the run finishes.
Take your time, I'm heading off for now anyway.
When I run the mainline conversion script, I see this:
But when I run the conversion script in the PR, I see this:
So, something is not quite right with the merges. But I'm actually OK with the conversion script not working. We already have other models that require mainline for conversion to GGUF.
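One way to pin down what differs in the merges is to diff the metadata of the two converted files. A hedged sketch with the gguf Python package, under the assumption that len(field.data) gives the element count for array-valued fields such as tokenizer.ggml.merges (both file paths are placeholders):

```python
# Hedged sketch: diff GGUF metadata keys and array lengths between the
# mainline-converted and PR-converted files.
import sys

from gguf import GGUFReader


def metadata_summary(path: str) -> dict[str, int]:
    reader = GGUFReader(path)
    # For array fields (e.g. tokenizer.ggml.merges), len(field.data) is the
    # number of elements; for scalar fields it is 1.
    return {name: len(field.data) for name, field in reader.fields.items()}


a = metadata_summary(sys.argv[1])  # e.g. the mainline conversion
b = metadata_summary(sys.argv[2])  # e.g. this PR's conversion
for key in sorted(set(a) | set(b)):
    if a.get(key) != b.get(key):
        print(f"{key}: {a.get(key)} vs {b.get(key)}")
```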
Thanks for the work! I'm trying L3 Nemotron 253B Q3_K_XL from unsloth (https://huggingface.co/unsloth/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF/tree/main/UD-Q3_K_XL); here is how the log looks:
And it seems to work without issues. Not sure if there's a flag that could improve things for dense models. Also not exactly sure how to enable thinking, but maybe that depends on the UI when using it via the API.
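On enabling thinking: NVIDIA's model cards for these Nemotron releases describe toggling reasoning via the system prompt ("detailed thinking on" / "detailed thinking off") rather than a dedicated flag, so it should work from any OpenAI-compatible client. A minimal sketch, assuming a llama-server-style endpoint on localhost:8080 and that this system-prompt convention applies to the exact checkpoint in use:

```python
# Hedged sketch: enable Nemotron "thinking" via the system prompt when
# calling an OpenAI-compatible chat endpoint (assumed to be on :8080).
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "nemotron",  # placeholder; many servers ignore this field
        "messages": [
            # Per NVIDIA's model-card convention; use "detailed thinking off"
            # to disable reasoning traces.
            {"role": "system", "content": "detailed thinking on"},
            {"role": "user", "content": "Briefly explain what a dense model is."},
        ],
        "max_tokens": 512,
    },
    timeout=600,
)
print(response.json()["choices"][0]["message"]["content"])
```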
With the commit that I just pushed
But then I see a difference in PPL. I didn't run the
Quantization is done in exactly the same way, and I'm running with the exact same parameters on the same hardware, so something else is different in the converted
OK, doing
From my perspective this is ready to merge.
Just waiting for @Lissanro to confirm that it is working for them.
I tried at first using this command:
It loaded successfully, but when trying inference I got this error:
With 12 layers on GPU the error is the same (it loads fine, but crashes when I try to use it). If I remove
As a last resort, I tried to load only on CPU (
...then at first I thought it worked. So it seems there is an issue specific to CUDA, but CPU-only mode works. Please let me know if additional debugging from my side could help, and if so what steps I need to follow.
Can you try
Thanks.
Nice, I see you grabbed the only changes to vocab.py that we were behind on: ggml-org/llama.cpp@8ba3858 and ggml-org/llama.cpp@a686171. I think you might have been able to cherry-pick those commits directly.
Interesting, do you mind checking with gguf-hash or some other tool whether that one changed tensor is the only difference? I am curious to know why this PR ends up with one fewer f32 tensor than mainline.
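A hedged sketch of that check using the gguf Python package instead of gguf-hash (the two paths are placeholders, and hashing every tensor of a model this size will take a while):

```python
# Hedged sketch: compare two GGUFs tensor-by-tensor (name, dtype, shape,
# and a content hash) to confirm which tensors actually differ.
import hashlib
import sys

from gguf import GGUFReader


def tensor_summary(path: str) -> dict[str, tuple]:
    reader = GGUFReader(path)
    summary = {}
    for t in reader.tensors:
        digest = hashlib.sha256(t.data.tobytes()).hexdigest()[:16]
        summary[t.name] = (t.tensor_type.name, tuple(int(d) for d in t.shape), digest)
    return summary


a = tensor_summary(sys.argv[1])
b = tensor_summary(sys.argv[2])
for name in sorted(set(a) | set(b)):
    if a.get(name) != b.get(name):
        print(f"{name}: {a.get(name)} vs {b.get(name)}")
```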
Sure, here is the full log: https://pastebin.com/TjqnExDv - it loaded fine, then when I attempted inference it crashed with this error:
I'm noticing that there is an issue for MoE models when the partial GPU offload is not done via tensor overrides (as in the command above). I'll try to figure out what is wrong, but in the meantime can you try this:
This is similar to what you tried, but it will load all attention on the GPUs along with the first 32 layers of the experts; the remaining experts will be on the CPU. Not sure about the context, you may need to reduce it somewhat. Thanks!
Correct me if I'm wrong, but isn't Nemotron 253B a dense model? So no experts and such.
Oops, I'm getting confused. Doing too many things at a time. Not sure then why partial offload is not working. |
I used
Hashes are identical. The other difference is that ours is
I'm not sure why it is missing (and whether it would cause worse quality at long contexts); the conversion script looks like it handles it. I can see that tensor in GGUFs that are on Hugging Face for these models, so it does seem like it should be there.
This one I understand: they both map to MOSTLY_BF16 (ik_llama.cpp source and llama.cpp source).
If there is still something I need to test, please let me know (my understanding is that the last command was given under the assumption the model was MoE, but since it is a dense model, I assume I need some other command to test, or maybe I have already provided all the debug info possible from my side). In any case, thank you very much for looking into this.
I think I'll merge this one despite the missing tensor.
I think I figured it out (or at least I found one reason why it is missing, if it turns out there are more); I'll make a PR later (heading off for a bit right now).
Port of ggml-org/llama.cpp#10669
It compiles; I have not tested it yet. Testers welcome, but I will try to test it myself later.