Support for Llama-3-Nemotron models #377


Merged · 8 commits into main · May 9, 2025

Conversation

@saood06 (Collaborator) commented May 4, 2025

Port of ggml-org/llama.cpp#10669

It compiles, but I have not tested it yet. Testers are welcome; I will try to test it myself later.

@saood06 (Collaborator Author) commented May 4, 2025

I downloaded the source model and was able to convert it with convert_hf_to_gguf.py, but I hit an error when attempting to quantize it.

llama.cpp:19268: GGML_ASSERT((qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer || qs.n_attention_wv == 3 * (int)model.hparams.n_layer) && "n_attention_wv is unexpected") failed

@ikawrakow (Owner)

Well, you check what n_attention_wv actually is and add another rule for accepting it. This is presumably because of the layers that don't have the usual attention mechanism.

@saood06 (Collaborator Author) commented May 4, 2025

It's quantizing now.

Edit: I guessed the value for the big one based on the difference in the number of layers between the two models.

@ikawrakow (Owner)

Apart from the 253B version that is beyond my reach, this will add support for this model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct ?

What about https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1 which seems more recent?

@saood06 (Collaborator Author) commented May 4, 2025

Apart from the 253B version that is beyond my reach

Support for that is not added yet; that one is still missing ggml-org/llama.cpp#12843.

this will add support for this model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct ?

That is the one I am testing with right now.

What about https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1 which seems more recent?

That one should work (though the Python conversion script might not), but you may need to add the n_attention_wv value if it is different.
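
If someone wants to derive that value rather than guess it, here is a rough, untested sketch (the count_attention_blocks helper below is hypothetical; it assumes the DeciLM-style config.json marks attention-free blocks via block_configs[i].attention.no_op, and variants that replace attention with a linear layer may need extra handling):

```python
# Hypothetical helper (not part of this PR): derive the expected
# n_attention_wv count for a DeciLM/Nemotron checkpoint from its HF config
# instead of guessing it from the layer count.  Assumes config.json has a
# "block_configs" list where attention-free blocks carry
# {"attention": {"no_op": true}}.
import json
import sys
from pathlib import Path

def count_attention_blocks(model_dir: str) -> int:
    config = json.loads((Path(model_dir) / "config.json").read_text())
    blocks = config.get("block_configs", [])
    # Every block that keeps real attention contributes one attn_v.weight
    # tensor, which is what qs.n_attention_wv counts during quantization.
    return sum(1 for b in blocks if not b.get("attention", {}).get("no_op", False))

if __name__ == "__main__":
    print(count_attention_blocks(sys.argv[1]))
```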

@saood06 saood06 requested a review from ikawrakow May 4, 2025 13:17
@saood06 saood06 marked this pull request as ready for review May 4, 2025 13:17
@saood06 (Collaborator Author) commented May 4, 2025

It is coherent in the CLI.

I will sweep-bench it later.

@ikawrakow (Owner)

I get this error when I try to run the 49B model (after adjusting the n_attention_wv check):

llama_model_load: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file

@ikawrakow (Owner)

Works if I convert with mainline, so something is missing in the conversion script.

@saood06 (Collaborator Author) commented May 4, 2025

Works if I convert with mainline, so something is missing in the conversion script.

Thanks for testing that, I'll look into the script.

@saood06 (Collaborator Author) commented May 4, 2025

@Lissanro

Can you try Llama-3_1-Nemotron-Ultra-253B now?

@ikawrakow (Owner)

the n_attention_wv check may be broken but everything else I think should be fine.

Oh, I forgot to comment on that one. I solved it for the 49B model by simply accepting any n_attention_wv value if model.arch == LLM_ARCH_DECI. That way we don't need to adjust the check for every variation they may come up with.

@saood06 saood06 changed the title Support for Llama-3_1-Nemotron-51B Support for Llama-3_1-Nemotron models May 4, 2025
@saood06 saood06 changed the title Support for Llama-3_1-Nemotron models Support for Llama-3-Nemotron models May 4, 2025
@saood06 (Collaborator Author) commented May 4, 2025

@ikawrakow

Can you test the conversion again? This is ready for review again; I'm done pushing changes.

@ikawrakow (Owner)

I'm running something on the computer where I downloaded the model. I'll test in a bit when the run finishes.

@saood06 (Collaborator Author) commented May 4, 2025

I'll test in a bit when the run finishes.

Take your time, I'm heading off for now anyways.

@ikawrakow (Owner)

When I run the mainline conversion script, I see this:

INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009

But when I run the conversion script in the PR, I see this:

INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:Set model tokenizer
WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009

So, something is not quite right with the merges.

But I'm actually OK with the conversion script not working. We already have other models that require mainline for conversion to GGUF.
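
For context, my understanding is that the missing merges are a tokenizer.json format issue: newer transformers releases store each merge as a two-element list rather than a space-separated string, which older vocab code does not recognize. A minimal sketch of the kind of fallback needed (not the actual gguf-py code):

```python
# Minimal sketch (not the actual gguf-py code): read BPE merges from
# tokenizer.json, accepting both the old "tok1 tok2" string form and the
# newer two-element list form used by recent transformers releases.
import json
from pathlib import Path

def load_merges(model_dir: str) -> list[str]:
    tok = json.loads((Path(model_dir) / "tokenizer.json").read_text(encoding="utf-8"))
    merges = tok.get("model", {}).get("merges", [])
    out = []
    for m in merges:
        # "Ġ Ġ" is kept as-is; ["Ġ", "Ġ"] is joined into "Ġ Ġ"
        out.append(m if isinstance(m, str) else " ".join(m))
    return out
```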

@Panchovix

Thanks for the work! I'm trying L3 Nemotron 253B Q3_K_XL from unsloth (https://huggingface.co/unsloth/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF/tree/main/UD-Q3_K_XL); here is how the log looks:

pancho@fedora:/run/media/pancho/6AE20D1AE20CEBDF/ChatIAs/ik_llama.cpp/build_linux/bin$ ./llama-server -m /run/media/pancho/08329F4A329F3B9E/models_llm/Llama-3_1-Nemotron-Ultra-253B-v1-UD-Q3_K_XL-00001-of-00003.gguf -c 12228 -ngl 163 -ts 6.5,6,10,4 --no-warmup -fa -ctk q8_0 -ctv q4_0 -mg 2
INFO [                    main] build info | tid="139738867924992" timestamp=1746386578 build=3671 commit="0e001215"
INFO [                    main] system info | tid="139738867924992" timestamp=1746386578 n_threads=8 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 43 key-value pairs and 648 tensors from /run/media/pancho/08329F4A329F3B9E/models_llm/Llama-3_1-Nemotron-Ultra-253B-v1-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deci
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama_Nemotron_Ultra
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                           general.finetune str              = 3_1-Nemotron-Ultra
llama_model_loader: - kv   5:                           general.basename str              = Llama-3_1-Nemotron-Ultra-253B-V1
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 253B
llama_model_loader: - kv   8:                            general.license str              = other
llama_model_loader: - kv   9:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv  10:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                        deci.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:               deci.attention.head_count_kv arr[i32,162]     = [8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, ...
llama_model_loader: - kv  16:                  deci.attention.head_count arr[i32,162]     = [128, 128, 128, 128, 128, 128, 128, 1...
llama_model_loader: - kv  17:                   deci.feed_forward_length arr[i32,162]     = [5376, 10752, 16128, 16128, 16128, 16...
llama_model_loader: - kv  18:                           deci.block_count u32              = 162
llama_model_loader: - kv  19:                        deci.context_length u32              = 131072
llama_model_loader: - kv  20:                      deci.embedding_length u32              = 16384
llama_model_loader: - kv  21:      deci.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  22:                  deci.attention.key_length u32              = 128
llama_model_loader: - kv  23:                deci.attention.value_length u32              = 128
llama_model_loader: - kv  24:                            deci.vocab_size u32              = 128256
llama_model_loader: - kv  25:                  deci.rope.dimension_count u32              = 128
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {{- bos_token }}{%- if messages[0]['r...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                          general.file_type u32              = 12
llama_model_loader: - kv  36:                      quantize.imatrix.file str              = Llama-3_1-Nemotron-Ultra-253B-v1-GGUF...
llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = unsloth_calibration_Llama-3_1-Nemotro...
llama_model_loader: - kv  38:             quantize.imatrix.entries_count i32              = 499
llama_model_loader: - kv  39:              quantize.imatrix.chunks_count i32              = 544
llama_model_loader: - kv  40:                                   split.no u16              = 0
llama_model_loader: - kv  41:                        split.tensors.count i32              = 648
llama_model_loader: - kv  42:                                split.count u16              = 3
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type q3_K:  162 tensors
llama_model_loader: - type q4_K:  326 tensors
llama_model_loader: - type q5_K:   13 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deci
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 16384
llm_load_print_meta: n_layer          = 162
llm_load_print_meta: n_head           = [128, 128, 128, 128, 128, 128, 128, 128, 128, 0, 0, 0, 0, 128, 128, 128, 128, 128, 0, 0, 0, 0, 0, 0, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 128, 128, 128, 0, 128, 0, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 128, 128, 128, 128, 0, 0, 128, 0, 0, 0, 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 128, 128, 0, 128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 128, 128, 0, 128, 128, 128, 128, 128, 128, 128, 128]
llm_load_print_meta: n_head_kv        = [8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 8, 8, 8, 0, 8, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 8, 8, 8, 8, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 8, 8, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8]
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = [16, 16, 16, 16, 16, 16, 16, 16, 16, 0, 0, 0, 0, 16, 16, 16, 16, 16, 0, 0, 0, 0, 0, 0, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 16, 16, 16, 0, 16, 0, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 16, 16, 16, 16, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 16, 16, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16, 16, 0, 16, 16, 16, 16, 16, 16, 16, 16]
llm_load_print_meta: n_embd_k_gqa     = [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 1024, 1024, 1024, 0, 1024, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 1024, 1024, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
llm_load_print_meta: n_embd_v_gqa     = [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 1024, 1024, 1024, 0, 1024, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 1024, 1024, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = [5376, 10752, 16128, 16128, 16128, 16128, 16128, 16128, 21504, 0, 0, 0, 0, 21504, 21504, 21504, 53248, 53248, 0, 0, 0, 0, 0, 0, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 26624, 0, 0, 0, 21504, 21504, 21504, 21504, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 37376, 37376, 37376, 0, 0, 32000, 26624, 26624, 26624, 26624, 26624, 26624, 0, 26624, 26624, 26624, 26624, 26624, 26624, 26624, 26624, 0, 0, 0, 0, 0, 32000, 53248, 53248, 53248, 0, 0, 0, 0, 0, 0, 0, 0, 399360, 0, 0, 0, 0, 0, 0, 0, 0, 425984, 0, 0, 0, 0, 0, 0, 0, 0, 343040, 0, 0, 0, 0, 0, 301056, 21504, 21504, 26624, 0, 26624, 26624, 37376, 53248, 53248, 53248, 53248, 26624]
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 405B
llm_load_print_meta: model ftype      = Q3_K - Medium
llm_load_print_meta: model params     = 253.401 B
llm_load_print_meta: model size       = 115.764 GiB (3.924 BPW) 
llm_load_print_meta: repeating layers = 113.318 GiB (3.906 BPW, 249.199 B parameters)
llm_load_print_meta: general.name     = Llama_Nemotron_Ultra
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
llm_load_tensors: ggml ctx size =    1.99 MiB
llm_load_tensors: offloading 162 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 163/163 layers to GPU
llm_load_tensors:        CPU buffer size =  1127.25 MiB
llm_load_tensors:      CUDA0 buffer size = 21995.70 MiB
llm_load_tensors:      CUDA1 buffer size = 22587.26 MiB
llm_load_tensors:      CUDA2 buffer size = 45199.39 MiB
llm_load_tensors:      CUDA3 buffer size = 27632.88 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 12288
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   429.02 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   292.52 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   331.53 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   195.01 MiB
llama_new_context_with_model: KV self size  = 1248.00 MiB, K (q8_0):  816.00 MiB, V (q4_0):  432.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =   412.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   420.00 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  2560.00 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  2086.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    56.01 MiB
llama_new_context_with_model: graph nodes  = 1708
llama_new_context_with_model: graph splits = 5
INFO [                    init] initializing slots | tid="139738867924992" timestamp=1746386887 n_slots=1
INFO [                    init] new slot | tid="139738867924992" timestamp=1746386887 id_slot=0 n_ctx_slot=12288
INFO [                    main] model loaded | tid="139738867924992" timestamp=1746386887
INFO [                    main] chat template | tid="139738867924992" timestamp=1746386887 chat_example="<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" built_in=true
INFO [                    main] HTTP server listening | tid="139738867924992" timestamp=1746386887 n_threads_http="15" port="8080" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="139738867924992" timestamp=1746386887

And it seems to work without issues


Not sure if there's a flag that could improve things for dense models. I'm also not exactly sure how to enable thinking, but maybe that depends on the UI when using it via the API.

@ikawrakow (Owner) commented May 5, 2025

With the commit that I just pushed convert_hf_to_gguf.py now converts the Nemotron-Super-49B model correctly.

But then I see a difference in PPL.

I didn't run the bf16 model directly (it comes dangerously close to the total RAM I have) but used Q8_0 quantization instead. I arrive at a lower PPL using the HF->GGUF conversion script in this PR than with the mainline conversion:

  • PPL = 7.0801 using mainline HF->GGUF
  • PPL = 7.0347 using this PR HF->GGUF

Quantization is done in exactly the same way and I'm running with the exact same parameters on the same hardware, so something else must be different in the converted bf16 models (and a simple diff tells me that the files differ).

OK, doing diff on the logs, I see this difference:

llama_model_loader: - type  f32:  131 tensors   (mainline)
vs
llama_model_loader: - type  f32:  130 tensors   (this PR)

@ikawrakow (Owner) left a review comment:

From my perspective this is ready to merge.
Just waiting for @Lissanro to confirm that it is working for them.

@Lissanro commented May 5, 2025

At first, I tried using this command:

~/pkgs/ik_llama.cpp/build/bin/llama-server \
--model /mnt/secondary/neuro/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF-UD-Q4_K_XL-131072seq/Llama-3_1-Nemotron-Ultra-253B-v1-UD-Q4_K_XL-00001-of-00004.gguf \
--ctx-size 81920 --n-gpu-layers 36 --tensor-split 25,25,25,25 \
-fa -ctk q8_0 -ctv q8_0 --threads 64 --host 0.0.0.0 --port 5000 --split-mode row

It loaded successfully, but when trying inference I got this error:

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:3054
  cudaStreamSynchronize(cuda_ctx->stream())
/home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error

With 12 layers on the GPU the error is the same (it loads fine, but crashes when I try to use it). If I remove --split-mode row, it also results in the same error.

As a last resort, I tried to load it on the CPU only (CUDA_VISIBLE_DEVICES="" is necessary, otherwise it still tries to use CUDA):

CUDA_VISIBLE_DEVICES="" ~/pkgs/ik_llama.cpp/build/bin/llama-server \
--model /mnt/secondary/neuro/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF-UD-Q4_K_XL-131072seq/Llama-3_1-Nemotron-Ultra-253B-v1-UD-Q4_K_XL-00001-of-00004.gguf \
--ctx-size 81920 -fa -ctk q8_0 -ctv q8_0 --threads 64 --host 0.0.0.0 --port 5000

...then at first I thought it worked. So it seems there is an issue specific to CUDA, but CPU-only mode works. Please let me know if additional debugging from my side could help, and if so what steps I need to follow.

@ikawrakow (Owner)

Can you try

~/pkgs/ik_llama.cpp/build/bin/llama-server \
--model /mnt/secondary/neuro/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF-UD-Q4_K_XL-131072seq/Llama-3_1-Nemotron-Ultra-253B-v1-UD-Q4_K_XL-00001-of-00004.gguf \
--ctx-size 81920 --n-gpu-layers 36 --tensor-split 25,25,25,25 \
-fa -ctk q8_0 -ctv q8_0 --threads 64 --host 0.0.0.0 --port 5000 -fmoe

Thanks.

@saood06 (Collaborator Author) commented May 5, 2025

With the commit that I just pushed convert_hf_to_gguf.py now converts the Nemotron-Super-49B model correctly.

Nice, I see you grabbed the only changes to vocab.py that we were behind on: ggml-org/llama.cpp@8ba3858 and ggml-org/llama.cpp@a686171. I think you could have cherry-picked those commits directly.

But then I see a difference in PPL.

I didn't run the bf16 model directly (it comes dangerously close to the total RAM I have) but used Q8_0 quantization instead. I arrive at a lower PPL using the HF->GGUF conversion script in this PR than with the mainline conversion:

* `PPL = 7.0801` using mainline HF->GGUF

* `PPL = 7.0347`  using this PR HF->GGUF

Quantization is done in exactly the same way and I'm running with the exact same parameters on the same hardware, so something else must be different in the converted bf16 models (and a simple diff tells me that the files differ).

OK, doing diff on the logs, I see this difference:

llama_model_loader: - type  f32:  131 tensors   (mainline)
vs
llama_model_loader: - type  f32:  130 tensors   (this PR)

Interesting. Do you mind checking with gguf-hash or some other tool whether that one changed tensor is the only difference? I am curious why this PR produces one f32 tensor fewer than mainline.

@Lissanro commented May 5, 2025

Can you try
~/pkgs/ik_llama.cpp/build/bin/llama-server
--model /mnt/secondary/neuro/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF-UD-Q4_K_XL-131072seq/Llama-3_1-Nemotron-Ultra-253B-v1-UD-Q4_K_XL-00001-of-00004.gguf
--ctx-size 81920 --n-gpu-layers 36 --tensor-split 25,25,25,25
-fa -ctk q8_0 -ctv q8_0 --threads 64 --host 0.0.0.0 --port 5000 -fmoe

Sure, here is the full log: https://pastebin.com/TjqnExDv - it loaded fine, but when I attempted inference it crashed with this error:

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:3054
  cudaStreamSynchronize(cuda_ctx->stream())
/home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error

@ikawrakow (Owner)

@Lissanro

I'm noticing that there is an issue for MoE models when the partial GPU offload is not done via tensor overrides (as in the command above). I'll try to figure out what is wrong, but in the meantime can you try this:

~/pkgs/ik_llama.cpp/build/bin/llama-server
--model /mnt/secondary/neuro/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF-UD-Q4_K_XL-131072seq/Llama-3_1-Nemotron-Ultra-253B-v1-UD-Q4_K_XL-00001-of-00004.gguf
--ctx-size 81920 --n-gpu-layers 100 --tensor-split 25,25,25,25
-fa -ctk q8_0 -ctv q8_0 --threads 64 --host 0.0.0.0 --port 5000 -fmoe
-ot "blk\.3[2-9]\.ffn=CPU,blk\.[4-9][0-9]\.ffn=CPU"

This is similar to what you tried, but it will load all attention tensors on the GPUs along with the first 32 layers of the experts; the remaining experts will be on the CPU. I'm not sure about the context size, you may need to reduce it somewhat.

Thanks!
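
For reference, a small illustrative snippet (hypothetical, not part of the PR) can list which block indices those two override regexes match, assuming the usual blk.<N>.ffn_* tensor naming:

```python
# Illustrative check of the -ot pattern above: which ffn blocks get pinned
# to the CPU?  Assumes tensor names like "blk.<N>.ffn_down.weight".
import re

patterns = [re.compile(p) for p in (r"blk\.3[2-9]\.ffn", r"blk\.[4-9][0-9]\.ffn")]

cpu_blocks = [
    n for n in range(162)  # the 253B model has 162 blocks
    if any(p.search(f"blk.{n}.ffn_down.weight") for p in patterns)
]
print(cpu_blocks)  # blocks 32..99, i.e. their ffn tensors are kept on the CPU
```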

@Panchovix

Correct me if I'm wrong, but isn't Nemotron 253B a dense model? So there are no experts and such.

@ikawrakow (Owner)

Correct me if I'm wrong, but isn't Nemotron 253B a dense model? So there are no experts and such.

Oops, I'm getting confused, doing too many things at once. Not sure then why partial offload is not working.

@ikawrakow (Owner)

Interesting. Do you mind checking with gguf-hash or some other tool whether that one changed tensor is the only difference? I am curious why this PR produces one f32 tensor fewer than mainline.

I used gguf-dump.py, and the missing tensor is rope_freqs.

Hashes are identical.

The other difference is that ours is general.file_type = 24, while theirs is general.file_type = 32. I don't know what that means.
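
For reference, this kind of comparison can also be scripted with the gguf Python package instead of eyeballing the gguf-dump.py output; a small sketch, assuming gguf-py is installed and exposes GGUFReader as below:

```python
# Sketch: diff the tensor name lists of two GGUF conversions to spot
# tensors (such as rope_freqs) present in one file but not the other.
# Assumes the gguf-py package is installed (pip install gguf).
import sys
from gguf import GGUFReader

def tensor_names(path: str) -> set[str]:
    return {t.name for t in GGUFReader(path).tensors}

a, b = sys.argv[1], sys.argv[2]
print("only in", a, ":", sorted(tensor_names(a) - tensor_names(b)))
print("only in", b, ":", sorted(tensor_names(b) - tensor_names(a)))
```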

@saood06 (Collaborator Author) commented May 7, 2025

I used gguf-dump.py, and the missing tensor is rope_freqs.

I'm not sure why it is missing (or whether it would cause worse quality at long contexts); the conversion script looks like it handles it.

I can see that tensor in GGUFs that are on Hugging Face for these models, so it does seem like it should be there.

The other difference is that ours is general.file_type = 24, while theirs is general.file_type = 32. I don't know what that means.

This one I understand: they both map to MOSTLY_BF16 (see the ik_llama.cpp and llama.cpp sources).

@Lissanro commented May 7, 2025

If there is still something I need to test, please let me know. My understanding is that the last command was given under the assumption that this is a MoE model; since it is a dense model, I assume I either need some other command to test, or I have already provided all the debug info I can from my side. In any case, thank you very much for looking into this.

@ikawrakow (Owner)

I think I'll merge this one despite the missing rope_freqs tensor. We can try to sort out later why it is missing if people find performance degradation with long context.

@ikawrakow ikawrakow merged commit bc6ae51 into main May 9, 2025
@saood06 (Collaborator Author) commented May 9, 2025

I think I'll merge this one despite the missing rope_freqs tensor. We can try to sort out later why it is missing if people find performance degradation with long context.

I think I figured it out (or at least found one reason why it is missing, if it turns out there are more). I'll make a PR later (heading off for a bit right now).

Successfully merging this pull request may close these issues.

Bug: unknown model architecture: 'deci' (when loading Llama-3_1-Nemotron-Ultra-253B)