Add support for bitnet2b_2501 model #337
Conversation
I fetched the model from https://huggingface.co/microsoft/bitnet-b1.58-2B-4T. When I try to run it, I get an error. And after noticing that the architecture is now "BitNetForCausalLM" instead of "BitnetForCausalLM" and fixing that, I get another error.
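For reference, a minimal sketch (a hypothetical helper, not the converter's actual code; the canonical spelling it maps to is an assumption) of reading the architecture string from config.json and folding the new spelling into the old one:

```python
import json
from pathlib import Path

def read_architecture(model_dir: str) -> str:
    """Read the HF architecture name from config.json and normalize the two
    spellings seen in the wild ("BitnetForCausalLM" vs. the newer
    "BitNetForCausalLM") to a single form the converter already knows."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    arch = config["architectures"][0]
    if arch == "BitNetForCausalLM":  # spelling used by bitnet-b1.58-2B-4T
        arch = "BitnetForCausalLM"   # older spelling expected downstream
    return arch

# e.g. read_architecture("bitnet-b1.58-2B-4T") -> "BitnetForCausalLM"
```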
I can reproduce the issue with the safetensors conversion, but using the method outlined in #169 I was able to get it running.
Full log inside
I even ran it with the same prompt that you ran on the other BitNet models.
Full log inside
They seem to have a separate script in the PR that converts the model, but I'm having issues using that script with it placed in ik_llama.cpp, as it hooks into gguf-py. (Well, first I had to comment out the torch compile on line 948, which did not work since I have CPU-only triton on that system.) It hit this error.
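If it helps, here is a small sketch (a hypothetical wrapper, not the script's actual code) of guarding that torch.compile call so the script can fall back to eager mode instead of needing the line commented out:

```python
import torch

def maybe_compile(fn):
    """Hypothetical guard around the script's torch.compile call: only
    compile when a CUDA device is present (a CPU-only triton setup like the
    one described above is what broke it here); otherwise, or if compilation
    itself raises, return the eager function unchanged."""
    if torch.cuda.is_available():
        try:
            return torch.compile(fn)
        except Exception:
            pass  # compilation failed; fall back to eager
    return fn
```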
For now maybe we can just have GGUF-only support, relying on external tools to convert from safetensors, just like Gemma3?

Edit: Peak speed for me is at 24 threads; I would be curious to see it on your machines since you have a lot of comparative numbers.

Edit 2: Pushed the Python fix for the new name even though that file still doesn't work. I don't see a point in pushing the standalone file since I still can't get that to work either. If they are going to have a standalone file, we may as well tell people to grab a GGUF (I could even upload one for this model; it's small enough).

Edit 3: Even higher speeds with the R4 variant.

Edit 4: Using bench to see some numbers for both, where now 48 threads seems better again; showing the best result for both the R4 and normal variants. Very informal testing, with no dropping of caches or other precautions taken.

Edit 5: It is available on huggingface here.

Edit 6: Another informal benchmark, this time sweep bench.
Yes, I got it running by converting the
Is
Here
It does seem to have an issue with using EOS tokens and stopping generation, so there is still a problem there.
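A quick way to check what the checkpoint itself declares (a minimal sketch, assuming the tokenizer for this model loads with stock transformers; not part of this PR) is to print the tokenizer's EOS token and render its chat template:

```python
from transformers import AutoTokenizer

# Inspect what the HF checkpoint declares as EOS and what its chat
# template actually renders; model id taken from the link above.
tok = AutoTokenizer.from_pretrained("microsoft/bitnet-b1.58-2B-4T")
print("eos_token:", tok.eos_token, "id:", tok.eos_token_id)

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]
print(tok.apply_chat_template(messages, tokenize=False))
```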
Here are the results of the official Microsoft BitNet implementation (build a8ac7072, just pulled).
BitNet is a
I think we can merge like this. It is fine to just use I2_S GGUFs. We can sort out the pre-tokenizer issue later.
Okay, I'll make an issue. I tested the model more: it is coherent and can even do multi-turn conversation. It just never uses an EOS token, so it never stops its own generation; it will just continue until I stop it. And I still don't really understand its chat template:
I couldn't get flash attention running; it would always just exit with
Something is missing in the logic for your number of threads. The model has an unusual number of attention heads: 20 in total, with 5 KV heads. I'm working on a better strategy for distributing the work between the threads.
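To illustrate (a toy sketch, not the actual ik_llama.cpp scheduling code), a naive ceil-division split of 20 heads leaves many threads idle or unevenly loaded at common thread counts:

```python
# Toy illustration of splitting n_head = 20 attention heads across
# different thread counts with naive ceil-division chunking.
def split_heads(n_head: int, n_threads: int) -> list[int]:
    per_thread = -(-n_head // n_threads)  # ceil division
    return [max(0, min(per_thread, n_head - i * per_thread))
            for i in range(n_threads)]

for t in (16, 24, 32, 48):
    counts = split_heads(20, t)
    busy = sum(1 for c in counts if c > 0)
    print(f"{t:2d} threads -> {busy} busy, chunk sizes: {sorted(set(counts), reverse=True)}")
```

With 16 threads only 10 of them get any heads under this naive split (2 heads each), and with 48 threads 28 of them sit idle, which is consistent with throughput varying oddly with the thread count.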
I see. Yes, I can get it working with 16 and 32 threads, but I can't give performance numbers now as I can't drop my caches right now.
A very direct port of microsoft/BitNet#167, more specifically this commit: Eddie-Wang1120/llama.cpp@a8ac707.
I had to make some minor additional fixes; it now compiles.
I have not run the model yet.