
Warmup and NUMA changes added, MLA changes updated #1


Closed
wants to merge 19 commits

Conversation

saood06 (Owner) commented Feb 3, 2025

This is actually usable at much higher context sizes. Main worked fine under llama-batched-bench, but in the server it paged out, performed worse, and was limited to ~8K. This branch is tested in the server at ~8K with TG at 2.18 t/s, and is currently launched with 64K context; 128K errors out. Tested on a dual-socket Xeon E5-2690 v3 with 384 GB of RAM.

CPU buffer size = 362010.72 MiB
n_ctx      = 64000
CPU NUMA KV buffer size = 313101.56 MiB
KV self size  = 305000.00 MiB, K (f16): 183000.00 MiB, V (f16): 122000.00 MiB
KV self size  = 4289.06 MiB, K^R (f16):  476.56 MiB, c^KV (f16): 3812.50 MiB
NUMA compute buffer size = 32343.01 MiB
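As a sanity check on those numbers (assuming the usual DeepSeek-R1/V3 attention shapes: 61 layers, 128 heads, K head size 192, V head size 128, RoPE head size 64, KV LoRA rank 512), the f16 tensor sizes at n_ctx = 64000 work out to:

K:    64000 * 61 * 128 * 192 * 2 bytes = 183000 MiB
V:    64000 * 61 * 128 * 128 * 2 bytes = 122000 MiB
c^KV: 64000 * 61 * 512 * 2 bytes = 3812.50 MiB
K^R:  64000 * 61 * 64 * 2 bytes = 476.56 MiB

So the MLA cache (c^KV + K^R) is roughly 70x smaller than the full K/V cache, which is still allocated but unused (see the to-dos below).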
  • To-do: Make NUMA actually toggle based on whether NUMA is enabled or not (see the sketch after this list).
  • To-do: Sync RPC to make it functional and performant, and add override of model tensor buffers [#11397] (this will have limited practical benefit while the old KV cache is still being allocated but not used). Edit: based on initial tests on llama.cpp, this causes a performance loss and also does not function past ~28 layers.
  • To-do: Grab any FA implementation if one appears.
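For the first to-do item, the toggle could be as simple as gating the KV buffer type on ggml_is_numa(), which already exists in ggml. This is only a sketch: the NUMA buffer type name below is a placeholder for whatever the new allocator ends up being called, and the exact headers may differ between branches.

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical sketch: pick the KV cache buffer type depending on whether NUMA
// was enabled at startup. ggml_is_numa() and ggml_backend_cpu_buffer_type()
// exist in ggml; ggml_backend_numa_buffer_type() is a placeholder name for the
// allocator added in this branch.
static ggml_backend_buffer_type_t kv_cache_buffer_type(void) {
    return ggml_is_numa() ? ggml_backend_numa_buffer_type()
                          : ggml_backend_cpu_buffer_type();
}
```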

ikawrakow and others added 7 commits January 29, 2025 14:05
* Adding gp option to llama-bench

Similar to pg, but it only looks at TG speed with a given
prompt length.

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

They still need to be divisible by 32.

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

.. on NEON

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

.., on AVX2

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

.., on AVX2

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on NEON

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on Zen4.

Also fix q8_0 K-cache for head sizes that are not a multiple of 128.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Slightly faster AVX2 implementation for q4_k_r4

* Even better AVX2 implementation for q4_k_r4

We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a
Ryzen-5975WX CPU, up from 291 t/s when I last measured
on 3c5f872.
With FA and Q8_0 K-cache we get to 339.5 t/s.

* Fix llama-bench labels that I broke with ikawrakow#181

* Faster AVX2 implementation for q5_k_r4

We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU,
up from 273 t/s.

* Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4

After the changes I made to AVX2, it ends up being slightly faster
compared to what I had for Zen4.

* Minor tweak

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Quantization mixes tweaks

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on AVX2

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on AVX2

* Make q6_0_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make q6_0_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make q5_0_r4 work with row sizes that are not a multiple of 128

... on Zen4 and AVX2

* Make q5_0_r4, q6_0_r4, and iq4_nl_r4 work with row sizes that are not a multiple of 128

also on NEON.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Stanisław Szymczyk <[email protected]>
Co-authored-by: Stanisław Szymczyk <[email protected]>
saood06 changed the title from "Mla update warmup numa" to "Warmup and NUMA changes added, MLA changes updated" on Feb 3, 2025
@fairydreaming

Cool! Note that this mmap-based KV buffer allocator should work even without NUMA; I will probably name it ggml_backend_mmap_buffer_type.
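For anyone curious, a minimal sketch of what such an mmap-backed allocation could boil down to, assuming plain POSIX mmap (the helper names are illustrative, and the real ggml_backend buffer-type plumbing is omitted):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Sketch: back a large KV buffer with an anonymous mmap instead of malloc/new.
// On a NUMA system the first-touch policy then places each page on the node of
// the thread that first writes it; without NUMA it still gives page-aligned
// memory that is returned to the OS in one munmap call.
static void * kv_mmap_alloc(size_t size) {
    void * buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return buf == MAP_FAILED ? nullptr : buf;
}

static void kv_mmap_free(void * buf, size_t size) {
    if (buf) {
        munmap(buf, size);
    }
}
```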

saood06 (Owner, Author) commented Feb 3, 2025

Cool! Note that this mmap-based KV buffer allocator should work even without NUMA; I will probably name it ggml_backend_mmap_buffer_type.

Makes sense; even though it synergizes with NUMA, it still has benefits without it.

If you end up using this branch, I'd appreciate performance numbers.

I'm going to try to crudely stop the old KV cache from being allocated, because if that's done I think you could offload everything besides the non-shared experts with just 24 GB of VRAM at ~23K context, assuming it works. I was only able to RPC 29 layers even when I had ample VRAM; 30+ would just silently crash the RPC server (that was without the MLA branch; I have yet to test the diff you gave me to make the MLA branch work).
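For scale: the MLA cache in the numbers above is ~4.3 GiB at 64K context, so scaling linearly it would be roughly 4289 * 23/64 ≈ 1.5 GiB at ~23K context (rough arithmetic only, and it assumes the old cache really is no longer allocated), leaving most of a 24 GB card for the attention and shared-expert weights.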

saood06 (Owner, Author) commented Feb 4, 2025

Some runtime numbers showing 30K context.

kv cache rm [p0, end) | timestamp=1738584936 p0=3
kv cache rm [p0, end) | timestamp=1738585206 p0=2051
kv cache rm [p0, end) | timestamp=1738585594 p0=4099
kv cache rm [p0, end) | timestamp=1738586098 p0=6147
kv cache rm [p0, end) | timestamp=1738586716 p0=8195
kv cache rm [p0, end) | timestamp=1738587443 p0=10243
kv cache rm [p0, end) | timestamp=1738588289 p0=12291
kv cache rm [p0, end) | timestamp=1738589245 p0=14339
kv cache rm [p0, end) | timestamp=1738590323 p0=16387
kv cache rm [p0, end) | timestamp=1738591540 p0=18435
kv cache rm [p0, end) | timestamp=1738592866 p0=20483
kv cache rm [p0, end) | timestamp=1738594456 p0=22531
kv cache rm [p0, end) | timestamp=1738596175 p0=24579
prompt eval time     = 12074054.06 ms / 25522 tokens (  473.08 ms per token,     2.11 tokens per second) | timestamp=1738599260 t_prompt_processing=12074054.06 n_prompt_tokens_processed=25522 t_token=473.08416503408824 n_tokens_second=2.113788779905463
generation eval time = 2250383.89 ms /  2088 runs   ( 1077.77 ms per token,     0.93 tokens per second) | timestamp=1738599260 t_token_generation=2250383.888 n_decoded=2088 t_token=1077.7700613026818 n_tokens_second=0.9278416945366968
total time = 14324437.95 ms | timestamp=1738599260  t_prompt_processing=12074054.06 t_token_generation=2250383.888 t_total=14324437.948

At higher context you can see how PP slows down as it gets deeper into the prompt: from the timestamps above, the first 2048-token batch takes about 270 s (~7.6 t/s) while the last takes about 1719 s (~1.2 t/s).

Another high-context generation; there is less PP in this one because most of the previous prompt was cached.

kv cache rm [p0, end) | timestamp=1738649710 p0=26931
prompt eval time     =  126917.48 ms /   143 tokens (  887.53 ms per token,     1.13 tokens per second) | timestamp=1738652616 t_prompt_processing=126917.477 n_prompt_tokens_processed=143 t_token=887.5348041958042 n_tokens_second=1.1267163780761258
generation eval time = 2778653.10 ms /  2726 runs   ( 1019.32 ms per token,     0.98 tokens per second) |  timestamp=1738652616 t_token_generation=2778653.096 n_decoded=2726 t_token=1019.3151489361702 n_tokens_second=0.9810508565909877
 total time = 2905570.57 ms | timestamp=1738652616 id_slot=0 id_task=11466 t_prompt_processing=126917.477 t_token_generation=2778653.096 t_total=2905570.573

ikawrakow and others added 3 commits February 5, 2025 13:49
* iq1_s_r4: basics - quantize/dequantize

* iq1_s_r4: gemm/gemv works on AVX2/Zen4

* Don't forget to make sure we have a multiple of 4 rows per thread

* iq1_s_r4: this is better

* iq1_s_r4: fix Zen4 after AVX2 changes

* iq1_s_r4: NEON gemm/gemv

* iq1_s_r4: more bits for shared experts

With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.

On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.

* Forgotten counter increment

* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv

* Compiler warnings

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
ikawrakow and others added 9 commits February 6, 2025 14:08
* iq1_m_r4: basics (quantize/dequantize)

* iq1_m_r4: Zen4 gemm

* iq1_m_r4: neon gemm

* iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4

With the deltas being per group of 8, we cannot make use
of the q8 sums stored in q8_1, so we get a tiny gain by
using q8_0_x4.

* iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving

* Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving

* Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
saood06 (Owner, Author) commented Mar 27, 2025

No longer needed; all changes made their way upstream (the new KV cache is a draft PR).

Some rough deep-context timing info on the first quant of V3-0324 below.

INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743073276 id_slot=0 id_task=0 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743073509 id_slot=0 id_task=0 p0=2048
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743073812 id_slot=0 id_task=0 p0=4096
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743074174 id_slot=0 id_task=0 p0=6144
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743074593 id_slot=0 id_task=0 p0=8192
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743075072 id_slot=0 id_task=0 p0=10240
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743075609 id_slot=0 id_task=0 p0=12288
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743076198 id_slot=0 id_task=0 p0=14336
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743076833 id_slot=0 id_task=0 p0=16384
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743077519 id_slot=0 id_task=0 p0=18432
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743078256 id_slot=0 id_task=0 p0=20480
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743079045 id_slot=0 id_task=0 p0=22528
INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743079892 id_slot=0 id_task=0 p0=24576
INFO [           print_timings] prompt eval time     = 7038544.97 ms / 25523 tokens (  275.77 ms per token,     3.63 tokens per second) | tid="94391278368704" timestamp=1743081431 id_slot=0 id_task=0 t_prompt_processing=7038544.974 n_prompt_tokens_processed=25523 t_token=275.7726354268699 n_tokens_second=3.6261755937172477
INFO [           print_timings] generation eval time = 1116594.60 ms /  1254 runs   (  890.43 ms per token,     1.12 tokens per second) | tid="94391278368704" timestamp=1743081431 id_slot=0 id_task=0 t_token_generation=1116594.603 n_decoded=1254 t_token=890.4263181818181 n_tokens_second=1.1230575507268508
INFO [           print_timings]           total time = 8155139.58 ms | tid="94391278368704" timestamp=1743081431 id_slot=0 id_task=0 t_prompt_processing=7038544.974 t_token_generation=1116594.603 t_total=8155139.5770000005


INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743084919 id_slot=0 id_task=1268 p0=25560
INFO [           print_timings] prompt eval time     =  452433.05 ms /  1020 tokens (  443.56 ms per token,     2.25 tokens per second) | tid="94391278368704" timestamp=1743085553 id_slot=0 id_task=1268 t_prompt_processing=452433.046 n_prompt_tokens_processed=1020 t_token=443.56180980392156 n_tokens_second=2.254477229322458
INFO [           print_timings] generation eval time =  181123.38 ms /   202 runs   (  896.65 ms per token,     1.12 tokens per second) | tid="94391278368704" timestamp=1743085553 id_slot=0 id_task=1268 t_token_generation=181123.382 n_decoded=202 t_token=896.6504059405942 n_tokens_second=1.1152618605586768
INFO [           print_timings]           total time =  633556.43 ms | tid="94391278368704" timestamp=1743085553 id_slot=0 id_task=1268 t_prompt_processing=452433.046 t_token_generation=181123.382 t_total=633556.428


INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743085774 id_slot=0 id_task=1472 p0=26781
INFO [           print_timings] prompt eval time     =   41350.86 ms /    58 tokens (  712.95 ms per token,     1.40 tokens per second) | tid="94391278368704" timestamp=1743087252 id_slot=0 id_task=1472 t_prompt_processing=41350.863 n_prompt_tokens_processed=58 t_token=712.9459137931034 n_tokens_second=1.4026309438813889
INFO [           print_timings] generation eval time = 1436835.68 ms /  1560 runs   (  921.05 ms per token,     1.09 tokens per second) | tid="94391278368704" timestamp=1743087252 id_slot=0 id_task=1472 t_token_generation=1436835.675 n_decoded=1560 t_token=921.0485096153847 n_tokens_second=1.0857191446057322
INFO [           print_timings]           total time = 1478186.54 ms | tid="94391278368704" timestamp=1743087252 id_slot=0 id_task=1472 t_prompt_processing=41350.863 t_token_generation=1436835.675 t_total=1478186.538
INFO [            update_slots] slot released | tid="94391278368704" timestamp=1743087252 id_slot=0 id_task=1472 n_ctx=80128 n_past=28398 n_system_tokens=0 n_cache_tokens=28398 truncated=false


INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743087584 id_slot=0 id_task=3075 p0=28340
INFO [           print_timings] prompt eval time     =    1376.04 ms /     1 tokens ( 1376.04 ms per token,     0.73 tokens per second) | tid="94391278368704" timestamp=1743087645 id_slot=0 id_task=3075 t_prompt_processing=1376.044 n_prompt_tokens_processed=1 t_token=1376.044 n_tokens_second=0.7267209478766666
INFO [           print_timings] generation eval time =   59883.59 ms /    66 runs   (  907.33 ms per token,     1.10 tokens per second) | tid="94391278368704" timestamp=1743087645 id_slot=0 id_task=3075 t_token_generation=59883.592 n_decoded=66 t_token=907.3271515151515 n_tokens_second=1.102138295244547
INFO [           print_timings]           total time =   61259.64 ms | tid="94391278368704" timestamp=1743087645 id_slot=0 id_task=3075 t_prompt_processing=1376.044 t_token_generation=59883.592 t_total=61259.636


INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743087705 id_slot=0 id_task=3143 p0=28406
INFO [           print_timings] prompt eval time     =   16752.33 ms /    24 tokens (  698.01 ms per token,     1.43 tokens per second) | tid="94391278368704" timestamp=1743088999 id_slot=0 id_task=3143 t_prompt_processing=16752.333 n_prompt_tokens_processed=24 t_token=698.013875 n_tokens_second=1.4326362781828657
INFO [           print_timings] generation eval time = 1277981.12 ms /  1347 runs   (  948.76 ms per token,     1.05 tokens per second) | tid="94391278368704" timestamp=1743088999 id_slot=0 id_task=3143 t_token_generation=1277981.121 n_decoded=1347 t_token=948.7610400890869 n_tokens_second=1.0540061804246323
INFO [           print_timings]           total time = 1294733.45 ms | tid="94391278368704" timestamp=1743088999 id_slot=0 id_task=3143 t_prompt_processing=16752.333 t_token_generation=1277981.121 t_total=1294733.4540000001


INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743090423 id_slot=0 id_task=4506 p0=29815
INFO [           print_timings] prompt eval time     =    1372.07 ms /     1 tokens ( 1372.07 ms per token,     0.73 tokens per second) | tid="94391278368704" timestamp=1743091271 id_slot=0 id_task=4506 t_prompt_processing=1372.07 n_prompt_tokens_processed=1 t_token=1372.07 n_tokens_second=0.7288257887717099
INFO [           print_timings] generation eval time =  847485.33 ms /   872 runs   (  971.89 ms per token,     1.03 tokens per second) | tid="94391278368704" timestamp=1743091271 id_slot=0 id_task=4506 t_token_generation=847485.333 n_decoded=872 t_token=971.8868497706421 n_tokens_second=1.0289263613722033
INFO [           print_timings]           total time =  848857.40 ms | tid="94391278368704" timestamp=1743091271 id_slot=0 id_task=4506 t_prompt_processing=1372.07 t_token_generation=847485.333 t_total=848857.4029999999


INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743092367 id_slot=0 id_task=5387 p0=30729
INFO [           print_timings] prompt eval time     =   14743.08 ms /    16 tokens (  921.44 ms per token,     1.09 tokens per second) | tid="94391278368704" timestamp=1743093142 id_slot=0 id_task=5387 t_prompt_processing=14743.078 n_prompt_tokens_processed=16 t_token=921.442375 n_tokens_second=1.0852550600356317
INFO [           print_timings] generation eval time =  760422.11 ms /   770 runs   (  987.56 ms per token,     1.01 tokens per second) | tid="94391278368704" timestamp=1743093142 id_slot=0 id_task=5387 t_token_generation=760422.111 n_decoded=770 t_token=987.5611831168832 n_tokens_second=1.012595489875228
INFO [           print_timings]           total time =  775165.19 ms | tid="94391278368704" timestamp=1743093142 id_slot=0 id_task=5387 t_prompt_processing=14743.078 t_token_generation=760422.111 t_total=775165.189
INFO [            update_slots] slot released | tid="94391278368704" timestamp=1743093142 id_slot=0 id_task=5387 n_ctx=80128 n_past=31514 n_system_tokens=0 n_cache_tokens=31514 truncated=false


INFO [            update_slots] kv cache rm [p0, end) | tid="94391278368704" timestamp=1743094012 id_slot=0 id_task=6159 p0=31109
INFO [           print_timings] prompt eval time     =  237285.02 ms /   429 tokens (  553.11 ms per token,     1.81 tokens per second) | tid="94391278368704" timestamp=1743095211 id_slot=0 id_task=6159 t_prompt_processing=237285.025 n_prompt_tokens_processed=429 t_token=553.1119463869463 n_tokens_second=1.8079522717457623
INFO [           print_timings] generation eval time =  961263.29 ms /   947 runs   ( 1015.06 ms per token,     0.99 tokens per second) | tid="94391278368704" timestamp=1743095211 id_slot=0 id_task=6159 t_token_generation=961263.287 n_decoded=947 t_token=1015.0615491024288 n_tokens_second=0.9851619351400445
INFO [           print_timings]           total time = 1198548.31 ms | tid="94391278368704" timestamp=1743095211 id_slot=0 id_task=6159 t_prompt_processing=237285.025 t_token_generation=961263.287 t_total=1198548.312

saood06 closed this Mar 27, 2025