### Name and Version

b5595 (commit 3a07714) and b5600 (commit d17a809), both built with CUDA.
### Operating systems
Windows
### Which llama.cpp modules do you know to be affected?
llama-server
### Command line

```shell
llama-server -m Qwen2.5-14B-Instruct-Q8_0.gguf -ngl 99 --temp 0 -fa -cb -c 44200 -np 17
llama-server -m Qwen2.5-1.5B-Instruct-Q8_0.gguf -ngl 99 --temp 0 -fa -cb -c 166400 -np 64
```
### Problem description & steps to reproduce

This assertion fails sporadically: `GGML_ASSERT(nf == nh && "KV defrag bug: nf != nh")`

The server often completes 2,000 or even 50,000 parallel inference tasks without issue, then fails at random. Prompt sizes range roughly from 100 to 600 tokens, and the number of generated tokens from about 8 to 2,000.

I added debug output; perhaps these numbers give a clue about what went wrong:
```
nf != nh (1681 != 1704)
i0: 1194, nh: 1704, nf: 1681, is: 1194, n_used: 2898, n_kv: 12260
```

The tracked `n_used` is 2898, but recounting occupied cells (checking `cells.is_empty` for indices 0 to `n_kv`) gives 2875. `is_empty` is true for cells 1194 to 2897 (other cells were not checked).
### First Bad Commit
No response