Description
Problem. The custom `seed` value is not passed to the inference engine when using the llama.cpp HTTP server (even though it works as expected in the `llama_cpp_python` package).
How to reproduce: on the latest Linux build of llama.cpp, repeat several times exactly the same cURL request to the `/completion` API endpoint of the llama.cpp HTTP server, with a prompt containing an open-ended question and with high `temperature` and `top_p` values (to maximize the variability of the model output), while keeping the `seed` fixed, e.g. the following request inferring from the 8-bit quant of the bartowski/Meta-Llama-3-8B-Instruct-GGUF model (Meta-Llama-3-8B-Instruct-Q8_0.gguf):
$ curl --request POST --url http://localhost:12345/completion --data '{"prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nwrite a tweet that Elon Musk would write to boost TSLA shares<|eot_id|><|start_header_id|>assistant<|end_header_id|>", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 2048}' | grep seed
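The same reproduction can be scripted. The sketch below is an illustration only: it assumes the server listens on localhost:12345 and that the response JSON exposes the generated text under "content" and echoes the sampling settings (including "seed") somewhere in the payload. It sends the identical request a few times and prints the reported seed and the start of each completion; on an affected build the reported seed stays at 4294967295 and the outputs differ between runs.

```python
# Hypothetical repro script: send the same /completion request several times
# and compare the reported seed and the generated text between runs.
import json
import urllib.request

URL = "http://localhost:12345/completion"
PAYLOAD = {
    "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
              "write a tweet that Elon Musk would write to boost TSLA shares"
              "<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
    "temperature": 0.7,
    "top_p": 0.8,
    "repeat_penalty": 1.1,
    "seed": 42,
    "n_predict": 2048,
}

def find_seeds(obj):
    """Recursively collect every value stored under a 'seed' key."""
    seeds = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "seed":
                seeds.append(value)
            seeds.extend(find_seeds(value))
    elif isinstance(obj, list):
        for value in obj:
            seeds.extend(find_seeds(value))
    return seeds

for i in range(3):
    req = urllib.request.Request(
        URL,
        data=json.dumps(PAYLOAD).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Expected: seed 42 echoed back and identical text on every run.
    # Observed: seed 4294967295 and different text on every run.
    print(f"run {i}: reported seed(s) = {find_seeds(body)}")
    print(f"run {i}: content starts with: {body.get('content', '')[:60]!r}")
```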
Regardless of the value passed as `seed` in the HTTP request (e.g. 42 in the example above), the seed value reported back to the HTTP client is invariably the default one (4294967295, i.e. -1 cast to unsigned int).
That the default -1 (i.e. a random, unobservable and non-repeatable seed) is actually used, while the custom client-supplied value is ignored, is corroborated by the model-generated output: it is different on every run, rather than always the same as expected (and as is attainable with the above settings when repeating this test against the non-server llama.cpp backend through its Python package, i.e. a local binding without client-server communication).
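For comparison, a minimal sketch of that local test via `llama_cpp_python` is given below. It assumes the Q8_0 GGUF file is available locally and that the installed package version accepts a `seed` parameter both in the `Llama` constructor and in `create_completion` (the parameter names follow recent versions of the package and are assumptions, not taken from this report). With the seed honored, all runs produce identical text, whereas over HTTP they do not.

```python
# Local (non-server) comparison via llama_cpp_python. Assumptions: the Q8_0
# GGUF file is present in the working directory, and the installed package
# version supports the seed parameters shown below.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct-Q8_0.gguf", seed=42)

prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "write a tweet that Elon Musk would write to boost TSLA shares"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)

outputs = []
for _ in range(3):
    result = llm.create_completion(
        prompt,
        temperature=0.7,
        top_p=0.8,
        repeat_penalty=1.1,
        max_tokens=256,
        seed=42,  # re-seed every call so each run starts from the same RNG state
    )
    outputs.append(result["choices"][0]["text"])

# When the seed is honored, all three completions are identical.
print("all runs identical:", len(set(outputs)) == 1)
```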