How to run a llama server with fixed prompt cache without caching each of my upcoming queries? #14282

NIKHILDUGAR asked this question in Q&A
Answered by ggerganov

You must be doing something wrong. Here is how you can test it:

# start the server
make -j && ./bin/llama-server -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf --no-context-shift --ctx-size 8000 --keep -1
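# (flag notes, for reference: --ctx-size sets the context window, --no-context-shift
#  disables context shifting, and --keep -1 keeps all tokens of the initial prompt)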


# generate some random prompt:
chunk=$(printf 'hello %.0s' {1..7000})

# send first request:
curl --request POST --url http://127.0.0.1:8080/completion --header "Content-Type: application/json" \
    --data '{"prompt": "Prompt: '"${chunk}"', Question: blabla", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

# send dummy request (the question suffix below is just a placeholder; any request
# that shares the same long "${chunk}" prefix will reuse the cached part):
curl --request POST --url http://127.0.0.1:8080/completion --header "Content-Type: application/json" \
    --data '{"prompt": "Prompt: '"${chunk}"', Question: something else", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq
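
If the cache is picked up, the second request only has to process the short suffix that follows the shared prefix. A quick way to confirm this (a sketch, assuming your llama-server build returns the usual timings object in the /completion response; field names can vary between versions) is to extract the prompt-token count with jq:

# repeat the first request and look only at how many prompt tokens were evaluated;
# with a warm cache this number should drop to roughly the length of the un-cached suffix
curl -s --request POST --url http://127.0.0.1:8080/completion --header "Content-Type: application/json" \
    --data '{"prompt": "Prompt: '"${chunk}"', Question: blabla", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' \
    | jq '.timings.prompt_n'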
