How to run a llama server with fixed prompt cache without caching each of my upcoming queries? #14282

NIKHILDUGAR asked this question in Q&A
Answered by ggerganov

You must be doing something wrong. Here is how you can test it:

# start the server
make -j && ./bin/llama-server -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf --no-context-shift --ctx-size 8000 --keep -1
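# (flag notes, for reference: --ctx-size sets the context window, --no-context-shift
#  disables context shifting, and --keep -1 keeps all tokens of the initial prompt)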


# generate some random prompt:
chunk=$(printf 'hello %.0s' {1..7000})

# send first request:
curl --request POST --url http://127.0.0.1:8080/completion --header "Content-Type: application/json" \
    --data '{"prompt": "Prompt: '"${chunk}"', Question: blabla", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

# send dummy request (the question suffix below is just a placeholder; any request
# that shares the same long "${chunk}" prefix will reuse the cached part):
curl --request POST --url http://127.0.0.1:8080/completion --header "Content-Type: application/json" \
    --data '{"prompt": "Prompt: '"${chunk}"', Question: something else", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq
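
If the cache is picked up, the second request only has to process the short suffix that follows the shared prefix. A quick way to confirm this (a sketch, assuming your llama-server build returns the usual timings object in the /completion response; field names can vary between versions) is to extract the prompt-token count with jq:

# repeat the first request and look only at how many prompt tokens were evaluated;
# with a warm cache this number should drop to roughly the length of the un-cached suffix
curl -s --request POST --url http://127.0.0.1:8080/completion --header "Content-Type: application/json" \
    --data '{"prompt": "Prompt: '"${chunk}"', Question: blabla", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' \
    | jq '.timings.prompt_n'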
