Description
What happened?
As seen here:
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
According to that README, the llama.cpp server should support --prompt-cache [FNAME], but I have not been able to get this feature to work.
I have tried workarounds such as using llama-cli to generate the prompt cache and then specifying that file when starting llama-server.
Is there a minimal reproducible snippet that shows this feature working? Is it actually implemented for the server?
Thanks in advance.
Name and Version
CLI call to generate the prompt cache:
version: 3613 (fc54ef0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
$ ./llama-cli -m "/.../Meta-Llama-3.1-8B-Instruct-Q6_K.gguf" -c 4096 --verbose-prompt -co --mlock -t $(nproc) --prompt-cache "/.../prompt_cache/prompt_cache.bin" --prompt-cache-all --file "/.../prompt_files/pirate_prompt.txt"
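For reference, a quick sanity check that the cache file exists before starting the server (same elided path as above):
$ ls -lh "/.../prompt_cache/prompt_cache.bin"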
Server call (after generating prompt_cache.bin with llama-cli). The prompt file passed here is the same as the one above minus the final user input, which is instead sent via the request (shown after the server command below).
version: 3613 (fc54ef0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
$ ./llama-server -m "/.../Meta-Llama-3.1-8B-Instruct-Q6_K.gguf" --host 0.0.0.0 --port 8080 -c 4096 --verbose-prompt -co --mlock -t $(nproc) --prompt-cache "/.../prompt_cache/prompt_cache.bin" --prompt-cache-ro --keep -1 -f "/.../prompt_files/pirate_prompt_server.txt"
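For completeness, this is roughly the request I then send to the server. The prompt text and parameters below are illustrative placeholders rather than the exact values I used; the /completion endpoint and the cache_prompt field come from the server README, and whether they interact with --prompt-cache at all is part of what I am asking.
$ curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt": "What be the finest spot to bury treasure, matey?", "n_predict": 128, "cache_prompt": true}'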
What operating system are you seeing the problem on?
Linux
Relevant log output
No response