-
Hello llama.cpp community,

The Environment
Hardware: NVIDIA GeForce RTX 4090

The Problem
When I run a simple vision query, the model takes an extremely long time to respond (~30-40 seconds) and generates an unwanted ... (thinking) block before the final answer, even when I try to disable it. In LM Studio, the exact same model on the same hardware responds in under 4 seconds with a clean, direct answer.

My llama-server result:
prompt eval time = 3779.89 ms / 18 tokens ( 209.99 ms per token, 4.76 tokens per second)

LM Studio result:
prompt eval time = 2714.37 ms / 1849 tokens ( 1.47 ms per token, 681.19 tokens per second)

Steps to Reproduce
1 - Start llama-server.exe with the following command:
llama-server.exe -m "path/to/Revisual-R1-final.Q4_K_M.gguf" --mmproj "path/to/Revisual-R1-final.mmproj-Q8_0.gguf" -ngl -1 --port 8081 --path "path/to/image_folder" -t 8 -b 1024
2 - Send an API request using curl (a fuller sketch of the request I send is shown after this post):
curl http://127.0.0.1:8081/v1/chat/completions

What I Have Tried (and what failed)
These were mostly attempts to patch around the real problem, because I don't know what is actually going on; some may seem pointless, but I really don't know what else to do to fix the issue.
1 - Strict sampling: adding parameters like top_k, top_p, min_p, and repeat_penalty to the API request to match the LM Studio log. The model still generates the block.

Finally, my questions:
Main issue: Why is the eval performance (ms per token) so drastically degraded for this specific model on a high-end GPU when using llama-server? Is there a known incompatibility or a backend setting I'm missing that could cause it to fall back to a non-performant kernel?
Secondary question: Is the uncontrollable "thinking" a symptom of this performance bug, or is it a separate control issue?
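For reference, this is roughly what the full request looks like. It is a minimal sketch, not my exact command: the prompt text, image data, and sampling values are placeholders, and I'm assuming llama-server's OpenAI-compatible endpoint accepts the image as a base64 data URI plus llama.cpp's extended sampling fields (top_k, min_p, repeat_penalty) alongside the standard ones:

request.json (illustrative placeholder values):
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,<BASE64_IMAGE_DATA>" } }
      ]
    }
  ],
  "max_tokens": 512,
  "top_k": 40,
  "top_p": 0.95,
  "min_p": 0.05,
  "repeat_penalty": 1.1
}

curl http://127.0.0.1:8081/v1/chat/completions -H "Content-Type: application/json" -d @request.json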
-
Please post the llama-server log, but my guess would be that you need to set -ngl to a high value like 999 (assuming it still defaults to 0, not sure what behaviour you get with your -1 value). If that doesn't work, you might not have the CUDA binaries and it's running on CPU, but the log would show that.
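Concretely, something like this: a sketch that reuses your original command with only the -ngl value changed (keep your real paths):

llama-server.exe -m "path/to/Revisual-R1-final.Q4_K_M.gguf" --mmproj "path/to/Revisual-R1-final.mmproj-Q8_0.gguf" -ngl 999 --port 8081 --path "path/to/image_folder" -t 8 -b 1024

If the CUDA build is in use, the startup log should then report layers being offloaded to the GPU instead of everything running on the CPU.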