-
Hello llama.cpp community,

The Environment
Hardware: NVIDIA GeForce RTX 4090

The Problem
When I run a simple vision query, the model takes an extremely long time to respond (~30-40 seconds) and generates an unwanted ... (thinking) block before the final answer, even when I try to disable it. In LM Studio, the exact same model on the same hardware responds in under 4 seconds with a clean, direct answer.

My llama-server result:
prompt eval time = 3779.89 ms / 18 tokens ( 209.99 ms per token, 4.76 tokens per second)

LM Studio result:
prompt eval time = 2714.37 ms / 1849 tokens ( 1.47 ms per token, 681.19 tokens per second)

Steps to Reproduce
1 - Start llama-server.exe with the following command:
llama-server.exe -m "path/to/Revisual-R1-final.Q4_K_M.gguf" --mmproj "path/to/Revisual-R1-final.mmproj-Q8_0.gguf" -ngl -1 --port 8081 --path "path/to/image_folder" -t 8 -b 1024
2 - Send an API request using curl (a fuller sketch of the request I send is shown after this post):
curl http://127.0.0.1:8081/v1/chat/completions

What I Have Tried (and what failed)
These were mostly attempts to patch around the real problem, because I don't know what is actually going on; some may seem pointless, but I really don't know what else to do to fix the issue.
1 - Strict sampling: adding parameters like top_k, top_p, min_p, and repeat_penalty to the API request to match the LM Studio log. The model still generates the block.

Finally, my questions:
Main issue: Why is the eval performance (ms per token) so drastically degraded for this specific model on a high-end GPU when using llama-server? Is there a known incompatibility or a backend setting I'm missing that could cause it to fall back to a non-performant kernel?
Secondary question: Is the uncontrollable "thinking" a symptom of this performance bug, or is it a separate control issue?
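For reference, this is roughly what the full request looks like. It is a minimal sketch, not my exact command: the prompt text, image data, and sampling values are placeholders, and I'm assuming llama-server's OpenAI-compatible endpoint accepts the image as a base64 data URI plus llama.cpp's extended sampling fields (top_k, min_p, repeat_penalty) alongside the standard ones:

request.json (illustrative placeholder values):
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,<BASE64_IMAGE_DATA>" } }
      ]
    }
  ],
  "max_tokens": 512,
  "top_k": 40,
  "top_p": 0.95,
  "min_p": 0.05,
  "repeat_penalty": 1.1
}

curl http://127.0.0.1:8081/v1/chat/completions -H "Content-Type: application/json" -d @request.json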
-
Please post the llama-server log, but my guess would be that you need to set -ngl to a high value like 999 (assuming it still defaults to 0, not sure what behaviour you get with your -1 value). If that doesn't work, you might not have the CUDA binaries and it's running on CPU, but the log would show that.
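Concretely, something like this: a sketch that reuses your original command with only the -ngl value changed (keep your real paths):

llama-server.exe -m "path/to/Revisual-R1-final.Q4_K_M.gguf" --mmproj "path/to/Revisual-R1-final.mmproj-Q8_0.gguf" -ngl 999 --port 8081 --path "path/to/image_folder" -t 8 -b 1024

If the CUDA build is in use, the startup log should then report layers being offloaded to the GPU instead of everything running on the CPU.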