I know how to deploy and call an API for an LLM with speculative decoding and a draft model via llama-server:
./build/bin/llama-server --model Qwen3-14B-Q8_0.gguf --reasoning-budget 0 --model-draft Qwen3-0.6B-Q8_0.gguf --n-gpu-layers 99 -ngld 99 -fa --draft-max 16 --draft-min 0 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
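For context, I then call the served model through llama-server's OpenAI-compatible chat endpoint (assuming the default host and port, localhost:8080):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'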
But how can I serve a model using lookahead decoding instead?
The command
./build/bin/llama-lookahead --model Qwen3-14B-Q8_0.gguf --n-gpu-layers 99
doesn't work for this, because llama-lookahead requires an input prompt and runs as a one-shot example rather than exposing a server.
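For reference, I can run it as a one-off generation by passing a prompt. I'm assuming the common -p and -n flags apply here as they do in the other llama.cpp examples:

# one-shot lookahead decoding demo; generates and then exits
./build/bin/llama-lookahead --model Qwen3-14B-Q8_0.gguf --n-gpu-layers 99 -p "Write a haiku about llamas." -n 128

That works, but it exits after a single generation instead of serving an API.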
Reference: #4207
Thanks in advance.