
How to serve lookahead decoding Qwen 3 #14057

Open
@nqchieutb01

Description


I know how to serve an LLM with speculative decoding and a draft model via llama-server, and how to call its API:

./build/bin/llama-server --model Qwen3-14B-Q8_0.gguf --reasoning-budget 0 --model-draft Qwen3-0.6B-Q8_0.gguf --n-gpu-layers 99 -ngld 99 -fa --draft-max 16 --draft-min 0 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 
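The server then exposes an OpenAI-compatible API. A minimal call, assuming the default host and port (localhost:8080), looks something like:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'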

But how can I serve a model using lookahead decoding instead?
The command

./build/bin/llama-lookahead --model Qwen3-14B-Q8_0.gguf --n-gpu-layers 99

doesn't work because it requires an input prompt.
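For reference, it does run fine as a one-shot tool when given a prompt, e.g. something like:

./build/bin/llama-lookahead --model Qwen3-14B-Q8_0.gguf --n-gpu-layers 99 -p "Write a haiku about GPUs."

but as far as I can tell it exits after generating, so it can't be used to serve requests.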

Reference: #4207

Thanks in advance.
