Is it possible to do something similar in LLamaSharp now?
https://huggingface.co/blog/assisted-generation

Replies: 1 comment
This looks like a description of "speculative decoding". There are a couple of llama.cpp examples implementing it: https://github.com/ggml-org/llama.cpp/tree/master/examples/speculative and https://github.com/ggml-org/llama.cpp/tree/master/examples/speculative-simple. It's not currently supported at all in the high-level executors. It's probably possible to implement using the BatchedExecutor (I sketched out a prototype a while ago, but never quite got it working). It should definitely be possible to implement using the low-level/native API, since we directly expose all the llama.cpp calls.
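
To make the idea concrete, here is a minimal sketch of the greedy-verification variant of speculative decoding in C#. Everything in it is hypothetical scaffolding: `ITokenModel`, `NextToken`, and `VerifyBlock` are assumed interfaces for illustration, not LLamaSharp APIs. A real port would map them onto the BatchedExecutor (drafting and verifying as batched conversations) or onto the raw native bindings.

```csharp
using System.Collections.Generic;

// Hypothetical token-level interface standing in for a model; this is NOT
// a LLamaSharp API, it exists only to illustrate the algorithm.
interface ITokenModel
{
    // Greedily pick the next token given the full context.
    int NextToken(IReadOnlyList<int> context);

    // Evaluate context + proposed in one forward pass and return
    // proposed.Count + 1 greedy choices: choices[i] is the model's next
    // token given context followed by proposed[0..i-1].
    int[] VerifyBlock(IReadOnlyList<int> context, IReadOnlyList<int> proposed);
}

static class SpeculativeDecoder
{
    // Draft k tokens with the small model, verify them with one pass of the
    // large model, and keep the longest agreeing prefix plus one corrected
    // (or bonus) token. The output matches greedy decoding with 'target' alone.
    public static List<int> Generate(ITokenModel draft, ITokenModel target,
                                     List<int> prompt, int k, int maxTokens)
    {
        var context = new List<int>(prompt);
        int generated = 0;

        while (generated < maxTokens)
        {
            // 1. Draft: the cheap model proposes k tokens autoregressively.
            var proposed = new List<int>();
            for (int i = 0; i < k; i++)
            {
                var tmp = new List<int>(context);
                tmp.AddRange(proposed);
                proposed.Add(draft.NextToken(tmp));
            }

            // 2. Verify: a single large-model evaluation scores every
            //    proposed position at once (the batching that makes this fast).
            int[] choices = target.VerifyBlock(context, proposed);

            // 3. Accept while the two models agree, then append the target
            //    model's own token at the first disagreement (or its bonus
            //    token if every drafted token was accepted).
            int accepted = 0;
            while (accepted < proposed.Count && proposed[accepted] == choices[accepted])
                accepted++;

            context.AddRange(proposed.GetRange(0, accepted));
            context.Add(choices[accepted]);
            generated += accepted + 1;
        }

        return context;
    }
}
```

The speedup comes entirely from step 2: the target model evaluates all k drafted positions in one batched forward pass instead of k sequential ones, while the accept/reject rule keeps the output identical to plain greedy decoding with the target model.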