Description
This was triggered by trying to do structured generation with DeepSeek, where we want to let the model generate anything (unstructured) between the think tags and start structured generation only after `</think>` has been generated. It would thus make sense to have an `anything` construct that corresponds to `.*`, the unstructured case. Intuitively, we could add an `until` method so that users could write the following for a classification task:
from outlines.types import anything, either
model = ...
outline = "<think>" + anything.until("</think>") + "</think>" + either("yes", "no")
result = model.generate("Are you a reasoning model?", outline)
The implementation is trickier than it looks. `anything.until("</think>")` is best expressed with a negative lookahead:
((?!<\/think>).)*
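but the regex engine that `outlines-core` uses does not support lookaheads. However, it can be rewritten without lookaheads as the following regular expression, which is equivalent apart from edge cases such as a `<` immediately preceding the stop sequence (as in `<</think>`):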
([^<]|<[^\/]|<\/[^t]|<\/t[^h]|<\/th[^i]|<\/thi[^n]|<\/thin[^k]|<\/think[^>])*
The idea would thus be to have `anything.until` generate a `Regex` node with an "expanded lookahead". Computing the expansion shouldn't be too hard.
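To make the expansion concrete, here is a minimal sketch in plain Python of how it could be computed for a literal stop string. The function name `expand_until` and the helper `_negated_class` are hypothetical and not part of outlines; the sketch only builds the pattern string and checks it with the standard `re` module. It reproduces the naive expansion shown above (without the `\/` escapes, which Python does not need), so it inherits the same limitation: a `<` that restarts a match is not handled, and `<</think>` is accepted.

```python
import re


def _negated_class(ch: str) -> str:
    """Character class matching any single character except `ch` (hypothetical helper)."""
    # Inside [...] only backslash, ']', '^' and '-' need escaping.
    return "[^" + ("\\" + ch if ch in "\\]^-" else ch) + "]"


def expand_until(stop: str) -> str:
    """Naive lookahead-free expansion of ((?!stop).)* for a literal stop string.

    Assumption: this mirrors the hand-written expansion in this issue and
    does not handle overlaps (a mismatching character that restarts the
    stop string).
    """
    if not stop:
        raise ValueError("stop string must be non-empty")
    # One branch per position i: the first i characters of the stop string
    # matched, followed by any character other than stop[i].
    branches = [
        re.escape(stop[:i]) + _negated_class(stop[i])
        for i in range(len(stop))
    ]
    return "(" + "|".join(branches) + ")*"


pattern = expand_until("</think>")
print(pattern)
```

A correct general solution would build the automaton for the stop string and take its complement; the sketch is only meant to show that the per-prefix expansion itself is mechanical.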
Note: we could also decide that `until` keeps the `</think>` token. We should discuss.
Regexes
It is tempting to also implement `until_regex` this way. It is straightforward for simple patterns like `[0-9]`:
([^0-9])*
However, it can get complicated quite quickly. For instance, `anything.until_regex("(abc|def)")` would need to be translated to:
([^ad]|a[^b]|ab[^c]|d[^e]|de[^f])*
so we will not implement this in the first iteration.
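For reference, the alternation case above can be sketched with a prefix trie over the literal alternatives. This is an illustration only: `expand_until_any` is a hypothetical name, it handles a list of literal strings rather than arbitrary regexes, and like the hand-written expansion it ignores overlaps between a mismatching character and the start of another alternative.

```python
import re

_END = None  # trie key marking the end of a complete alternative


def _negated_class(chars) -> str:
    """Character class matching any single character not in `chars`."""
    escaped = "".join("\\" + c if c in "\\]^-" else c for c in chars)
    return "[^" + escaped + "]"


def expand_until_any(alternatives) -> str:
    """Naive lookahead-free expansion of ((?!(a|b|...)).)* for literal alternatives."""
    trie = {}
    for word in alternatives:
        if not word:
            raise ValueError("alternatives must be non-empty strings")
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True

    branches = []

    def walk(node, prefix):
        if _END in node:
            return  # a full stop string was reached: nothing may extend it
        # Prefix matched so far, followed by any character that does not
        # continue one of the alternatives.
        branches.append(re.escape(prefix) + _negated_class(list(node)))
        for ch, child in node.items():
            walk(child, prefix + ch)

    walk(trie, "")
    return "(" + "|".join(branches) + ")*"


pattern_any = expand_until_any(["abc", "def"])
print(pattern_any)
```

Even for literal alternatives the number of branches grows with the total length of the alternatives, and general regexes (quantifiers, character classes, nesting) would require complementing an automaton, which supports deferring `until_regex`.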