-
Yes, "static" computation graphs have nice benefits, especially for the GPU support idea, so we should support it.
-
I have created #8366 which addresses this.
-
Is there a way to avoid re-creating the graph on every tick?
I was thinking about this, and from my limited understanding we could:
- change `ggml_rope()` to take `n_past` as a tensor (so we can change the value without re-creating the graph)
- add a `ggml_shift()` operation, which would push everything to the left. This would be applied to the KV-cache after computing the result: shifting everything by one makes space for the new data, which is then copied in with `ggml_cpy`. I'm not sure how this would work for `N > 1`, but the idea is that it will move data out of the context, just like `F.pad(xn, (0, 0, 1, -1))` does in PyTorch (see the sketch after this list).

I think it could simplify the codebase and maybe even be useful for GPU support? #915
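A minimal PyTorch sketch of both ideas (the shapes and variable names are made up for illustration; this is not llama.cpp/ggml code):

```python
# Illustrative only: toy shapes, no ggml involved.
import torch
import torch.nn.functional as F

n_ctx, n_embd = 4, 3

# (1) The shift: on a (n_ctx, n_embd) tensor, the pad spec (0, 0, 1, -1)
# leaves the embedding dim alone, pads one zero row at the front of the
# sequence dim, and trims one row off the end -- one row of data leaves
# the context and one free row appears for the new data (the ggml_cpy step).
xn = torch.arange(n_ctx * n_embd, dtype=torch.float32).reshape(n_ctx, n_embd)
shifted = F.pad(xn, (0, 0, 1, -1))
print(shifted)  # row 0 is zeros, rows 1..3 are the old rows 0..2

# (2) n_past as a tensor: update it in place between ticks, so any op that
# reads it as an input sees the new value without the graph being rebuilt.
n_past = torch.zeros((), dtype=torch.int64)
for _ in range(3):
    # ... ops like ggml_rope() would read n_past here ...
    n_past.add_(1)
print(n_past)  # tensor(3)
```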
Also, I'm not really sure what scratch is needed for; it looks like a fast arena allocator to reduce the impact of this periodic graph re-creation, so maybe scratch wouldn't be necessary either (see the sketch below).
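For context, here is a toy sketch of what "fast arena allocator" means in the usual bump-allocator sense; this is illustrative only, not ggml's actual scratch API:

```python
# Toy bump/arena allocator: one big buffer, allocation just advances an
# offset, and "freeing" everything is resetting the offset -- cheap when
# the same set of temporaries is created and discarded every tick.
class Arena:
    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.offset = 0

    def alloc(self, n: int) -> memoryview:
        if self.offset + n > len(self.buf):
            raise MemoryError("arena exhausted")
        view = memoryview(self.buf)[self.offset:self.offset + n]
        self.offset += n
        return view

    def reset(self) -> None:
        self.offset = 0  # reclaims everything allocated this tick at once

arena = Arena(1 << 20)
for tick in range(3):
    tmp = arena.alloc(256)  # scratch for this tick's intermediate tensors
    arena.reset()           # wholesale free before the next tick
```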
BTW: there are models which are specifically built on this "shift" operation. So there's another motivation for a new op.
https://github.com/lucidrains/token-shift-gpt