-
Yes, "static" computation graphs have nice benefits, especially for the GPU support idea, so we should support it.
-
I have created #8366 which addresses this.
-
Is there a way to avoid re-creating the graph on every tick?
I was thinking about this, and from my limited understanding we could:
- change `ggml_rope()` to take `n_past` as a tensor (so we can change the value without re-creating the graph)
- add a `ggml_shift()` operation, which would push everything to the left. This would be applied to the KV-cache after computing the result: shifting everything by one makes space for the new data, which is then copied in with `ggml_cpy`. I'm not sure how this would work for `N > 1`, but the idea is that it will move data out of the context, just like `F.pad(xn, (0, 0, 1, -1))` does in PyTorch (see the sketch after this list).

I think it could simplify the codebase and maybe even be useful for GPU support? #915
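A minimal PyTorch sketch of both ideas (the shapes and variable names are made up for illustration; this is not llama.cpp/ggml code):

```python
# Illustrative only: toy shapes, no ggml involved.
import torch
import torch.nn.functional as F

n_ctx, n_embd = 4, 3

# (1) The shift: on a (n_ctx, n_embd) tensor, the pad spec (0, 0, 1, -1)
# leaves the embedding dim alone, pads one zero row at the front of the
# sequence dim, and trims one row off the end -- one row of data leaves
# the context and one free row appears for the new data (the ggml_cpy step).
xn = torch.arange(n_ctx * n_embd, dtype=torch.float32).reshape(n_ctx, n_embd)
shifted = F.pad(xn, (0, 0, 1, -1))
print(shifted)  # row 0 is zeros, rows 1..3 are the old rows 0..2

# (2) n_past as a tensor: update it in place between ticks, so any op that
# reads it as an input sees the new value without the graph being rebuilt.
n_past = torch.zeros((), dtype=torch.int64)
for _ in range(3):
    # ... ops like ggml_rope() would read n_past here ...
    n_past.add_(1)
print(n_past)  # tensor(3)
```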
Also, I'm not really sure what scratch is needed for; it looks like a fast arena allocator to reduce the impact of this periodic graph re-creation, so maybe scratch wouldn't be necessary either (see the sketch below).
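For context, here is a toy sketch of what "fast arena allocator" means in the usual bump-allocator sense; this is illustrative only, not ggml's actual scratch API:

```python
# Toy bump/arena allocator: one big buffer, allocation just advances an
# offset, and "freeing" everything is resetting the offset -- cheap when
# the same set of temporaries is created and discarded every tick.
class Arena:
    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.offset = 0

    def alloc(self, n: int) -> memoryview:
        if self.offset + n > len(self.buf):
            raise MemoryError("arena exhausted")
        view = memoryview(self.buf)[self.offset:self.offset + n]
        self.offset += n
        return view

    def reset(self) -> None:
        self.offset = 0  # reclaims everything allocated this tick at once

arena = Arena(1 << 20)
for tick in range(3):
    tmp = arena.alloc(256)  # scratch for this tick's intermediate tensors
    arena.reset()           # wholesale free before the next tick
```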
BTW: there are models which are specifically built on this "shift" operation. So there's another motivation for a new op.
https://github.com/lucidrains/token-shift-gpt