ggml : implement REGLU/GEGLU/SWIGLU ops #14158
Conversation
I missed that these ops change the shape of the input tensor.
I think it would be better to introduce:
enum ggml_glu_op {
    GGML_GLU_OP_REGLU,
    GGML_GLU_OP_GEGLU,
    GGML_GLU_OP_SWIGLU,
};

// similar to ggml_unary()
GGML_API struct ggml_tensor * ggml_glu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        enum ggml_glu_op      op);

// these simply call ggml_glu()
GGML_API struct ggml_tensor * ggml_reglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

GGML_API struct ggml_tensor * ggml_geglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

GGML_API struct ggml_tensor * ggml_swiglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);
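For illustration, the thin wrappers could then simply forward to ggml_glu(), as the comment above suggests (a minimal sketch, not necessarily how the PR implements them):

struct ggml_tensor * ggml_reglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_glu(ctx, a, GGML_GLU_OP_REGLU);
}

struct ggml_tensor * ggml_geglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_glu(ctx, a, GGML_GLU_OP_GEGLU);
}

struct ggml_tensor * ggml_swiglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    return ggml_glu(ctx, a, GGML_GLU_OP_SWIGLU);
}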
Hope we don't forget to implement these in the rest of the backends.
Adding @JohannesGaessler for review of the CUDA changes.
Yes, let's add the rest of the backends first before merging. At least Metal and Vulkan.
More generally, I've been thinking that it would be useful to have something like a backend-specific graph optimization step in ggml. That way you could do things like fuse tensors only if the fused tensor is supported by the backend and only if using it makes sense given the tensor shapes.
Any suggestions on who could help with that?
GGML_API struct ggml_tensor * ggml_swiglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);
Just want to note that I have been observing a variant of swiglu. It's used by ultravox, which applies the sigmoid to the second half of the vector instead of the first half.
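For reference, a rough sketch in plain C of the two orderings being described (illustrative only; helper names are hypothetical, and which half is activated follows the description above, with x holding the 2*n values of a fused up+gate row):

#include <math.h>

static float silu(float v) { return v / (1.0f + expf(-v)); }

// default ordering: activate the first half, multiply by the second half
void swiglu_row(const float * x, float * out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = silu(x[i]) * x[n + i];
    }
}

// ultravox-style variant: activate the second half instead
void swiglu_row_swapped(const float * x, float * out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = x[i] * silu(x[n + i]);
    }
}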
Oh, interesting, worth adding a parameter for, or best just handling in conversion?
https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_3-70b/blob/main/ultravox_model.py#L701-L704
I think it would be nice to have a param since the GGUFs are already on the internet. Haven't thought about permuting the FFN up tensor before, nice suggestion
Added swapped variants.
@ggerganov I didn't dare update the Metal code, so it needs to be implemented there too. :)
@0cc4m @jeffbolznv are either of you interested in a Vulkan implementation?
I can look into it tomorrow.
CUDA performance test:
Also a plot of the same data using #14169:
Huh, I didn't expect the benefit to be that much. Interesting.
Everything other than mat-mat mul is either bandwidth or small dispatch limited. Fusion is a big opportunity. We should reopen discussions about how to enable more types of fusion.
Nice! Will be interesting to see numbers on other backends as well...
Hmmm, it just occurred to me that we should be able to (now that I pass along a pointer to the gate separately) perform these ops on models with separate ffn_up/gate tensors too, by conditionally setting src[1].
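A hedged sketch of what such a split-input entry point could look like, with the gate passed as an optional second source (hypothetical name and signature, not something defined by this PR):

// hypothetical: apply a GLU op with separate up (a) and gate (b) tensors;
// when b is NULL, a is split in half as before, i.e. src[1] is only set
// conditionally
GGML_API struct ggml_tensor * ggml_glu_split(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,   // ffn_up output
        struct ggml_tensor  * b,   // ffn_gate output, or NULL
        enum ggml_glu_op      op);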
GGML_API struct ggml_tensor * ggml_geglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);
Tbh I don't even know why geglu was added in the first place. It doesn't seem to be used by any models. And to make matters worse, the PR where it was added has no useful description: #14074
So I wonder if we actually need to implement it as a kernel. The current kernel uses a tanh approximation, but in practice there can be many different approximations for the gelu op.
I've seen several models that use it, and in fact we already support a few (Gemma, DeepSeekV1, Jina-Bert and T5); it's just that the gate is split (some at conversion, because we didn't have the op).
> So I wonder if we actually need to implement it as a kernel. The current kernel uses a tanh approximation, but in practice there can be many different approximations for the gelu op.
It's pretty easy to add different GLU ops (in CUDA I even reuse the original op), so adding GEGLU_ERF if necessary shouldn't be a problem.
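For context, the two gelu formulations being discussed look roughly like this (a sketch using the commonly quoted constants, not copied from the ggml kernels):

#include <math.h>

// tanh approximation (what the current GEGLU kernel is said to use)
static float gelu_tanh(float x) {
    const float c = 0.79788456080f; // sqrt(2/pi)
    return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
}

// erf-based "exact" gelu (what a GEGLU_ERF variant would correspond to)
static float gelu_erf(float x) {
    return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
}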
I implemented Vulkan shaders for the new ops.
Interesting... I tried implementing this for SYCL but saw little improvement. When I looked at the graph logs, it wasn't using the fused kernels for llama 3.2 3B.
I am using this branch.
That's normal, llama 3.2 doesn't have a single up+gate tensor, does it?
IIRC it has SwiGLU, but I didn't check whether it uses a single up+gate or not. Edit: never mind, I need to check #14181 I guess.
Yep. :)
Please merge this PR first so that I can adjust the existing kernels for split up and gate. :) I will deduplicate the SYCL code then.
The plan is to merge #14181 into this one once @ggerganov signs off on it, then the backends can be updated, and once all tests go green, merge into master.
@qnixsynapse If you want, you can bring the other branch up to date and add your changes there.
Implement REGLU/GEGLU/SWIGLU ops to avoid unnecessary tensor duplications and to get slightly more efficient execution by combining the ops into one.
Only CPU and CUDA are implemented right now; help is needed to complete the other backends!
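As a rough illustration of the graph-building simplification this enables, assuming a model whose up and gate projections live in a single fused tensor (the tensor and variable names here are illustrative, not taken from llama.cpp):

// before: split the fused projection manually, duplicate via cont, activate, multiply
struct ggml_tensor * up_gate = ggml_mul_mat(ctx, model.ffn_up_gate, cur);
struct ggml_tensor * gate = ggml_cont(ctx, ggml_view_2d(ctx, up_gate,
        n_ff, n_tokens, up_gate->nb[1], 0));
struct ggml_tensor * up = ggml_cont(ctx, ggml_view_2d(ctx, up_gate,
        n_ff, n_tokens, up_gate->nb[1], n_ff * ggml_element_size(up_gate)));
cur = ggml_mul(ctx, ggml_silu(ctx, gate), up);

// after: a single fused op, no extra views/conts/intermediate tensors
cur = ggml_swiglu(ctx, ggml_mul_mat(ctx, model.ffn_up_gate, cur));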