ggml : implement REGLU/GEGLU/SWIGLU ops #14158

CISC · 2025-06-12T23:24:57Z

Implement REGLU/GEGLU/SWIGLU ops to avoid unnecessary tensor duplications and to get slightly more efficient execution by combining several ops into one.

Only CPU and CUDA right now, help needed to complete other backends!

ggerganov

I missed that these ops change the shape of the input tensor.

I think it would be better to introduce:

enum ggml_glu_op {
    GGML_GLU_OP_REGLU,
    GGML_GLU_OP_GEGLU,
    GGML_GLU_OP_SWIGLU,
};

// similar to ggml_unary()
GGML_API struct ggml_tensor * ggml_glu(
        struct ggml_context * ctx,
         struct ggml_tensor * a,
           enum ggml_glu_op   op);

// these simply call ggml_glu()
GGML_API struct ggml_tensor * ggml_reglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

GGML_API struct ggml_tensor * ggml_geglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

GGML_API struct ggml_tensor * ggml_swiglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

ggml/include/ggml.h

ggerganov

Hope we don't forget to implement these in the rest of the backends.

Adding @JohannesGaessler for review of the CUDA changes.

ggerganov · 2025-06-13T08:31:31Z

Only CPU and CUDA right now, help needed to complete other backends!

Yes, let's add the rest of the backends first before merging. At least Metal and Vulkan.

ggml/src/ggml-cuda/unary.cu

JohannesGaessler · 2025-06-13T09:14:48Z

More generally, I've been thinking that it would be useful to have something like a backend-specific graph optimization step in ggml. That way you could do things like fuse tensors only if the fused tensor is supported by the backend and only if using it makes sense given the tensor shapes.

CISC · 2025-06-13T10:29:17Z

Only CPU and CUDA right now, help needed to complete other backends!

Yes, let's add the rest of the backends first before merging. At least Metal and Vulkan.

Any suggestions on who could help with that?

ggml-ci

ngxson · 2025-06-13T13:14:56Z

ggml/include/ggml.h

+            struct ggml_context * ctx,
+            struct ggml_tensor  * a);
+
+    GGML_API struct ggml_tensor * ggml_swiglu(


Just want to note that I have observed a variant of swiglu. It's used by ultravox, which applies the sigmoid to the second half of the vector instead of the first half.

Oh, interesting, worth adding a parameter for, or best just handling in conversion?
https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_3-70b/blob/main/ultravox_model.py#L701-L704

I think it would be nice to have a param since the GGUFs are already on the internet. Haven't thought about permuting the FFN up tensor before, nice suggestion

Added swapped variants.

@ggerganov I didn't dare update the Metal code, so it needs to be implemented there too. :)

JohannesGaessler · 2025-06-13T13:29:00Z

@0cc4m @jeffbolznv are either of you interested in a Vulkan implementation?

0cc4m · 2025-06-13T13:42:21Z

I can look into it tomorrow.

ggml-ci

ggml/src/ggml-cuda/unary.cu

JohannesGaessler · 2025-06-13T17:47:38Z

CUDA performance test:

GPU	Model	Microbatch size	Test	t/s master	t/s cisc/unary-reglu-geglu-swiglu	Speedup
RTX 4090	chatglm 9B Q4_0	1	pp512	157.48	160.49	1.02
RTX 4090	chatglm 9B Q4_0	2	pp512	268.16	276.78	1.03
RTX 4090	chatglm 9B Q4_0	4	pp512	517.41	535.36	1.03
RTX 4090	chatglm 9B Q4_0	8	pp512	826.69	855.46	1.03
RTX 4090	chatglm 9B Q4_0	16	pp512	1407.13	1453.62	1.03
RTX 4090	chatglm 9B Q4_0	32	pp512	2545.45	2664.80	1.05
RTX 4090	chatglm 9B Q4_0	64	pp512	4414.61	4704.57	1.07
RTX 4090	chatglm 9B Q4_0	128	pp512	6467.60	7028.01	1.09
RTX 4090	chatglm 9B Q4_0	256	pp512	8670.62	9451.16	1.09
RTX 4090	chatglm 9B Q4_0	512	pp512	9842.99	10832.14	1.10

Also a plot of the same data using #14169 :

ggerganov · 2025-06-13T19:11:08Z

CUDA performance test:

Huh, I didn't expect the benefit to be that much. Interesting.

jeffbolznv · 2025-06-13T20:27:21Z

Huh, I didn't expect the benefit to be that much. Interesting.

Everything other than mat-mat mul is either bandwidth or small dispatch limited. Fusion is a big opportunity. We should reopen discussions about how to enable more types of fusion.

CISC · 2025-06-13T20:54:58Z

CUDA performance test:

Nice! Will be interesting to see numbers on other backends as well...

ggml-ci

CISC · 2025-06-13T21:38:33Z

Hmmm, it just occurred to me that we should be able to (now that I pass along a pointer to the gate separately) perform these ops on models with separate ffn_up/gate tensors too by conditionally setting src[1].

ngxson · 2025-06-13T21:59:11Z

ggml/include/ggml.h

+            struct ggml_context * ctx,
+            struct ggml_tensor  * a);
+
+    GGML_API struct ggml_tensor * ggml_geglu(


Tbh I don't even know why geglu was added in the first place. It doesn't seem to be used by any models. And to make matters worse, the PR where it was added has no useful description: #14074

So I wonder if we actually need to implement it as a kernel. The current kernel uses the tanh approximation, but in practice, there can be many different approximations for the gelu op.

Nvm, see: https://github.com/ggml-org/llama.cpp/pull/14014/files#r2146203459

I've seen several, and in fact we already support a few (Gemma, DeepSeekV1, Jina-Bert and T5), it's just that the gate is split (some at conversion because we didn't have the op).

So I wonder if we actually need to implement it as a kernel. The current kernel uses the tanh approximation, but in practice, there can be many different approximations for the gelu op.

It's pretty easy adding different GLU ops (and in CUDA I even reuse the original op), adding GEGLU_ERF if necessary shouldn't be a problem.

0cc4m · 2025-06-14T10:14:36Z

I implemented Vulkan shaders for the new ops.

qnixsynapse · 2025-06-14T13:43:41Z

Interesting. I tried implementing this for SYCL but saw little improvement. When I looked at the graph logs, it wasn't using the fused kernels for llama 3.2 3B.

[SYCL][OP] call ggml_sycl_mul_mat: dst='ffn_gate-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]	src0='blk.19.ffn_gate.weight':type=f16;ne=[3072, 8192, 1, 1];nb=[2, 6144, 50331648, 50331648]	src1='ffn_norm-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288]
[SYCL][OP] call ggml_sycl_op_dequantize_mul_mat_vec/to_fp16_sycl: dst='ffn_gate-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]	src0='blk.19.ffn_gate.weight':type=f16;ne=[3072, 8192, 1, 1];nb=[2, 6144, 50331648, 50331648]	src1='ffn_norm-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288] : converting src1 to fp16
[SYCL][OP] call ggml_sycl_op_dequantize_mul_mat_vec/to_fp16_sycl done
[SYCL][OP] call ggml_sycl_mul_mat done
[SYCL][OP] call ggml_sycl_silu: dst='ffn_silu-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]	src0='ffn_gate-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]
[SYCL][OP] call ggml_sycl_silu done
[SYCL][OP] call ggml_sycl_mul_mat: dst='ffn_up-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]	src0='blk.19.ffn_up.weight':type=f16;ne=[3072, 8192, 1, 1];nb=[2, 6144, 50331648, 50331648]	src1='ffn_norm-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288]
[SYCL][OP] call ggml_sycl_op_dequantize_mul_mat_vec/to_fp16_sycl: dst='ffn_up-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]	src0='blk.19.ffn_up.weight':type=f16;ne=[3072, 8192, 1, 1];nb=[2, 6144, 50331648, 50331648]	src1='ffn_norm-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288] : converting src1 to fp16
[SYCL][OP] call ggml_sycl_op_dequantize_mul_mat_vec/to_fp16_sycl done
[SYCL][OP] call ggml_sycl_mul_mat done
[SYCL][OP] call ggml_sycl_mul: dst='ffn_gate_par-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]	src0='ffn_silu-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]	src1='ffn_up-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]
[SYCL][OP] call ggml_sycl_mul done
[SYCL][OP] call ggml_sycl_mul_mat: dst='ffn_out-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288]	src0='blk.19.ffn_down.weight':type=f16;ne=[8192, 3072, 1, 1];nb=[2, 16384, 50331648, 50331648]	src1='ffn_gate_par-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768]
[SYCL][OP] call ggml_sycl_op_dequantize_mul_mat_vec/to_fp16_sycl: dst='ffn_out-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288]	src0='blk.19.ffn_down.weight':type=f16;ne=[8192, 3072, 1, 1];nb=[2, 16384, 50331648, 50331648]	src1='ffn_gate_par-19':type=f32;ne=[8192, 1, 1, 1];nb=[4, 32768, 32768, 32768] : converting src1 to fp16
[SYCL][OP] call ggml_sycl_op_dequantize_mul_mat_vec/to_fp16_sycl done
[SYCL][OP] call ggml_sycl_mul_mat done
[SYCL][OP] call ggml_sycl_add: dst='l_out-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288]	src0='ffn_out-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288]	src1='ffn_inp-19':type=f32;ne=[3072, 1, 1, 1];nb=[4, 12288, 12288, 12288]
[SYCL][OP] call ggml_sycl_add done

I am using the code from this branch.

CISC · 2025-06-14T13:50:21Z

Interesting. I tried implementing this for SYCL but saw little improvement. When I looked at the graph logs, it wasn't using the fused kernels for llama 3.2 3B.

That's normal, llama 3.2 doesn't have a single up+gate does it?

qnixsynapse · 2025-06-14T13:55:44Z

IIRC, it has SWIGLU. But I didn't check if it is using a single up+gate or not.

Edit: Never mind, I need to check #14181, I guess.

CISC · 2025-06-14T14:12:01Z

Edit: Never mind, I need to check #14181, I guess.

Yep. :)

qnixsynapse · 2025-06-14T14:34:43Z

Yep. :)

Please merge this PR first so that I can adjust the existing kernels for split up and gate. :)

I will deduplicate the SYCL code then.

CISC · 2025-06-14T14:42:45Z

Please merge this PR first so that I can adjust the existing kernels for split up and gate. :)

The plan is to merge #14181 into this one once @ggerganov signs off on it, then backends can be updated, and once all tests go green, merge into master.

CISC · 2025-06-14T15:03:23Z

@qnixsynapse If you want you can bring the other branch up-to-date and add your changes there.

CISC added 6 commits June 12, 2025 17:39

implement unary REGLU/GEGLU/SWIGLU cpu ops

c717198

relax constraints

92943e7

duplicate shape of source

319c6cb

fix ggml_vec_geglu_f16

6fe7e07

special case gated ops

7e075be

implement unary REGLU/GEGLU/SWIGLU cuda ops

1acd121

CISC added the "help wanted" label Jun 12, 2025

CISC requested a review from ggerganov June 12, 2025 23:25

github-actions bot added the "testing", "Nvidia GPU" and "ggml" labels Jun 12, 2025

ggerganov reviewed Jun 13, 2025

ggml/include/ggml.h (outdated, resolved)

CISC added 2 commits June 13, 2025 09:00

tighten constraints again

5c58196

refactor into GGML_GLU_OP

f4be71e

CISC changed the title from "ggml : implement unary REGLU/GEGLU/SWIGLU ops" to "ggml : implement REGLU/GEGLU/SWIGLU ops" Jun 13, 2025

CISC requested a review from ggerganov June 13, 2025 08:23

ggerganov approved these changes Jun 13, 2025

ggerganov requested a review from JohannesGaessler June 13, 2025 08:30

JohannesGaessler reviewed Jun 13, 2025

ggml/src/ggml-cuda/unary.cu (outdated, resolved)

ggml/src/ggml-cuda/unary.cu (resolved)

ggml/src/ggml-cuda/unary.cu (outdated, resolved)

metal : add glu kernels

564861d

ggml-ci

github-actions bot added the "Apple Metal" label Jun 13, 2025

ngxson reviewed Jun 13, 2025

CISC added 2 commits June 13, 2025 16:10

add CUDA_GLU_BLOCK_SIZE [no ci]

4b7d4dd

more constraints and use 64bit ints

d1d3f4f

ggml-ci

JohannesGaessler reviewed Jun 13, 2025

ggml/src/ggml-cuda/unary.cu (outdated, resolved)

64bit multiplication [no ci]

e3d2b20

implement swapped variants (cpu/cuda)

39eba35

update comment [no ci]

98a5019

ggml-ci

ngxson reviewed Jun 13, 2025

CISC mentioned this pull request Jun 14, 2025

llama: Attempt to add ModernBert #14014

Open

Vulkan: Add GLU ops and shaders

8dc1d9f

github-actions bot added the "Vulkan" label Jun 14, 2025

CISC mentioned this pull request Jun 14, 2025

ggml : implement GLU for split up/gate #14181

Open

SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate

95e4be0

github-actions bot added the "SYCL" label Jun 14, 2025
