
(Batched) Matrix Multiplication and Fused Operations #14394

Answered by slaren
taronaeo asked this question in Q&A
  1. Batched matrix multiplication is supported by using dimensions 3 and 4 of the operand tensors; in practice this is mostly used for grouped-query attention (GQA).
  2. Operator fusion is not currently done by any of the backends, although it is being worked on in #14366.
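As an illustration of the batched-matmul semantics described in point 1 (not the ggml API itself), here is a hedged NumPy sketch. The shapes and the head counts are made-up examples; the last two axes play the role of the matrix dimensions, and the extra leading axes act as the batch dimensions, analogous to how ggml uses dims 3 and 4 of its operand tensors. The GQA part mimics sharing a smaller set of key/value heads across groups of query heads.

```python
import numpy as np

# Hypothetical shapes: the last two axes are the matrix dimensions,
# the leading axes are batch dimensions (one matmul per batch entry).
n_batch, n_head, m, k, n = 2, 4, 8, 16, 8

a = np.random.rand(n_batch, n_head, m, k)
b = np.random.rand(n_batch, n_head, k, n)

# One matrix multiplication per (batch, head) pair in a single call.
c = np.matmul(a, b)
assert c.shape == (n_batch, n_head, m, n)

# GQA-style sharing: fewer key/value heads than query heads, with each
# KV head reused by a whole group of query heads.
n_kv_head = 2  # assumed group size: n_head // n_kv_head queries per KV head
kv = np.random.rand(n_batch, n_kv_head, k, n)
kv_expanded = np.repeat(kv, n_head // n_kv_head, axis=1)
c_gqa = np.matmul(a, kv_expanded)
assert c_gqa.shape == (n_batch, n_head, m, n)
```

The point of the batched form is that a single call covers every (batch, head) pair, rather than looping over 2-D matmuls one at a time.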

What kind of setup and teardown are you talking about? If you have to copy the whole tensor data on each operation, that may explain the performance you are seeing.

Answer selected by taronaeo