Issue with embedding generation #14454
-
I wrote a very simple example using a model that I am sure is suitable for generating embeddings. From the source code, I see that the warning is emitted when `memory` is not set:

```cpp
int llama_context::decode(const llama_batch & batch_inp) {
    GGML_ASSERT((!batch_inp.token && batch_inp.embd) || (batch_inp.token && !batch_inp.embd)); // NOLINT

    if (!memory) {
        LLAMA_LOG_DEBUG("%s: cannot decode batches with this context (calling encode() instead)\n", __func__);
        return encode(batch_inp);
    }
    // ...
}
```

How can I ensure that `memory` is initialized correctly?
-
Most embedding models don't have a memory (a.k.a. a KV cache). This is not an error - just a warning telling you that you can simply use `llama_encode()` instead of `llama_decode()`.
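
For reference, here is a minimal sketch of getting embeddings through `llama_encode()` with a memory-less embedding model. The model path is a placeholder, and some function names (e.g. `llama_model_load_from_file`, `llama_init_from_model`) come from recent versions of the llama.cpp C API, so they may differ on older versions:

```cpp
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    llama_backend_init();

    // Placeholder path - substitute your own embedding model.
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("embedding-model.gguf", mparams);

    // Request embedding output and a pooled, per-sequence embedding.
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;
    llama_context * ctx = llama_init_from_model(model, cparams);

    // Tokenize the input text.
    const char * text = "hello world";
    const llama_vocab * vocab = llama_model_get_vocab(model);
    std::vector<llama_token> tokens(llama_n_ctx(ctx));
    const int n_tokens = llama_tokenize(vocab, text, (int) strlen(text),
                                        tokens.data(), (int) tokens.size(),
                                        /*add_special=*/true, /*parse_special=*/false);

    // Build a single-sequence batch covering the whole input.
    llama_batch batch = llama_batch_init(n_tokens, 0, 1);
    for (int i = 0; i < n_tokens; ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = true;
    }
    batch.n_tokens = n_tokens;

    // The model has no KV cache, so call llama_encode() directly
    // instead of llama_decode().
    if (llama_encode(ctx, batch) != 0) {
        fprintf(stderr, "llama_encode() failed\n");
        return 1;
    }

    // Retrieve the pooled embedding for sequence 0.
    const int     n_embd = llama_model_n_embd(model);
    const float * embd   = llama_get_embeddings_seq(ctx, 0);
    printf("embedding[0] = %f (n_embd = %d)\n", embd[0], n_embd);

    llama_batch_free(batch);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_fini();
    return 0;
}
```

Note that calling `llama_decode()` would also work here: as the snippet from `llama_context::decode()` above shows, when there is no memory the library redirects the batch to `encode()` and only logs a debug message.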
-
@ggerganov can I use […]