-
If you’re using the C++ API, I believe you’d be able to ingest your prompt into the model on the first run, then save that state out to a file. On subsequent runs, you can load the state in from the file, then ingest the commit message, then run the prediction. If you’re batch processing commits in a single process, you can reset …
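For what it's worth, a minimal sketch of that flow with the C API from around that build (llama_save_session_file / llama_load_session_file) might look like the following; the model path, file names, thread count and error handling are all placeholders:

// Hedged sketch only: one way the save/restore flow described above could
// look with the llama.cpp C API of that era. Everything concrete here
// (paths, thread count, prompt contents) is illustrative.
#include "llama.h"
#include <string>
#include <vector>

int main() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;
    llama_context * ctx = llama_init_from_file("ggml-vic13b-q5_1.bin", params);

    // First run: tokenize and evaluate the long, constant instruction prompt...
    std::string instructions = "...";   // the backporting instructions
    std::vector<llama_token> toks(params.n_ctx);
    int n = llama_tokenize(ctx, instructions.c_str(), toks.data(), (int) toks.size(), true);
    if (n < 0) return 1;                // prompt longer than the buffer
    toks.resize(n);
    llama_eval(ctx, toks.data(), (int) toks.size(), 0, /*n_threads=*/12);

    // ...then persist the evaluated KV state together with its token list.
    llama_save_session_file(ctx, "instructions.session", toks.data(), toks.size());

    // Subsequent runs: restore that state instead of re-evaluating the prompt,
    // then only the commit message needs to be evaluated before predicting.
    size_t n_loaded = 0;
    std::vector<llama_token> cached(params.n_ctx);
    llama_load_session_file(ctx, "instructions.session",
                            cached.data(), cached.size(), &n_loaded);
    // evaluate the commit-message tokens with n_past = n_loaded, then sample...

    llama_free(ctx);
    return 0;
}

The point is that the expensive evaluation of the constant instructions happens only once; every later run only pays for the commit message and the generated answer.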
-
Thanks. I'm not using the C++ API (in fact C++ is still totally unparsable to me while C is natural). For this reason I'm launching "main" from the command line for each file. And given the speed at which the project evolves, I think it'd be wise not to depend too much on an API for now. My script looks like this; it takes a patch produced by git-format-patch as $1:
A complete prompt+patch processed this way looks like this (output of main actually):
main: build = 597 (a670464)
main: seed = 1685352146
llama.cpp: loading model from ../models/vicuna-13b/ggml-vic13b-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 12 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 256, repeat_penalty = 1.050000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 12, tfs_z = 1.000000, top_p = 1.000000, typical_p = 1.000000, temp = 0.360000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 200, n_keep = 0
This describes a single question/response exchange between a human ("Human") and the AI assistant ("Bot"), which responds accurately.
HAProxy's development cycle consists of a development branch where all changes are applied (new features and fixes), and maintenance branches, which only receive fixes for bugs that affect them. Branches are numbered in 0.1 increments. Every 6 months the development branch switches to maintenance and a new development branch is created with a new, higher version. The current development branch is 2.8-dev, and stable branches are 2.7 and below. When a fix applied to the development branch also needs to be applied to maintenance branches, it is applied in descending order (2.7 first, then 2.6, then 2.5 and so on) until reaching a branch that does not need it. This operation is called backporting. A fix is never backported beyond the branch that introduced the issue. Fixes consist of patches managed using the Git version control tool and are identified by a Git commit ID and a commit message. For this reason we indistinctly talk about backporting fixes, commits, or patches; all mean the same. It happens that some fixes depend on changes that were brought by other patches that were not in some branches and that will need to be backported as well. This information is provided in the commit messages by the patch's author in natural language. If a commit message doesn't mention backport instructions, it means that the commit does not need to be backported.
The human in charge of the backports is already an expert on HAProxy and everything related to Git, patch management, and the risks associated with backports, so he doesn't want to be told how to proceed nor to review the contents of the patch. It already takes him a lot of time to carefully read the commit message, and in order to do a better job he needs an accurate summary of the instructions found in it. Specifically, he needs to figure out whether the patch fixes a problem or not, whether it needs to be backported, if so to which branches, and whether other patches need to be backported along with it.
The indented block below is a commit message from the HAProxy development branch that describes a patch applied to that branch. The human wants to know if he needs to backport this patch, for which branches, and what possible dependencies might be mentioned in the commit message. Carefully study this commit message and its backporting instructions if any, then provide a very concise and short summary that will help the human decide to backport it or to skip it.
#id: 28a23617 BUG/MINOR: stats: fix show stats field ctx for servers
BUG/MINOR: stats: fix show stats field ctx for servers
In ("MINOR: stats: introduce stats field ctx"), we forgot
to apply the patch to servers.
This prevents "BUG/MINOR: stats: fix show stat json buffer limitation"
from working with servers dump.
We're adding the missing part related to servers dump.
This commit should be backported with the aforementioned commits.
Human: Bot, what do you recommend for this one?
Bot: Based on the information provided in the commit message, it appears that this patch fixes an issue related to the "show stats" command for servers, and it should be backported to the maintenance branches 2.7 and below. It may also require backporting of other patches that were introduced in the development branch before this patch. However, as the human in charge of backports is already familiar with HAProxy and Git, they can review the commit message and decide whether or not to proceed with the backporting. [end of text]
llama_print_timings: load time = 130749.10 ms
llama_print_timings: sample time = 391.55 ms / 109 runs ( 3.59 ms per token)
llama_print_timings: prompt eval time = 184797.73 ms / 726 tokens ( 254.54 ms per token)
llama_print_timings: eval time = 38625.52 ms / 108 runs ( 357.64 ms per token)
llama_print_timings: total time = 224442.20 ms
Do you think there are technical limitations that would make it impossible to append input after the prompt is loaded without entering interactive mode? For example, I'm thinking that I could possibly try to add …
-
I'm confused as to why you don't want to update the cache. The prompt cache does allow you to use a partial cache; you shouldn't have to delete the old cache. For example:
What the cache contained:
What cache was used:
What the cache now contains:
Then: What the cache contained:
What cache was used:
What the cache now contains:
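Conceptually, the reuse works on the longest matching prefix. Here is a hedged sketch of that idea with the C API of that era (not the actual main.cpp code; the cache path and buffer size are placeholders):

#include "llama.h"
#include <vector>

// Sketch: the session file stores the tokens it was evaluated from, so a
// longer prompt that shares that prefix only needs its tail evaluated, and
// saving again extends the cache.
void eval_with_prompt_cache(llama_context * ctx, const std::vector<llama_token> & prompt_toks) {
    // What the cache contained: the tokens (and KV state) from the last run.
    size_t n_cached = 0;
    std::vector<llama_token> cached(2048);
    llama_load_session_file(ctx, "prompt.cache", cached.data(), cached.size(), &n_cached);

    // What cache was used: the longest common prefix with the new prompt.
    size_t n_match = 0;
    while (n_match < n_cached && n_match < prompt_toks.size() &&
           cached[n_match] == prompt_toks[n_match]) {
        n_match++;
    }

    // Only the non-matching tail has to be evaluated, starting at n_past = n_match.
    if (n_match < prompt_toks.size()) {
        llama_eval(ctx, prompt_toks.data() + n_match,
                   (int) (prompt_toks.size() - n_match), (int) n_match, /*n_threads=*/12);
    }

    // What the cache now contains: the old prefix plus the newly evaluated tokens.
    llama_save_session_file(ctx, "prompt.cache", prompt_toks.data(), prompt_toks.size());
}

That's also why you shouldn't need to delete the old cache: as long as the constant instructions come first in the prompt, the cached prefix keeps matching and only the per-commit part is re-evaluated.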
-
Interesting, I haven't managed to use it like this. I felt like the new prompt was replacing the cache instead of completing it. I'll need to experiment with this. However, what I mentioned previously is that it's sad that the cache is always overwritten when exiting, because that forces us to keep a copy of it and restore it after exiting. But if it works with your method above, I guess adding a "--prompt-cache-read-only" option should not be hard to implement. I'll give this a try, thank you!
-
OK, I managed to implement the read-only prompt cache mode; that's excellent as it reduces my analysis time from 112s to 49.7s here on a given patch! I'm attaching the patch here. Can it be taken as-is, or is it absolutely mandatory to go through the pain of forking the repo just to create a PR?
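For reference, the whole idea boils down to skipping the save step when a read-only flag is set; a minimal sketch of that logic (not the attached patch itself, all names below are hypothetical):

#include "llama.h"
#include <string>
#include <vector>

// Sketch of a read-only prompt cache mode: the cache is still loaded and
// reused as usual, but the save on exit is skipped so the preloaded cache
// file is never overwritten.
void maybe_save_prompt_cache(llama_context * ctx,
                             const std::string & cache_path,
                             const std::vector<llama_token> & session_tokens,
                             bool cache_read_only) {
    if (cache_path.empty() || cache_read_only) {
        return;  // read-only mode: leave the cache file untouched
    }
    llama_save_session_file(ctx, cache_path.c_str(),
                            session_tokens.data(), session_tokens.size());
}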
-
Hello,
I've been experimenting with llama.cpp for a few weekends now with one goal in mind: to use an LLM's understanding of natural language to read commit messages and try to figure out which ones need to be backported and which ones don't. In the project (haproxy) we have all the info there, but it's a boringly repetitive task for developers, who waste a lot of precious time on this and sometimes make mistakes due to the intense repetition.
Until yesterday I never managed to get anything meaningful out of it, because my prompts led to random garbage being generated as a completion, and interactive mode is simply unusable since there's no way to end it after the generation finishes (or at least none that I found).
Yesterday I managed to write quite a long prompt which, combined with vicuna-13b, does exactly what I need. It contains a description of how the project works, the rules for backporting patches, what info the developers need, etc., plus a dump of the commit message, and I arranged it as a conversation made of a single question/response between the human and the machine. It works amazingly well, providing accurate justifications for its choices, and I would say its judgement is on par with humans' on this extremely boring task. Here's an example of what I get after some trivial grep/sed post-processing of the output:
As-is, this summary already provides tremendous value to improve the developer's experience.
However, it takes about 1 minute on a 24-core machine just to process the prompt. This is something I can live with, but I'm still thinking I'm missing something there. I found the --prompt-cache option (which, by the way, seems to suffer from a design mistake, since we'd rather need separate --prompt-cache-load and --prompt-cache-save options so as not to overwrite a cache we're trying to start up with), and found that loading from it is incredibly faster. The problem is that I cannot store the commit message in it anymore; I'd have to place only the instructions there. And if I do this, then again I cannot find a way to append the commit message and the human's question after the prompt; llama starts to generate an answer immediately. I can prevent it from doing so by switching to interactive mode or interactive-first, but then there is no way to stop it. I tried to concatenate prompts (direct or from file), etc., but to no avail. In the end I'm still finding myself generating ultra-long prompts on the fly that cannot be cached, while the initial constant part takes 1 minute.
What I'm really trying to do is to preload a partial prompt from the cache (the part that describes how the project works), then use either a complementary prompt or user input to ask the question (possibly in interactive mode), and quit after the response is provided.
Am I looking at the wrong approach? I don't know if it's even technically possible to save a prompt to be reused before user input, so that the two still deliver something coherent together. I thought that maybe having the ability to force an exit on a matching reverse-prompt could approach what I need, but I'm not sure the engine is designed to work this way.
I can continue to waste one minute (24 minutes of CPU) per patch if that's the only solution, but it makes me feel that it's a terrible waste of CPU resources. I can live with it if I'm told that there are strong technical limitations that leave no other choice, but any idea on how to address this the correct way would be nice.
Thanks!
Willy