-
If you’re using the C++ API, I believe you’d be able to ingest your prompt into the model on the first run, then save that state out to a file. On subsequent runs, you can load the state in from the file, then ingest the commit message, then run the prediction. If you’re batch processing commits in a single process, you can reset …
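For what it's worth, a minimal sketch of that flow with the C API from around that build (llama_save_session_file / llama_load_session_file) might look like the following; the model path, file names, thread count and error handling are all placeholders:

// Hedged sketch only: one way the save/restore flow described above could
// look with the llama.cpp C API of that era. Everything concrete here
// (paths, thread count, prompt contents) is illustrative.
#include "llama.h"
#include <string>
#include <vector>

int main() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;
    llama_context * ctx = llama_init_from_file("ggml-vic13b-q5_1.bin", params);

    // First run: tokenize and evaluate the long, constant instruction prompt...
    std::string instructions = "...";   // the backporting instructions
    std::vector<llama_token> toks(params.n_ctx);
    int n = llama_tokenize(ctx, instructions.c_str(), toks.data(), (int) toks.size(), true);
    if (n < 0) return 1;                // prompt longer than the buffer
    toks.resize(n);
    llama_eval(ctx, toks.data(), (int) toks.size(), 0, /*n_threads=*/12);

    // ...then persist the evaluated KV state together with its token list.
    llama_save_session_file(ctx, "instructions.session", toks.data(), toks.size());

    // Subsequent runs: restore that state instead of re-evaluating the prompt,
    // then only the commit message needs to be evaluated before predicting.
    size_t n_loaded = 0;
    std::vector<llama_token> cached(params.n_ctx);
    llama_load_session_file(ctx, "instructions.session",
                            cached.data(), cached.size(), &n_loaded);
    // evaluate the commit-message tokens with n_past = n_loaded, then sample...

    llama_free(ctx);
    return 0;
}

The point is that the expensive evaluation of the constant instructions happens only once; every later run only pays for the commit message and the generated answer.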
-
Thanks. I'm not using the C++ API (in fact C++ is still totally unparsable to me while C is natural). For this reason I'm launching "main" from the command line for each file. And given the speed at which the project evolves, I think it'd be wise not to depend too much on an API for now. My script looks like this; it takes a patch produced by git-format-patch as $1:
A complete prompt+patch processed this way looks like this (output of main actually):
main: build = 597 (a670464)
main: seed = 1685352146
llama.cpp: loading model from ../models/vicuna-13b/ggml-vic13b-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 12 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 256, repeat_penalty = 1.050000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 12, tfs_z = 1.000000, top_p = 1.000000, typical_p = 1.000000, temp = 0.360000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 200, n_keep = 0
This describes a single question/response exchange between a human ("Human") and the AI assistant ("Bot"), which responds accurately.
HAProxy's development cycle consists of a development branch where all changes are applied (new features and fixes), and maintenance branches, which only receive fixes for bugs that affect them. Branches are numbered in 0.1 increments. Every 6 months the development branch switches to maintenance and a new development branch is created with a new, higher version. The current development branch is 2.8-dev, and stable branches are 2.7 and below. When a fix applied to the development branch also needs to be applied to maintenance branches, it is applied in descending order (2.7 first, then 2.6, then 2.5 and so on) until reaching a branch that does not need it. This operation is called backporting. A fix is never backported beyond the branch that introduced the issue. Fixes consist of patches managed using the Git version control tool and are identified by a Git commit ID and a commit message. For this reason we indistinctly talk about backporting fixes, commits, or patches; all mean the same. It happens that some fixes depend on changes that were brought by other patches that were not in some branches and that will need to be backported as well. This information is provided in the commit messages by the patch's author in natural language. If a commit message doesn't mention backport instructions, it means that the commit does not need to be backported.
The human in charge of the backports is already an expert on HAProxy and everything related to Git, patch management, and the risks associated with backports, so he doesn't want to be told how to proceed nor to review the contents of the patch. It already takes him a lot of time to carefully read the commit message, and in order to do a better job he needs an accurate summary of the instructions found in it. Specifically, he needs to figure out whether the patch fixes a problem or not, whether it needs to be backported, if so to which branches, and whether other patches need to be backported along with it.
The indented block below is a commit message from the HAProxy development branch that describes a patch applied to that branch. The human wants to know if he needs to backport this patch, for which branches, and what possible dependencies might be mentioned in the commit message. Carefully study this commit message and its backporting instructions if any, then provide a very concise and short summary that will help the human decide to backport it or to skip it.
#id: 28a23617 BUG/MINOR: stats: fix show stats field ctx for servers
BUG/MINOR: stats: fix show stats field ctx for servers
In ("MINOR: stats: introduce stats field ctx"), we forgot
to apply the patch to servers.
This prevents "BUG/MINOR: stats: fix show stat json buffer limitation"
from working with servers dump.
We're adding the missing part related to servers dump.
This commit should be backported with the aforementioned commits.
Human: Bot, what do you recommend for this one?
Bot: Based on the information provided in the commit message, it appears that this patch fixes an issue related to the "show stats" command for servers, and it should be backported to the maintenance branches 2.7 and below. It may also require backporting of other patches that were introduced in the development branch before this patch. However, as the human in charge of backports is already familiar with HAProxy and Git, they can review the commit message and decide whether or not to proceed with the backporting. [end of text]
llama_print_timings: load time = 130749.10 ms
llama_print_timings: sample time = 391.55 ms / 109 runs ( 3.59 ms per token)
llama_print_timings: prompt eval time = 184797.73 ms / 726 tokens ( 254.54 ms per token)
llama_print_timings: eval time = 38625.52 ms / 108 runs ( 357.64 ms per token)
llama_print_timings: total time = 224442.20 ms
Do you think there are technical limitations that would make it impossible to append input after the prompt is loaded without entering interactive mode? For example, I'm thinking that I could possibly try to add …
-
I'm confused as to why you don't want to update the cache. The prompt cache does allow you to use a partial cache; you shouldn't have to delete the old cache. For example:
What the cache contained:
What cache was used:
What the cache now contains:
Then: What the cache contained:
What cache was used:
What the cache now contains:
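Conceptually, the reuse works on the longest matching prefix. Here is a hedged sketch of that idea with the C API of that era (not the actual main.cpp code; the cache path and buffer size are placeholders):

#include "llama.h"
#include <vector>

// Sketch: the session file stores the tokens it was evaluated from, so a
// longer prompt that shares that prefix only needs its tail evaluated, and
// saving again extends the cache.
void eval_with_prompt_cache(llama_context * ctx, const std::vector<llama_token> & prompt_toks) {
    // What the cache contained: the tokens (and KV state) from the last run.
    size_t n_cached = 0;
    std::vector<llama_token> cached(2048);
    llama_load_session_file(ctx, "prompt.cache", cached.data(), cached.size(), &n_cached);

    // What cache was used: the longest common prefix with the new prompt.
    size_t n_match = 0;
    while (n_match < n_cached && n_match < prompt_toks.size() &&
           cached[n_match] == prompt_toks[n_match]) {
        n_match++;
    }

    // Only the non-matching tail has to be evaluated, starting at n_past = n_match.
    if (n_match < prompt_toks.size()) {
        llama_eval(ctx, prompt_toks.data() + n_match,
                   (int) (prompt_toks.size() - n_match), (int) n_match, /*n_threads=*/12);
    }

    // What the cache now contains: the old prefix plus the newly evaluated tokens.
    llama_save_session_file(ctx, "prompt.cache", prompt_toks.data(), prompt_toks.size());
}

That's also why you shouldn't need to delete the old cache: as long as the constant instructions come first in the prompt, the cached prefix keeps matching and only the per-commit part is re-evaluated.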
-
Interesting, I haven't managed to use it like this. I felt like the new prompt was replacing the cache instead of completing it. I'll need to experiment with this. However, what I mentioned previously is that it's sad that the cache is always overwritten when exiting, because that forces us to keep a copy of it and restore it after exiting. But if it works with your method above, I guess adding a "--prompt-cache-read-only" option should not be hard to implement. I'll give this a try, thank you!
-
OK, I managed to implement the read-only prompt cache mode; that's excellent as it reduces my analysis time from 112s to 49.7s here on a given patch! I'm attaching the patch here. Can it be taken as-is, or is it absolutely mandatory to go through the pain of forking the repo just to create a PR?
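For reference, the whole idea boils down to skipping the save step when a read-only flag is set; a minimal sketch of that logic (not the attached patch itself, all names below are hypothetical):

#include "llama.h"
#include <string>
#include <vector>

// Sketch of a read-only prompt cache mode: the cache is still loaded and
// reused as usual, but the save on exit is skipped so the preloaded cache
// file is never overwritten.
void maybe_save_prompt_cache(llama_context * ctx,
                             const std::string & cache_path,
                             const std::vector<llama_token> & session_tokens,
                             bool cache_read_only) {
    if (cache_path.empty() || cache_read_only) {
        return;  // read-only mode: leave the cache file untouched
    }
    llama_save_session_file(ctx, cache_path.c_str(),
                            session_tokens.data(), session_tokens.size());
}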
-
Hello,
I've been experimenting with llama.cpp for a few weekends now with one goal in mind: to use an LLM's understanding of natural language to read commit messages and try to figure out which ones need to be backported and which ones don't. In the project (haproxy) we have all the info there, but it's a boringly repetitive task for developers, who waste a lot of precious time on this and sometimes make mistakes due to the intense repetition.
Until yesterday I never managed to get anything meaningful out of it, because my prompts led to random garbage being generated as a completion, and interactive mode is simply unusable since there's no way to end it after the generation finishes (or at least none that I found).
Yesterday I managed to write quite a long prompt which, combined with vicuna-13b, does exactly what I need. It contains a description of how the project works, the rules for backporting patches, what info the developers need, etc., plus a dump of the commit message, and I arranged it as a conversation made of a single question/response between the human and the machine. It works amazingly well, providing accurate justifications for its choices, and I would say its judgement is on par with humans' on this extremely boring task. Here's an example of what I get after some trivial grep/sed post-processing of the output:
As-is, this summary already provides tremendous value to improve the developer's experience.
However, it takes about 1 minute on a 24-core machine just to process the prompt. This is something I can live with, but I'm still thinking I'm missing something there. I found the --prompt-cache option (which, by the way, seems to suffer from a design mistake, since we'd rather need separate --prompt-cache-load and --prompt-cache-save options so as not to overwrite a cache we're trying to start up with), and found that loading from it is incredibly faster. The problem is that I cannot store the commit message in it anymore; I'd have to place only the instructions there. And if I do this, then again I cannot find a way to append the commit message and the human's question after the prompt; llama starts to generate an answer immediately. I can prevent it from doing so by switching to interactive mode or interactive-first, but then there is no way to stop it. I tried to concatenate prompts (direct or from file), etc., but to no avail. In the end I'm still finding myself generating ultra-long prompts on the fly that cannot be cached, while the initial constant part takes 1 minute.
What I'm really trying to do is to preload a partial prompt from the cache (the part that describes how the project works), then use either a complementary prompt or user input to ask the question (possibly in interactive mode), and quit after the response is provided.
Am I looking at the wrong approach? I don't know if it's even technically possible to save a prompt to be reused before user input, so that the two still deliver something coherent together. I thought that maybe having the ability to force an exit on a matching reverse-prompt could approach what I need, but I'm not sure the engine is designed to work this way.
I can continue to waste one minute (24 minutes of CPU) per patch if that's the only solution, but it makes me feel that it's a terrible waste of CPU resources. I can live with it if I'm told that there are strong technical limitations that leave no other choice, but any idea on how to address this the correct way would be nice.
Thanks!
Willy