
Commit ab89b98

server: rename legacy --ctx-size to --kv-size
Parent: 5bf2b94

File tree: 2 files changed, +14 / -3 lines

examples/server/README.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ Command line options:
 - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.
 - `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.gguf`).
 - `-a ALIAS`, `--alias ALIAS`: Set an alias for the model. The alias will be returned in API responses.
-- `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. The size may differ in other models, for example, baichuan models were build with a context of 4096.
+- `-kv N`, `--kv-size N`: Specify the total size of the KV cache. This corresponds to the total amount of tokens that can be stored across all independent sequences / slots. `llama.cpp` implements a "unified" cache strategy, the KV cache size is actually shared across all sequences. It's allowed to have sequences with more than `T` tokens as long as the sum of all tokens does not exceed `P*T`. The default is 512.
 - `-ngl N`, `--n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
 - `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.
 - `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
@@ -33,7 +33,7 @@ see https://github.com/ggerganov/llama.cpp/issues/1437
 - `--api-key`: Set an api key for request authorization. By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token. May be used multiple times to enable multiple valid keys.
 - `--api-key-file`: path to file containing api keys delimited by new lines. If set, requests must include one of the keys for access. May be used in conjunction with `--api-key`'s.
 - `--embedding`: Enable embedding extraction, Default: disabled.
-- `-np N`, `--parallel N`: Set the number of slots for process requests (default: 1)
+- `-np N`, `--parallel N`: Set the number of slots / sequences for process requests (default: 1). Each sequence can have a maximum of `T` tokens, use together with `--kv-size`.
 - `-cb`, `--cont-batching`: enable continuous batching (a.k.a dynamic batching) (default: disabled)
 - `-spf FNAME`, `--system-prompt-file FNAME` Set a file to load "a system prompt (initial prompt of all slots), this is useful for chat applications. [See more](#change-system-prompt-on-runtime)
 - `--mmproj MMPROJ_FILE`: Path to a multimodal projector file for LLaVA.
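
The new `--kv-size` and `--parallel` descriptions above tie the total KV budget to the number of slots: `P` slots of nominally `T` tokens each share one cache of `P*T` tokens. Below is a minimal C++ sketch of that arithmetic, using assumed example values (a 4096-token budget and 4 slots) that are not part of the commit:

```cpp
// Illustrative sketch only (not from the commit): how the unified KV cache
// budget set by --kv-size relates to the number of slots set by --parallel.
#include <cstdio>

int main() {
    const int kv_size    = 4096; // assumed --kv-size value: total tokens across all slots
    const int n_parallel = 4;    // assumed --parallel value: number of slots / sequences (P)

    // Nominal per-slot context T when the budget is split evenly (kv_size = P * T).
    const int n_ctx_slot = kv_size / n_parallel;

    std::printf("total KV budget : %d tokens\n", kv_size);
    std::printf("slots (P)       : %d\n", n_parallel);
    std::printf("per-slot T      : %d tokens\n", n_ctx_slot);

    // Because the cache is unified, a single slot may hold more than T tokens,
    // as long as the sum over all active slots stays within kv_size.
    return 0;
}
```

With these assumed numbers each slot gets a nominal 1024-token share, but one long request may use more as long as the combined total across all slots stays under 4096.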

examples/server/server.cpp

Lines changed: 12 additions & 1 deletion
@@ -1857,7 +1857,7 @@ static void server_print_usage(const char *argv0, const gpt_params &params,
     printf(" -v, --verbose verbose output (default: %s)\n", server_verbose ? "enabled" : "disabled");
     printf(" -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads);
     printf(" -tb N, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)\n");
-    printf(" -c N, --ctx-size N size of the prompt context (default: %d)\n", params.n_ctx);
+    printf(" -kv N, --kv-size N Specify the total size of the KV cache (default: %d)\n", params.n_ctx);
     printf(" --rope-scaling {none,linear,yarn}\n");
     printf(" RoPE frequency scaling method, defaults to linear unless specified by the model\n");
     printf(" --rope-freq-base N RoPE base frequency (default: loaded from model)\n");
@@ -2026,6 +2026,17 @@ static void server_params_parse(int argc, char **argv, server_params &sparams,
             exit(0);
         }
         else if (arg == "-c" || arg == "--ctx-size" || arg == "--ctx_size")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.n_ctx = std::stoi(argv[i]);
+            LOG_WARNING("-c,--ctx-size,--ctx_size option is deprecated, use --kv-size instead",
+                        {{"--ctx_size", params.n_ctx}});
+        }
+        else if (arg == "-kv" || arg == "--kv-size" || arg == "--kv_size")
         {
             if (++i >= argc)
             {
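
The hunk above keeps the legacy `-c`/`--ctx-size` spelling working while warning about it, and routes the new `-kv`/`--kv-size` spelling to the same field. Below is a self-contained sketch of that deprecation-alias pattern, assuming a simplified argument loop and plain `fprintf` in place of the server's `LOG_WARNING` macro:

```cpp
// Standalone sketch of the deprecation-alias pattern used in the hunk:
// the legacy flag still parses and writes the same field, but emits a warning.
#include <cstdio>
#include <cstdlib>
#include <string>

struct gpt_params_sketch { int n_ctx = 512; }; // stand-in for the real gpt_params

int main(int argc, char ** argv) {
    gpt_params_sketch params;
    for (int i = 1; i < argc; i++) {
        const std::string arg = argv[i];
        if (arg == "-c" || arg == "--ctx-size" || arg == "--ctx_size") {
            if (++i >= argc) { std::fprintf(stderr, "missing value for %s\n", arg.c_str()); return 1; }
            params.n_ctx = std::stoi(argv[i]);
            // Legacy spelling: still honored, but flagged as deprecated.
            std::fprintf(stderr, "warning: %s is deprecated, use --kv-size instead\n", arg.c_str());
        } else if (arg == "-kv" || arg == "--kv-size" || arg == "--kv_size") {
            if (++i >= argc) { std::fprintf(stderr, "missing value for %s\n", arg.c_str()); return 1; }
            params.n_ctx = std::stoi(argv[i]); // both flags set the same field
        }
    }
    std::printf("n_ctx (KV size) = %d\n", params.n_ctx);
    return 0;
}
```

Both branches write the same `n_ctx` field, so scripts that still pass the old flag keep working during the transition.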
