
Commit 147440b

docs: add reference for concurrent requests
Signed-off-by: Ettore Di Giacinto <[email protected]>
1 parent baff5ff commit 147440b

File tree

1 file changed: +25 -1 lines changed


docs/content/docs/advanced/advanced-usage.md

Lines changed: 25 additions & 1 deletion
@@ -498,4 +498,28 @@ When using the `-core` container image it is possible to prepare the python back
```bash
docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master-ffmpeg-core
```

### Concurrent requests

LocalAI supports parallel requests for the backends that support it. For instance, vLLM and llama.cpp support parallel requests, so LocalAI can serve multiple requests in parallel.

To enable parallel requests, pass the `--parallel-requests` flag or set the `LOCALAI_PARALLEL_REQUESTS` environment variable to `true`.

The environment variables that tweak parallelism are the following:
```bash
### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel LLAMA.cpp workers (defaults to 1)
# LLAMACPP_PARALLEL=1

### Enable parallel requests
# LOCALAI_PARALLEL_REQUESTS=true
```
Note that for llama.cpp you need to set `LLAMACPP_PARALLEL` to the number of parallel processes your GPU/CPU can handle. For Python-based backends (like vLLM) you can set `PYTHON_GRPC_MAX_WORKERS` to the number of parallel requests.
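For example, a container start that enables parallel requests for both backend families might look like the sketch below. The image tag and environment variable names come from the examples above; the worker counts of `4` are illustrative assumptions to size for your own GPU/CPU.

```bash
# Sketch: run LocalAI with parallel request handling enabled.
# The worker counts are placeholders; adjust them to what your hardware can handle.
docker run \
  --env LOCALAI_PARALLEL_REQUESTS=true \
  --env LLAMACPP_PARALLEL=4 \
  --env PYTHON_GRPC_MAX_WORKERS=4 \
  quay.io/go-skynet/local-ai:master-ffmpeg-core
```

With these variables set, multiple clients can send requests at the same time and the backend processes them concurrently, up to the configured number of workers.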
