You can use multiple `trtllm-serve` commands to launch the context and generation servers for disaggregated serving. For example, you could launch two context servers and one generation server as follows:
```
echo -e "pytorch_backend_config:\n enable_overlap_scheduler: False" > extra-llm-api-config.yml

export TRTLLM_USE_UCX_KVCACHE=1

#Context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --backend pytorch --extra_llm_api_options ./extra-llm-api-config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --backend pytorch --extra_llm_api_options ./extra-llm-api-config.yml &> log_ctx_1 &

#Generation servers
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --backend pytorch &> log_gen_0 &
```
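For reference, the `echo` command above writes the following into `extra-llm-api-config.yml`, which disables the overlap scheduler for the context servers:

```
pytorch_backend_config:
 enable_overlap_scheduler: False
```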
Once the context and generation servers are launched, you can launch the disaggregated server, which will accept requests from clients and do the orchestration between context and generation servers. The disaggregated server reads its settings from a YAML configuration file, for example:
```
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: 2
  urls:
    - "localhost:8001"
    - "localhost:8002"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8003"
```
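Once the disaggregated server is up on port 8000, clients can send requests to it as they would to any `trtllm-serve` instance. A minimal sketch, assuming the standard OpenAI-compatible `/v1/completions` endpoint is exposed (the prompt and `max_tokens` values here are illustrative):

```shell
# Send a completion request to the disaggregated server;
# it will route the request across the context and generation servers.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Explain disaggregated serving in one sentence.",
        "max_tokens": 32
    }'
```

This requires all three backend servers and the disaggregated server to be running; check the `log_ctx_*` and `log_gen_*` files if the request fails.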