1 file changed: +11 -2 lines changed
@@ -9,11 +9,18 @@ You can use multiple `trtllm-serve` commands to launch the context and generatio
for disaggregated serving. For example, you could launch two context servers and one generation server as follows:

```
+ echo -e "pytorch_backend_config:\n enable_overlap_scheduler: true" > extra-llm-api-config.yml
+
export TRTLLM_USE_UCX_KVCACHE=1
#Context servers
- trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --backend pytorch &> log_ctx_0 &
- trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --backend pytorch &> log_ctx_1 &
+ export CUDA_VISIBLE_DEVICES=0
+ trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --backend pytorch --extra_llm_api_options ./extra-llm-api-config.yml &> log_ctx_0 &
+
+ export CUDA_VISIBLE_DEVICES=1
+ trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --backend pytorch --extra_llm_api_options ./extra-llm-api-config.yml &> log_ctx_1 &
+
#Generation servers
+ export CUDA_VISIBLE_DEVICES=2
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --backend pytorch &> log_gen_0 &
```
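The per-server `export CUDA_VISIBLE_DEVICES=N` lines in this hunk could also be written as a per-command environment prefix, which scopes the variable to a single process instead of every command that follows. The block below is a minimal dry-run sketch under that assumption: `MODEL` and `launch_cmd` are hypothetical names introduced here for illustration, `localhost` is assumed as the host, and the function only prints the command lines rather than starting any server.

```shell
MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Hypothetical helper: print (do not run) the launch line for one server.
# $1 = GPU index, $2 = port, $3 = log file, remaining args = extra flags.
launch_cmd() {
  gpu="$1"; port="$2"; log="$3"
  shift 3
  printf 'CUDA_VISIBLE_DEVICES=%s trtllm-serve %s --host localhost --port %s --backend pytorch' \
    "$gpu" "$MODEL" "$port"
  # Append any extra flags (for example --extra_llm_api_options) verbatim.
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$*"
  fi
  printf ' &> %s &\n' "$log"
}

# Context servers on GPUs 0 and 1, generation server on GPU 2 (same layout as the diff).
launch_cmd 0 8001 log_ctx_0 --extra_llm_api_options ./extra-llm-api-config.yml
launch_cmd 1 8002 log_ctx_1 --extra_llm_api_options ./extra-llm-api-config.yml
launch_cmd 2 8003 log_gen_0
```

Printing first makes it easy to review the GPU-to-port mapping before anything is launched. Note that `&>` is a bash redirection, so the printed lines would need to be run with `bash` rather than a plain POSIX `sh`.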
Once the context and generation servers are launched, you can launch the disaggregated
@@ -30,10 +37,12 @@ hostname: localhost
port: 8000
backend: pytorch
context_servers:
+   num_instances: 2
  urls:
    - "localhost:8001"
    - "localhost:8002"
generation_servers:
+   num_instances: 1
  urls:
    - "localhost:8003"
```
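Since this hunk adds a `num_instances` value next to each `urls` list, the two can drift apart when servers are added or removed. One way to keep them in sync is to generate the file from the URL list. The following is a rough sketch, not part of TensorRT-LLM: `emit_section` and `emit_config` are hypothetical helper names, and the YAML indentation widths are an assumption because the diff view does not preserve them.

```shell
# Hypothetical helper: print one server section, deriving num_instances
# from the number of URLs passed after the section name.
emit_section() {
  name="$1"
  shift
  printf '%s:\n' "$name"
  printf '  num_instances: %d\n' "$#"
  printf '  urls:\n'
  for url in "$@"; do
    printf '    - "%s"\n' "$url"
  done
}

# Assemble the whole config on stdout, using the hostnames and ports from the diff.
emit_config() {
  printf 'hostname: localhost\nport: 8000\nbackend: pytorch\n'
  emit_section context_servers "localhost:8001" "localhost:8002"
  emit_section generation_servers "localhost:8003"
}

emit_config
```

Redirecting the output of `emit_config` to a file of your choosing yields a config whose instance counts always match the listed URLs.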