update readme for disaggregated

chuangz0 · chuangz0 · commit c79853e1a070 · 2025-04-07T09:56:23.000Z
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
diff --git a/examples/disaggregated/README.md b/examples/disaggregated/README.md
@@ -9,11 +9,18 @@ You can use multiple `trtllm-serve` commands to launch the context and generatio
 for disaggregated serving. For example, you could launch two context servers and one generation server as follows:
 
 ```
+echo -e "pytorch_backend_config:\n  enable_overlap_scheduler: true" > extra-llm-api-config.yml
+
 export TRTLLM_USE_UCX_KVCACHE=1
 #Context servers
-trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhsot --port 8001 --backend pytorch &> log_ctx_0 &
-trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhsot --port 8002 --backend pytorch &> log_ctx_1 &
+export CUDA_VISIBLE_DEVICES=0
+trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --backend pytorch --extra_llm_api_options ./extra-llm-api-config.yml &> log_ctx_0 &
+
+export CUDA_VISIBLE_DEVICES=1
+trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --backend pytorch --extra_llm_api_options ./extra-llm-api-config.yml &> log_ctx_1 &
+
 #Generation servers
+export CUDA_VISIBLE_DEVICES=2
 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --backend pytorch &> log_gen_0 &
 ```
 Once the context and generation servers are launched, you can launch the disaggregated
@@ -30,10 +37,12 @@ hostname: localhost
 port: 8000
 backend: pytorch
 context_servers:
+  num_instances: 2
   urls:
       - "localhost:8001"
       - "localhost:8002"
 generation_servers:
+  num_instances: 1
   urls:
       - "localhost:8003"
 ```
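
In the configuration above, each `num_instances` value equals the number of entries under `urls` in the same section (2 for `context_servers`, 1 for `generation_servers`). Assuming that is the intended invariant, a minimal sketch of a pre-launch check could look like the following. `check_instances` is a hypothetical helper, not part of TensorRT-LLM, and it only handles the simple layout shown here (top-level section keys in column 0, one `- ` entry per URL).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TensorRT-LLM): succeeds only if the
# num_instances value under a top-level section equals the number of
# "- " list entries found in that same section.
check_instances() {
  local config_file="$1" section="$2"
  awk -v section="$section" '
    # A non-blank, non-comment line starting in column 0 opens a new top-level key.
    /^[^ \t#]/ { in_section = ($1 == section ":") }
    # Record the declared instance count for the requested section.
    in_section && $1 == "num_instances:" { declared = $2 + 0; seen = 1 }
    # Count list entries (the URLs) inside the requested section.
    in_section && $1 == "-" { urls++ }
    # Exit 0 on a match, 1 on a mismatch or a missing num_instances key.
    END { exit !(seen && declared == urls + 0) }
  ' "$config_file"
}
```

For example, with the configuration saved under a placeholder name such as `my_config.yaml`, running `check_instances my_config.yaml context_servers && check_instances my_config.yaml generation_servers` before starting the disaggregated server would return a non-zero status if either count disagrees with its URL list.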