docs/source/advanced/disaggregated-service.md
*Q. Why do some profiling tools show that TRT-LLM's KV cache transfer does not utilize NVLink even on devices equipped with NVLink?*
A. Ensure TRT-LLM is running with `UCX`-backend `CUDA-aware MPI`, and check the `UCX` version with `ucx_info -v`.
If the UCX version is <= 1.17, set the environment variables `UCX_RNDV_FRAG_MEM_TYPE=cuda` and `UCX_MEMTYPE_CACHE=n` to enable NVLink. For Blackwell architecture GPUs, UCX version >= 1.19 is required to enable NVLink.
If the UCX version is >= 1.18, there are several ways to enable NVLink:
1. Set the environment variables `UCX_CUDA_COPY_ASYNC_MEM_TYPE=cuda`, `UCX_CUDA_COPY_DMABUF=no`, `UCX_MEMTYPE_CACHE=n` and `UCX_RNDV_PIPELINE_ERROR_HANDLING=y`.
2. Set the environment variables `TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE=$Size`, `UCX_MEMTYPE_CACHE=n` and `UCX_RNDV_PIPELINE_ERROR_HANDLING=y`. `$Size` is the size of the buffer used for KV cache transfer; it is recommended to be larger than the KV cache size of the longest request.
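The UCX version branching above can be sketched as a small helper that, given a version string, reports which set of variables applies. This is a minimal illustration: `ucx_env_hint` is a hypothetical function name, and in practice the version string would come from the output of `ucx_info -v`.

```
# Hypothetical helper: map a UCX version string ("major.minor[.patch]")
# to the NVLink-enabling guidance described above.
ucx_env_hint() {
  ver="$1"
  major="${ver%%.*}"        # text before the first dot
  rest="${ver#*.}"
  minor="${rest%%.*}"       # text between the first and second dots
  if [ "$major" -gt 1 ] || [ "$minor" -ge 18 ]; then
    echo "UCX >= 1.18: use the async-copy variables or TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE"
  else
    echo "UCX <= 1.17: set UCX_RNDV_FRAG_MEM_TYPE=cuda and UCX_MEMTYPE_CACHE=n"
  fi
}

ucx_env_hint "1.17"
ucx_env_hint "1.18.0"
```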
A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer, which can be enabled in either of the following ways:
1. Set the environment variables `UCX_RNDV_FRAG_MEM_TYPE=cuda`, `UCX_MEMTYPE_CACHE=n` and `UCX_RNDV_PIPELINE_ERROR_HANDLING=y`.
2. Set the environment variables `TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE=$Size`, `UCX_MEMTYPE_CACHE=n` and `UCX_RNDV_PIPELINE_ERROR_HANDLING=y`. `$Size` is the size of the buffer used for KV cache transfer; it is recommended to be larger than the KV cache size of the longest request.
To achieve optimal performance when using GPU direct RDMA, it is advisable to create the CUDA context before MPI initialization when `TRTLLM_USE_MPI_KVCACHE=1` is set. One possible approach is to rely on MPI environment variables to set the correct device before MPI initialization.
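One common way to apply that approach is a per-rank wrapper script that pins the GPU from the local-rank variable the MPI launcher exports, before the application (and hence MPI) initializes. The sketch below assumes an Open MPI launcher, which exports `OMPI_COMM_WORLD_LOCAL_RANK`; the wrapper name and application binary are illustrative, not part of TRT-LLM.

```
#!/bin/sh
# bind_gpu.sh (hypothetical wrapper): select the GPU matching this rank's
# local rank *before* the application initializes MPI, so the CUDA context
# is created on the intended device.
# Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK; Slurm's srun exports
# SLURM_LOCALID instead. Fall back to device 0 when neither is set.
LOCAL_RANK="${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}"
export CUDA_VISIBLE_DEVICES="$LOCAL_RANK"
exec "$@"
```

It would be launched as, for example, `mpirun -np 8 ./bind_gpu.sh ./my_trtllm_app` (binary name hypothetical), so each rank sees only its own GPU as device 0.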
*Q. Are there any guidelines for performance tuning of KV cache transfer?*
A. Depending on the user's use case, certain sets of environment variables can help avoid poor KV cache transfer performance.
Environment Variable Set A

```
export UCX_RNDV_FRAG_MEM_TYPES=cuda
export UCX_MEMTYPE_CACHE=n
export UCX_RNDV_PIPELINE_ERROR_HANDLING=y
```

This set allows KV cache transfers to utilize NVLink within nodes and GDRDMA between nodes.
Environment Variable Set B

```
export UCX_CUDA_COPY_ASYNC_MEM_TYPE=cuda
export UCX_CUDA_COPY_DMABUF=no
export UCX_MEMTYPE_CACHE=n
export UCX_RNDV_PIPELINE_ERROR_HANDLING=y
```

Set B may provide slightly better performance on a single node than Set A. However, when transferring KV cache across multiple nodes, it may cause program instability.
Environment Variable Set C

```
export TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE=$Size
export UCX_MEMTYPE_CACHE=n
export UCX_RNDV_PIPELINE_ERROR_HANDLING=y
```

Set C can achieve better performance than Sets A and B, both within and between nodes. However, if the KV cache size exceeds the specified `$Size`, performance may degrade.
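A rough way to choose `$Size` for Set C is to compute the KV cache footprint of the longest request from the model configuration. The sketch below uses the standard per-token KV formula with illustrative dimensions; the variable names and the sample numbers (32 layers, 32 KV heads, head dim 128, FP16, 4096 tokens) are hypothetical, not TRT-LLM defaults.

```
# KV cache bytes for one request (illustrative formula):
#   2 (K and V) * layers * kv_heads * head_dim * dtype_bytes * max_tokens
LAYERS=32
KV_HEADS=32
HEAD_DIM=128
DTYPE_BYTES=2      # FP16
MAX_TOKENS=4096
SIZE=$((2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * MAX_TOKENS))
echo "$SIZE"       # 2147483648 bytes (2 GiB) for these sample numbers
export TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE=$SIZE
```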