  - [Expected Result Format](#expected-result-format)
- [H200 min-latency](#h200-min-latency)
  - [Expected Result Format](#expected-result-format-1)
- [H200 max-throughput](#h200-max-throughput)
  - [Expected Result Format](#expected-result-format-2)
- [Exploring more ISL/OSL combinations](#exploring-more-islosl-combinations)
  - [WIP: Enable more features by default](#wip-enable-more-features-by-default)
  - [WIP: Chunked context support on DeepSeek models](#wip-chunked-context-support-on-deepseek-models)
  - [Out of memory issues](#out-of-memory-issues)
## Prerequisites: Install TensorRT-LLM and download models
This section can be skipped if you already have TensorRT-LLM installed and have already downloaded the DeepSeek R1 model checkpoint.
```
Total Token Throughput (tokens/sec): 15707.0888
Total Latency (ms): 993548.8470
Average request latency (ms): 197768.0434
```
## Exploring more ISL/OSL combinations
To benchmark TensorRT-LLM on DeepSeek models with other ISL/OSL combinations, use `prepare_dataset.py` to generate a dataset for the desired lengths and reuse the commands from the previous section, as sketched below. The TensorRT-LLM team is working on enhancements to make this benchmarking process smoother.
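A rough sketch, assuming the script lives at `benchmarks/cpp/prepare_dataset.py` and that the flag names below are still current in your TensorRT-LLM release (check `--help` first); ISL=1024 and OSL=2048 are placeholder values:

```bash
# Generate a synthetic dataset with fixed input/output lengths.
# ISL=1024 and OSL=2048 are example values; pick the combination you
# want to benchmark. Flag names may vary across TensorRT-LLM versions.
python3 benchmarks/cpp/prepare_dataset.py \
    --tokenizer deepseek-ai/DeepSeek-R1 \
    --stdout token-norm-dist \
    --num-requests 1000 \
    --input-mean 1024 --input-stdev 0 \
    --output-mean 2048 --output-stdev 0 \
    > /tmp/dataset_isl1024_osl2048.txt
```

The resulting file can then be passed to `trtllm-bench` through its `--dataset` option, keeping the rest of the command line from the previous section unchanged.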
### WIP: Enable more features by default
Currently, some features, such as CUDA graphs, the overlap scheduler, and attention data parallelism (attention DP), need to be enabled through a user-defined file `extra-llm-api-config.yml`. We are working on enabling these features by default so that users get good out-of-the-box performance on DeepSeek models.
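Until those defaults land, a minimal `extra-llm-api-config.yml` can be sketched as follows. The key names (`use_cuda_graph`, `cuda_graph_padding_enabled`, `cuda_graph_batch_sizes`, `enable_attention_dp`) are assumptions based on recent releases and may differ in your TensorRT-LLM version, so verify them against the LLM API documentation:

```bash
# Write a user-defined config that turns on CUDA graphs and attention DP.
# Key names are version-dependent; adjust them to match your release.
# The overlap scheduler is controlled by a separate, version-dependent
# option and is enabled by default in newer releases.
cat <<EOF > /tmp/extra-llm-api-config.yml
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
  - 1
  - 2
  - 4
  - 8
  - 16
  - 32
  - 64
  - 128
enable_attention_dp: true
EOF
```

The file is then passed to the benchmark or serving command through `--extra_llm_api_options /tmp/extra-llm-api-config.yml`.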
Note that `max_batch_size` and `max_num_tokens` can significantly affect performance. Their default values are carefully chosen and should deliver good performance in most cases, but you may still need to tune them for peak performance.
Generally, make sure that `max_batch_size` is not so low that it bottlenecks throughput, and that `max_num_tokens` is large enough to cover the maximum input sequence length of the samples in the dataset, as discussed in the section "WIP: Chunked context support on DeepSeek models" below.
For more details on `max_batch_size` and `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).
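As a hedged command-line sketch (the flag names assume the `trtllm-bench throughput` subcommand, and the numbers are placeholders rather than tuned recommendations), both limits can be overridden explicitly:

```bash
# Override the runtime limits. max_num_tokens must be at least as large
# as the longest input sequence in the dataset (assumed <= 8192 tokens
# here); max_batch_size is kept high enough not to throttle concurrency.
trtllm-bench --model deepseek-ai/DeepSeek-R1 \
    throughput \
    --dataset /tmp/dataset_isl1024_osl2048.txt \
    --backend pytorch \
    --max_batch_size 128 \
    --max_num_tokens 8192 \
    --extra_llm_api_options /tmp/extra-llm-api-config.yml
```

Parallelism settings (for example `--tp`/`--ep`) from the earlier benchmark commands still apply and are omitted here for brevity.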
### WIP: Chunked context support on DeepSeek models
The TensorRT-LLM team is actively working on chunked context support for DeepSeek models. Until that feature is available, there is a limitation that `max_num_tokens` must be at least as large as the maximum input sequence length of the samples in the dataset.
For more details on `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).
### Out of memory issues
Out-of-memory (OOM) issues may occur in some cases. As a workaround, consider reducing `kv_cache_free_gpu_mem_fraction` to a smaller value, as in the sketch below. We are investigating and addressing the problem.
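For example (assuming the `trtllm-bench` flag below; in some releases the same setting is instead exposed as a key in `extra-llm-api-config.yml`), the fraction can be lowered as a first step:

```bash
# Reserve a smaller share of free GPU memory for the KV cache as an
# OOM workaround. 0.7 is an illustrative value; raise it back once the
# run is stable to recover KV-cache capacity and throughput.
trtllm-bench --model deepseek-ai/DeepSeek-R1 \
    throughput \
    --dataset /tmp/dataset_isl1024_osl2048.txt \
    --backend pytorch \
    --kv_cache_free_gpu_mem_fraction 0.7 \
    --extra_llm_api_options /tmp/extra-llm-api-config.yml
```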
- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.