Commit b088016

Update TensorRT-LLM v0.14.0 (#2401)
1 parent 9078696 commit b088016

384 files changed (+14603, -5392 lines)

.gitignore (+1, -1)

@@ -37,7 +37,7 @@ tensorrt_llm/bindings.pyi
 tensorrt_llm/bindings/*.pyi
 *docs/cpp_docs*
 *docs/source/_cpp_gen*
-docs/source/llm-api
+docs/source/llm-api/*.rst
 docs/source/llm-api-examples/llm_*.rst
 *.swp

.gitmodules (+4, -1)

@@ -13,4 +13,7 @@
 	url = https://github.com/NVIDIA/NVTX.git
 [submodule "3rdparty/ucxx"]
 	path = 3rdparty/ucxx
-	url = https://github.com/GuanLuo/ucxx.git
+	url = https://github.com/rapidsai/ucxx.git
+[submodule "3rdparty/pybind11"]
+	path = 3rdparty/pybind11
+	url = https://github.com/pybind/pybind11.git

.pre-commit-config.yaml (+1, -1)

@@ -46,5 +46,5 @@ repos:
   args:
     - --skip=".git,3rdparty"
     - --exclude-file=examples/whisper/tokenizer.py
-    - --ignore-words-list=rouge,inout,atleast,strat,nd,subtile,thrid
+    - --ignore-words-list=rouge,inout,atleast,strat,nd,subtile,thrid,improbe
   exclude: 'tests/llm-test-defs/turtle/test_input_files'

3rdparty/pybind11

Submodule pybind11 added at f99ffd7

README.md (+23, -8)

@@ -8,7 +8,7 @@ TensorRT-LLM
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.5.1-green)](https://developer.nvidia.com/cuda-downloads)
 [![trt](https://img.shields.io/badge/TRT-10.4.0-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.13.0-green)](./tensorrt_llm/version.py)
+[![version](https://img.shields.io/badge/release-0.14.0-green)](./tensorrt_llm/version.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

 [Architecture](./docs/source/architecture/overview.md)   |   [Results](./docs/source/performance/perf-overview.md)   |   [Examples](./examples/)   |   [Documentation](./docs/source/)
@@ -17,6 +17,24 @@ TensorRT-LLM
 <div align="left">

 ## Latest News
+* [2024/09/29] 🌟 AI at Meta PyTorch + TensorRT v2.4 🌟 ⚡TensorRT 10.1 ⚡PyTorch 2.4 ⚡CUDA 12.4 ⚡Python 3.12
+[➡️ link](https://github.com/pytorch/TensorRT/releases/tag/v2.4.0)
+<div align="center">
+<img src="docs/source/media/image-09-29-2024.png" width="50%">
+<div align="left">
+
+* [2024/09/17] ✨ NVIDIA TensorRT-LLM Meetup
+[➡️ link](https://drive.google.com/file/d/1RR8GqC-QbuaKuHj82rZcXb3MS20SWo6F/view?usp=share_link)
+
+* [2024/09/17] ✨ Accelerating LLM Inference at Databricks with TensorRT-LLM
+[➡️ link](https://drive.google.com/file/d/1NeSmrLaWRJAY1rxD9lJmzpB9rzr38j8j/view?usp=sharing)
+
+* [2024/09/17] ✨ TensorRT-LLM @ Baseten
+[➡️ link](https://drive.google.com/file/d/1Y7L2jqW-aRmt31mCdqhwvGMmCSOzBUjG/view?usp=share_link)
+
+* [2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML
+[➡️ link](https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml)
+
 * [2024/08/20] 🏎️SDXL with #TensorRT Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12
 [➡️ link](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/)

@@ -43,6 +61,9 @@ TensorRT-LLM
 * [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100.
 [➡️ Tech blog](https://developer.nvidia.com/blog/achieving-high-mixtral-8x7b-performance-with-nvidia-h100-tensor-core-gpus-and-tensorrt-llm?ncid=so-twit-928467)

+<details close>
+<summary>Previous News</summary>
+
 * [2024/06/24] Enhanced with NVIDIA #TensorRT #LLM, @upstage.ai’s solar-10.7B-instruct is ready to power your developer projects through our API catalog 🏎️. ✨[➡️ link](https://build.nvidia.com/upstage/solar-10_7b-instruct?snippet_tab=Try )

 * [2024/06/18] CYMI: 🤩 Stable Diffusion 3 dropped last week 🎊 🏎️ Speed up your SD3 with #TensorRT INT8 Quantization[➡️ link](https://build.nvidia.com/upstage/solar-10_7b-instruct?snippet_tab=Try )
@@ -55,10 +76,6 @@ Technical Deep Dive for serious coders ✅+99% compression ✅1 set of weights
 * [2024/06/04] ✨ #TensorRT and GeForce #RTX unlock ComfyUI SD superhero powers 🦸⚡ 🎥 Demo: [➡️ link](https://youtu.be/64QEVfbPHyg)
 📗 DIY notebook: [➡️ link](https://console.brev.dev/launchable/deploy?userID=2x2sil999&orgID=ktj33l4xj&name=ComfyUI_TensorRT&instance=L4%40g2-standard-4%3Anvidia-l4%3A1&diskStorage=500&cloudID=GCP&baseImage=docker.io%2Fpytorch%2Fpytorch%3A2.2.0-cuda12.1-cudnn8-runtime&ports=ComfUI%3A8188&file=https%3A%2F%2Fgithub.jpy.wang%2Fbrevdev%2Fnotebooks%2Fblob%2Fmain%2Ftensorrt-comfyui.ipynb&launchableID=env-2hQX3n7ae5mq3NjNZ32DfAG0tJf)

-<details close>
-<summary>Previous News</summary>
-
-
 * [2024/05/28] ✨#TensorRT weight stripping for ResNet-50 ✨ ✅+99% compression
 ✅1 set of weights → ** GPUs\ ✅0 performance loss ✅** models…LLM, CNN, etc
 👀 📚 DIY [➡️ link](https://console.brev.dev/launchable/deploy?userID=2x2sil999&orgID=ktj33l4xj&launchableID=env-2h6bym7h5GFNho3vpWQQeUYMwTM&instance=L4%40g6.xlarge&diskStorage=500&cloudID=devplane-brev-1&baseImage=nvcr.io%2Fnvidia%2Ftensorrt%3A24.05-py3&file=https%3A%2F%2Fgithub.jpy.wang%2FNVIDIA%2FTensorRT%2Fblob%2Frelease%2F10.0%2Fsamples%2Fpython%2Fsample_weight_stripping%2Fnotebooks%2Fweight_stripping.ipynb&name=tensorrt_weight_stripping_resnet50)
@@ -68,10 +85,8 @@ Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co

 * [2024/05/08] NVIDIA TensorRT Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT [➡️ blog](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)

-
 * [2024/05/07] 🦙🦙🦙 24,000 tokens per second 🛫Meta Llama 3 takes off with #TensorRT #LLM 📚[➡️ link](https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/)

-
 * [2024/02/06] [🚀 Speed up inference with SOTA quantization techniques in TRT-LLM](./docs/source/blogs/quantization-in-TRT-LLM.md)
 * [2024/01/30] [ New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget](./docs/source/blogs/XQA-kernel.md)
 * [2023/12/04] [Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100](./docs/source/blogs/Falcon180B-H200.md)
@@ -88,7 +103,7 @@ Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co
 ## TensorRT-LLM Overview

 TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference.
-It provides state-of-the-art optimziations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ++) and much more, to perform inference efficiently on NVIDIA GPUs
+It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ++) and much more, to perform inference efficiently on NVIDIA GPUs

 TensorRT-LLM provides a Python API to build LLMs into optimized
 [TensorRT](https://developer.nvidia.com/tensorrt) engines.
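
The README text above points to TensorRT-LLM's Python API for building LLMs into optimized TensorRT engines. Below is a minimal sketch of how that high-level LLM API (documented under docs/source/llm-api, the path touched by the .gitignore change above) is typically driven; the model name, prompts, and sampling values are illustrative assumptions and are not part of this commit.

```python
# Minimal sketch of the TensorRT-LLM high-level Python (LLM) API.
# The Hugging Face model name, prompts, and sampling values below are
# illustrative placeholders, not part of this commit.
from tensorrt_llm import LLM, SamplingParams


def main():
    # Creating the LLM object builds the TensorRT engine for the given model
    # (or reuses a previously built one).
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Run optimized inference on the built engine.
    for output in llm.generate(prompts, sampling_params):
        print(f"{output.prompt!r} -> {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```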

benchmarks/README.md (+3, -2)

@@ -7,5 +7,6 @@ There are currently three workflows to benchmark TensorRT-LLM:
   - The recommended workflow that uses TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
 * [Python benchmarks](./python)
   - The Python benchmarking scripts can only benchmark the Python runtime, which do not support the latest features, such as in-flight batching.
-* [The Python benchmarking suite](./Suite.md)
-  - This benchmarking suite is a current work in progress and is prone to large changes.
+* [The Python benchmarking suite](../docs/source/performance/perf-benchmarking.md)
+  - This benchmarker is native to TensorRT-LLM and is a Python benchmarker for reproducing and testing the performance of TensorRT-LLM.
+  - _NOTE_: This benchmarking suite is a current work in progress and is prone to large changes.

benchmarks/cpp/gptManagerBenchmark.cpp (+86, -13)

@@ -145,6 +145,7 @@ struct BenchmarkParams
 {
     std::optional<SizeType32> maxTokensInPagedKvCache{std::nullopt};
     std::optional<float> freeGpuMemoryFraction{std::nullopt};
+    std::optional<float> crossKvCacheFraction{std::nullopt};
     bool enableTrtOverlap{false};
     bool enableBlockReuse{false};
     bool enableChunkedContext{false};
@@ -159,6 +160,8 @@ struct BenchmarkParams
     std::optional<int> sinkTokenLength{std::nullopt};
     bool multiBlockMode{true};
     bool enableContextFMHAFP32Acc{false};
+    bool cudaGraphMode{false};
+    SizeType32 cudaGraphCacheSize{0};

     // lora / peft params
     std::optional<std::string> loraDir{std::nullopt};
@@ -470,7 +473,38 @@ class Recorder
             mRequestBenchInfos[requestId].firstTokenSeen = true;
         }

-        mRequestBenchInfos[requestId].outputLength += 1;
+        mRequestBenchInfos[requestId].decodingIter += 1;
+    }
+
+    void recordToken(uint64_t requestId, std::list<NamedTensor> const& responseTensors)
+    {
+        int32_t outputLength = 1;
+        for (auto& tensor : responseTensors)
+        {
+            if (tensor.name == inference_request::kSequenceLengthTensorName)
+            {
+                // Tensor of shape nBeams, and we only need the first one
+                outputLength = *(bufferCast<int32_t>(*(tensor.tensor)));
+                break;
+            }
+        }
+
+        mRequestBenchInfos[requestId].outputLength += outputLength;
+        this->recordToken(requestId);
+    }
+
+    void recordToken(uint64_t requestId, texec::Response const& response)
+    {
+        auto outputTokenIds = response.getResult().outputTokenIds;
+
+        int32_t outputLength = 1;
+        for (auto const& beam : outputTokenIds)
+        {
+            outputLength = std::max(static_cast<int32_t>(beam.size()), outputLength);
+        }
+
+        mRequestBenchInfos[requestId].outputLength += outputLength;
+        this->recordToken(requestId);
     }

     void recordEnd(uint64_t requestId, std::list<NamedTensor> const& responseTensors, bool hasError)
@@ -500,7 +534,7 @@ class Recorder
         }
         else
         {
-            this->recordToken(requestId);
+            this->recordToken(requestId, responseTensors);
         }
     }

@@ -532,7 +566,7 @@ class Recorder
             }
             else
             {
-                this->recordToken(requestId);
+                this->recordToken(requestId, response);
             }
         }
     }
@@ -818,11 +852,13 @@ class ExecutorServer
         texec::SchedulerConfig schedulerConfig(capacitySchedulerPolicy);
         texec::KvCacheConfig kvCacheConfig(benchmarkParams.enableBlockReuse, benchmarkParams.maxTokensInPagedKvCache,
             benchmarkParams.maxAttentionWindowVec, benchmarkParams.sinkTokenLength,
-            benchmarkParams.freeGpuMemoryFraction, benchmarkParams.kvHostCacheSize, benchmarkParams.kvOnboardBlocks);
+            benchmarkParams.freeGpuMemoryFraction, benchmarkParams.kvHostCacheSize, benchmarkParams.kvOnboardBlocks,
+            benchmarkParams.crossKvCacheFraction);
         texec::PeftCacheConfig peftCacheConfig(0, benchmarkParams.loraDeviceNumModLayers, 8, 64, 4, 4, 4, 24, 8,
             std::nullopt, benchmarkParams.loraHostCacheSize);
-        texec::ExtendedRuntimePerfKnobConfig extendedRuntimePerfKnobConfig(
-            benchmarkParams.multiBlockMode, benchmarkParams.enableContextFMHAFP32Acc);
+        texec::ExtendedRuntimePerfKnobConfig extendedRuntimePerfKnobConfig(benchmarkParams.multiBlockMode,
+            benchmarkParams.enableContextFMHAFP32Acc, benchmarkParams.cudaGraphMode,
+            benchmarkParams.cudaGraphCacheSize);
         texec::ExecutorConfig executorConfig(
             maxBeamWidth, schedulerConfig, kvCacheConfig, benchmarkParams.enableChunkedContext, true);
         executorConfig.setGpuWeightsPercent(benchmarkParams.gpuWeightsPercent);
@@ -940,7 +976,7 @@ class ExecutorServer
             {
                 if (!warmup && !response.hasError())
                 {
-                    mRecorder->recordToken(reqId);
+                    mRecorder->recordToken(reqId, response);
                 }
             }
         }
@@ -1228,7 +1264,7 @@ class GptServer
         {
             if (errMsg.empty())
             {
-                mRecorder->recordToken(requestId);
+                mRecorder->recordToken(requestId, response_tensors);
             }
         }
     }
@@ -1430,6 +1466,10 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
     {
         optionalParams.kvCacheConfig.freeGpuMemoryFraction = benchmarkParams.freeGpuMemoryFraction;
     }
+    if (benchmarkParams.crossKvCacheFraction)
+    {
+        optionalParams.kvCacheConfig.crossKvCacheFraction = benchmarkParams.crossKvCacheFraction;
+    }
     if (benchmarkParams.maxAttentionWindowVec)
     {
         optionalParams.kvCacheConfig.maxAttentionWindowVec = benchmarkParams.maxAttentionWindowVec;
@@ -1458,8 +1498,8 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
         : benchmarkParams.executorLookaheadConfig.has_value() ? texec::DecodingMode::Lookahead()
         : texec::DecodingMode::Auto(),
         benchmarkParams.executorLookaheadConfig, benchmarkParams.medusaChoices);
-    optionalParams.extendedRuntimePerfKnobConfig = texec::ExtendedRuntimePerfKnobConfig(
-        benchmarkParams.multiBlockMode, benchmarkParams.enableContextFMHAFP32Acc);
+    optionalParams.extendedRuntimePerfKnobConfig = texec::ExtendedRuntimePerfKnobConfig(benchmarkParams.multiBlockMode,
+        benchmarkParams.enableContextFMHAFP32Acc, benchmarkParams.cudaGraphMode, benchmarkParams.cudaGraphCacheSize);

     auto const jsonConfig = GptJsonConfig::parse(engineDir / "config.json");
     auto const worldConfig = WorldConfig::mpi(jsonConfig.getGpusPerNode(), jsonConfig.getTensorParallelism(),
@@ -1874,6 +1914,8 @@ int main(int argc, char* argv[])
         "random_seed", "integer random seed for exponential time delays.", cxxopts::value<int>()->default_value("420"));
     options.add_options()(
         "kv_cache_free_gpu_mem_fraction", "K-V Cache Free Gpu Mem Fraction.", cxxopts::value<float>());
+    options.add_options()(
+        "cross_kv_cache_fraction", "Cross K-V Cache Fraction (from 0.0 to 1.0).", cxxopts::value<float>());
     options.add_options()("request_rate",
         "request rate in reqs/sec. Skipping this arg or negative value will trigger offline/0-delay.",
         cxxopts::value<float>());
@@ -1895,7 +1937,8 @@ int main(int argc, char* argv[])
     options.add_options()("return_generation_logits", "Whether to return generation logits.",
         cxxopts::value<bool>()->default_value("false"));

-    options.add_options()("scheduler_policy", "Choose scheduler policy between max_utilization/guaranteed_no_evict.",
+    options.add_options()("scheduler_policy",
+        "Choose scheduler policy between max_utilization/guaranteed_no_evict/static_batch.",
         cxxopts::value<std::string>()->default_value("guaranteed_no_evict"));

     options.add_options()("first_batch_delay",
@@ -1946,6 +1989,12 @@ int main(int argc, char* argv[])
         cxxopts::value<bool>()->default_value("true"));
     options.add_options()(
         "encoder_engine_dir", "Directory that store the engines of the encoder models.", cxxopts::value<std::string>());
+    options.add_options()("cuda_graph_mode", "When enabled, inference is executed with cuda graph.",
+        cxxopts::value<bool>()->default_value("false"));
+    options.add_options()("cuda_graph_cache_size",
+        "Specify how many cuda graphs are cached in the runtime. Larger cache gives better perf, but consumes more GPU "
+        "memory.",
+        cxxopts::value<SizeType32>()->default_value("0"));

     options.add_options()("enable_context_fmha_fp32_acc", "Enable FMHA runner FP32 accumulation",
         cxxopts::value<bool>()->default_value("false"));
@@ -2040,6 +2089,20 @@ int main(int argc, char* argv[])
     {
        benchmarkParams.freeGpuMemoryFraction = result["kv_cache_free_gpu_mem_fraction"].as<float>();
     }
+    // Argument: K-V Cache Cross Attention Fraction. Only applicable to enc-dec models.
+    if (result.count("encoder_engine_dir") && result.count("decoder_engine_dir"))
+    {
+        if (result.count("cross_kv_cache_fraction"))
+        {
+            benchmarkParams.crossKvCacheFraction = result["cross_kv_cache_fraction"].as<float>();
+        }
+        else
+        {
+            benchmarkParams.crossKvCacheFraction
+                = 0.5f; // default value if not set. but non enc-dec should not even have this param set
+        }
+    }
+
     // Argument: Enable TRT overlap
     benchmarkParams.enableTrtOverlap = result["enable_trt_overlap"].as<bool>();

@@ -2131,6 +2194,12 @@ int main(int argc, char* argv[])
     // Argument: enable_context_fmha_fp32_acc
     benchmarkParams.enableContextFMHAFP32Acc = result["enable_context_fmha_fp32_acc"].as<bool>();

+    // Argument: cuda_graph_mode
+    benchmarkParams.cudaGraphMode = result["cuda_graph_mode"].as<bool>();
+
+    // Argument: cuda_graph_mode
+    benchmarkParams.cudaGraphCacheSize = result["cuda_graph_cache_size"].as<SizeType32>();
+
     std::optional<TokenIdType> padId;
     // Argument: Padding token id
     if (result.count("pad_id"))
@@ -2168,6 +2237,10 @@ int main(int argc, char* argv[])
     {
         capacitySchedulerPolicy = texec::CapacitySchedulerPolicy::kGUARANTEED_NO_EVICT;
     }
+    else if (capacitySchedulerPolicyArg == "static_batch")
+    {
+        capacitySchedulerPolicy = texec::CapacitySchedulerPolicy::kSTATIC_BATCH;
+    }
     else
     {
         TLLM_LOG_ERROR("Unexpected scheduler policy: " + capacitySchedulerPolicyArg);
@@ -2246,14 +2319,14 @@ int main(int argc, char* argv[])
     {
         texec::ModelType executorModelType;
         std::optional<std::string> decoderEngineDir = std::nullopt, encoderEngineDir = std::nullopt;
-        if (result.count("encoder_engine_dir") && result.count("engine_dir"))
+        if (result.count("encoder_engine_dir") && result.count("decoder_engine_dir"))
        {
             TLLM_CHECK_WITH_INFO(api == "executor", "encoder-decoder only support executor api.");
             TLLM_CHECK_WITH_INFO(
                 modelType == TrtGptModelType::InflightFusedBatching, "encoder-decoder only support inflight batching.");
             executorModelType = texec::ModelType::kENCODER_DECODER;
-            decoderEngineDir = result["engine_dir"].as<std::string>();
             encoderEngineDir = result["encoder_engine_dir"].as<std::string>();
+            decoderEngineDir = result["decoder_engine_dir"].as<std::string>();
         }
         else if (result.count("engine_dir"))
         {

benchmarks/cpp/utils/prepare_real_data.py (-2)

@@ -231,8 +231,6 @@ def dataset(root_args, **kwargs):
         }, root_args.output)
     else:
         print_dataset(
-            task_ids,
             input_ids,
             output_lens,
-            tokenizer=None,
         )

benchmarks/python/gpt_benchmark.py (+1, -1)

@@ -80,7 +80,7 @@ def __init__(self, args, batch_sizes, in_out_lens, gpu_weights_percents,

         kv_cache_type = KVCacheType.CONTINUOUS
         if hasattr(self, 'kv_cache_type'):
-            kv_cache_type = self.kv_cache_type
+            kv_cache_type = KVCacheType(self.kv_cache_type)
         else:
             if hasattr(self, 'paged_kv_cache'):
                 kv_cache_type = KVCacheType.PAGED if self.paged_kv_cache == True else KVCacheType.CONTINUOUS
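
The gpt_benchmark.py change above passes the stored self.kv_cache_type value through the KVCacheType constructor, so a raw value is converted into an enum member rather than used as-is. A minimal sketch of that coercion pattern with a stand-in enum (the real KVCacheType comes from the TensorRT-LLM bindings and may differ in values and import path):

```python
# Stand-in illustration of the enum-coercion pattern used above; the real
# KVCacheType enum is provided by the TensorRT-LLM bindings and may differ.
from enum import Enum


class KVCacheType(Enum):
    CONTINUOUS = "continuous"
    PAGED = "paged"
    DISABLED = "disabled"


def resolve_kv_cache_type(raw):
    # Enum(value) looks a member up by its value, and passing a member through
    # the constructor returns it unchanged, so strings loaded from a config and
    # already-resolved members are handled uniformly.
    return KVCacheType(raw)


print(resolve_kv_cache_type("paged"))            # KVCacheType.PAGED
print(resolve_kv_cache_type(KVCacheType.PAGED))  # KVCacheType.PAGED
```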
