
Assertion failed: Must set crossKvCacheFraction for encoder-decoder model #2419


Open
2 of 4 tasks
Saeedmatt3r opened this issue Nov 6, 2024 · 4 comments
Labels
bug (Something isn't working), triaged (Issue has been triaged by maintainers)

Comments


Saeedmatt3r commented Nov 6, 2024

System Info

GPU: A10
Base Image: FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
TensorRT-LLM:

  • 0.12.0: works, but I can't use it because of a version mismatch between TensorRT and trt-llm-backend
  • 0.13.0: works, but I can't use it because of a version mismatch between TensorRT and trt-llm-backend
  • 0.14.0: not working: Assertion failed: Must set crossKvCacheFraction for encoder-decoder model
  • 0.15.0.dev2024110500: not working

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the problem:

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

RUN pip3 install tensorrt_llm==0.14.0 -U --pre --extra-index-url https://pypi.nvidia.com

Then run the official whisper example:

INFERENCE_PRECISION=float16
WEIGHT_ONLY_PRECISION=int8
MAX_BEAM_WIDTH=4
MAX_BATCH_SIZE=8
checkpoint_dir=weights/whisper_large_v3_weights_${WEIGHT_ONLY_PRECISION}
output_dir=weights/whisper_large_v3_${WEIGHT_ONLY_PRECISION}

# Convert the large-v3 model weights into TensorRT-LLM format.
python3 convert_checkpoint.py \
                --use_weight_only \
                --weight_only_precision $WEIGHT_ONLY_PRECISION \
                --output_dir $checkpoint_dir

# Build the large-v3 model using trtllm-build
trtllm-build  --checkpoint_dir ${checkpoint_dir}/encoder \
              --output_dir ${output_dir}/encoder \
              --moe_plugin disable \
              --enable_xqa disable \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --gemm_plugin disable \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --max_input_len 3000 --max_seq_len=3000

trtllm-build  --checkpoint_dir ${checkpoint_dir}/decoder \
              --output_dir ${output_dir}/decoder \
              --moe_plugin disable \
              --enable_xqa disable \
              --max_beam_width ${MAX_BEAM_WIDTH} \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --max_seq_len 114 \
              --max_input_len 14 \
              --max_encoder_input_len 3000 \
              --gemm_plugin ${INFERENCE_PRECISION} \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --gpt_attention_plugin ${INFERENCE_PRECISION}

python3 run.py --engine_dir "$output_dir" --dataset hf-internal-testing/librispeech_asr_dummy --num_beams $MAX_BEAM_WIDTH --batch_size $MAX_BATCH_SIZE --enable_warmup --name librispeech_dummy_large_v3

Expected behavior

It should run on the dataset without any problem.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.14.0
[TensorRT-LLM][INFO] Engine version 0.14.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] Engine version 0.14.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.14.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.14.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.14.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 3000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (3000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2999 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 622 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 292.97 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 615 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 3000 from build config.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 4
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 114
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (114) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 912
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 113 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 1066 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 833.73 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1672 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.54 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 16.82 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 21.98 GiB, available: 18.88 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1740
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
Traceback (most recent call last):
  File "/app/TensorRT-LLM/examples/whisper/run.py", line 479, in <module>
    model = WhisperTRTLLM(args.engine_dir, args.debug, args.assets_dir,
  File "/app/TensorRT-LLM/examples/whisper/run.py", line 327, in __init__
    self.model_runner_cpp = ModelRunnerCpp.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 206, in from_dir
    executor = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Must set crossKvCacheFraction for encoder-decoder model (/home/jenkins/agent/workspace/LLM/release-0.14/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:207)
1       0x7493ec8cccc7 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2       0x7493ec914e4a /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x769e4a) [0x7493ec914e4a]
3       0x7493eea3bac4 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 756
4       0x7493eeac1d7d tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 125
5       0x7493eeac233a tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::string, tensorrt_llm::executor::Tensor, std::less<std::string>, std::allocator<std::pair<std::string const, tensorrt_llm::executor::Tensor> > > > const&) + 954
6       0x7493eeac3377 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2135
7       0x7493eeaaeb33 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 99
8       0x749518615609 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xdd609) [0x749518615609]
9       0x749518598a8f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x60a8f) [0x749518598a8f]
10      0x61696fec6b2e python3(+0x15cb2e) [0x61696fec6b2e]
11      0x61696febd2db _PyObject_MakeTpCall + 603
12      0x61696fed56b0 python3(+0x16b6b0) [0x61696fed56b0]
13      0x61696fed1ad7 python3(+0x167ad7) [0x61696fed1ad7]
14      0x61696febd68b python3(+0x15368b) [0x61696febd68b]
15      0x7495185928bb /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5a8bb) [0x7495185928bb]
16      0x61696febd2db _PyObject_MakeTpCall + 603
17      0x61696feb5d27 _PyEval_EvalFrameDefault + 27415
18      0x61696fed5281 python3(+0x16b281) [0x61696fed5281]
19      0x61696fed5f22 PyObject_Call + 290
20      0x61696feb1a6e _PyEval_EvalFrameDefault + 10334
21      0x61696febc474 _PyObject_FastCallDictTstate + 196
22      0x61696fed14b4 python3(+0x1674b4) [0x61696fed14b4]
23      0x61696febd27c _PyObject_MakeTpCall + 508
24      0x61696feb56e6 _PyEval_EvalFrameDefault + 25814
25      0x61696feac016 python3(+0x142016) [0x61696feac016]
26      0x61696ffa18b6 PyEval_EvalCode + 134
27      0x61696ffcc918 python3(+0x262918) [0x61696ffcc918]
28      0x61696ffc61db python3(+0x25c1db) [0x61696ffc61db]
29      0x61696ffcc665 python3(+0x262665) [0x61696ffcc665]
30      0x61696ffcbb48 _PyRun_SimpleFileObject + 424
31      0x61696ffcb793 _PyRun_AnyFileObject + 67
32      0x61696ffbe2ce Py_RunMain + 702
33      0x61696ff9470d Py_BytesMain + 45
34      0x7496408cfd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7496408cfd90]
35      0x7496408cfe40 __libc_start_main + 128
36      0x61696ff94605 _start + 37

Using the latest available pip package (0.15.0.dev2024110500):

+ trtllm-build --checkpoint_dir weights/whisper_large_v3_weights_int8/encoder --output_dir weights/whisper_large_v3_int8/encoder --moe_plugin disable --enable_xqa disable --max_batch_size 8 --gemm_plugin disable --bert_attention_plugin float16 --max_input_len 3000 --max_seq_len=3000
[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024110500
[11/06/2024-12:28:13] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set gemm_plugin to None.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set nccl_plugin to auto.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set lora_plugin to None.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set moe_plugin to None.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set context_fmha to True.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set remove_input_padding to True.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set reduce_fusion to False.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set enable_xqa to False.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set tokens_per_block to 64.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set multiple_profiles to False.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set paged_state to True.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set streamingllm to False.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set use_fused_mlp to True.
[11/06/2024-12:28:13] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[11/06/2024-12:28:13] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
[11/06/2024-12:28:13] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_audio_ctx = 1500
[11/06/2024-12:28:13] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 602, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 425, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 360, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 635, in from_checkpoint
    model = cls(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 564, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1939, in __init__
    self.position_embedding = Embedding(self.config.max_position_embeddings,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/embedding.py", line 63, in __init__
    shape = (math.ceil(self.num_embeddings / self.tp_size),
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
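The TypeError suggests that max_position_embeddings ends up as None in the converted decoder checkpoint config. A hedged way to confirm that on disk, assuming the standard TensorRT-LLM checkpoint layout with a config.json under the decoder directory produced by convert_checkpoint.py above:

import json
import pathlib

# Hypothetical diagnostic: inspect the decoder checkpoint config written by
# convert_checkpoint.py (path taken from the reproduction script above).
cfg_path = pathlib.Path("weights/whisper_large_v3_weights_int8/decoder/config.json")
cfg = json.loads(cfg_path.read_text())

# If this prints None, the converter did not emit the field that the decoder's
# position Embedding layer divides by tp_size, which matches the TypeError above.
print(cfg.get("max_position_embeddings"))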

Additional notes

I also checked the trt-llm-backend for the whisper example, and that was not working either, failing with the following error:

c_python_backend_utils.TritonModelException: [TensorRT-LLM][ERROR] Assertion failed: input tokens tensor not provided (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/utils.cc:107)
@Saeedmatt3r added the bug (Something isn't working) label on Nov 6, 2024
@hello-11 added the triaged (Issue has been triaged by maintainers) and Investigating labels on Nov 6, 2024
@yuekaizhang

@Saeedmatt3r Thanks for reporting the issue. The fix will be synced to GitHub next week. For a quick fix, you need to modify this line: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L370.

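For reference, the quick fix boils down to passing a cross-attention KV-cache fraction when the runner is constructed. A minimal sketch, assuming the ModelRunnerCpp.from_dir keyword is named cross_kv_cache_fraction in 0.14/0.15 and that the surrounding kwargs mirror the example's; verify against tensorrt_llm/runtime/model_runner_cpp.py in your installation:

# In examples/whisper/run.py, where runner_kwargs is assembled for ModelRunnerCpp
# (kwarg names other than cross_kv_cache_fraction are illustrative):
runner_kwargs = dict(
    engine_dir=engine_dir,
    is_enc_dec=True,              # Whisper is an encoder-decoder model
    max_batch_size=batch_size,
    max_beam_width=num_beams,
    # Fraction of the paged KV cache reserved for cross attention. Required for
    # encoder-decoder engines, and must stay unset for decoder-only engines.
    cross_kv_cache_fraction=0.5,
)
model_runner_cpp = ModelRunnerCpp.from_dir(**runner_kwargs)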


Saeedmatt3r commented Nov 7, 2024

@yuekaizhang Thanks. To be honest, I had already done that; I just wanted to report the issues in 0.14 and 0.15. I also think the official whisper setup on trt-llm-backend is not working as expected; I used 0.15 for engine creation and it did not work. I will open another ticket in that repo.

@HPUedCSLearner

I tried the latest official image and followed the official tutorial, and I get the same bug:

root@9d0fd755a252:/ws# I0224 03:28:06.809225 3081 model_lifecycle.cc:849] "successfully loaded 'postprocessing'"
I0224 03:28:06.809270 3081 model_lifecycle.cc:849] "successfully loaded 'preprocessing'"
I0224 03:28:06.821670 3081 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2560
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2560) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 5120
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2559 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I0224 03:28:08.935412 3081 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][INFO] Loaded engine size: 12860 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 402.52 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12855 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.88 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.18 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 8.70 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 251
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
E0224 03:28:22.059961 3081 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1       0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2       0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3       0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4       0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5       0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6       0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7       0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8       0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9       0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10      0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11      0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12      0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13      0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14      0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15      0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16      0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17      0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) 
[0x7fdbfc64bec3]\n18      0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19      0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20      0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21      0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22      0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23      0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24      0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25      0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26      0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c]"
E0224 03:28:22.060118 3081 model_lifecycle.cc:654] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1       0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2       0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3       0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4       0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5       0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6       0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7       0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8       0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9       0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10      0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11      0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12      0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13      0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14      0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15      0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16      0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17      0x7fdbfc64bec3 
/usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]\n18      0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19      0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20      0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21      0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22      0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23      0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24      0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25      0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26      0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c]"
I0224 03:28:22.060173 3081 model_lifecycle.cc:789] "failed to load 'tensorrt_llm'"
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1
[02/24/2025-03:28:28] [TRT] [I] Loaded engine size: 599 MiB
[02/24/2025-03:28:28] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +29, now: CPU 0, GPU 624 (MiB)
I0224 03:28:28.368156 3081 model_lifecycle.cc:849] "successfully loaded 'multimodal_encoders'"
E0224 03:28:28.368315 3081 model_repository_manager.cc:703] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1       0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2       0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3       0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4       0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5       0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6       0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7       0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8       0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9       0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10      0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11      0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12      0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13      0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14      0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15      0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16      
0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17      0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]\n18      0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19      0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20      0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21      0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22      0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23      0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24      0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25      0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26      0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c];"
I0224 03:28:28.368439 3081 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0224 03:28:28.368472 3081 server.cc:631] 
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                             |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/ |
|             |                                                                 | backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_", |
|             |                                                                 | "default-max-batch-size":"4"}}                                                     |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/ |
|             |                                                                 | backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}       |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+

I0224 03:28:28.368547 3081 server.cc:674] 
+---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+
| Model               | Version | Status                                                                                                                             |
+---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+
| multimodal_encoders | 1       | READY                                                                                                                              |
| postprocessing      | 1       | READY                                                                                                                              |
| preprocessing       | 1       | READY                                                                                                                              |
| tensorrt_llm        | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set cross |
|                     |         | KvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281 |
|                     |         | )                                                                                                                                  |
|                     |         | 3       0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_l |
|                     |         | lm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt |
|                     |         | _llm::batch_manager::TrtGptModelOptionalParams const&) + 489                                                                       |
|                     |         | 6       0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::file |
|                     |         | system::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474            |
|                     |         | 3       0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489 |
|                     |         | 9       0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_bat |
|                     |         | cher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185                                                                        |
|                     |         | 5       0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229 |
|                     |         | 6       0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474 |
|                     |         | 7       0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87 |
|                     |         | 8       0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]                  |
|                     |         | 9       0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185 |
|                     |         | 10      0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66 |
|                     |         | 11      0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153                                                                 |
|                     |         | 12      0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]                                 |
|                     |         | 13      0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]                                 |
|                     |         | 14      0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]                                 |
|                     |         | 15      0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]                                 |
|                     |         | 16      0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]                                 |
|                     |         | 17      0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]                                              |
|                     |         | 18      0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]                                 |
|                     |         | 19      0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]                                 |
|                     |         | 20      0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]                                 |
|                     |         | 21      0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]                                 |
|                     |         | 22      0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]                                 |
| tensorrt_llm_bls    | 1       | READY                                                                                                                              |
+---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+

I0224 03:28:28.720425 3081 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA A10"
I0224 03:28:28.759813 3081 metrics.cc:783] "Collecting CPU metrics"
I0224 03:28:28.760016 3081 tritonserver.cc:2598] 
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                          |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                         |
| server_version                   | 2.54.0                                                                                                                         |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared |
|                                  | _memory cuda_shared_memory binary_tensor_data parameters statistics trace logging                                              |
| model_repository_path[0]         | multimodal_ifb/                                                                                                                |
| model_control_mode               | MODE_NONE                                                                                                                      |
| strict_model_config              | 1                                                                                                                              |
| model_config_name                |                                                                                                                                |
| rate_limit                       | OFF                                                                                                                            |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                      |
| cuda_memory_pool_byte_size{0}    | 300000000                                                                                                                      |
| min_supported_compute_capability | 6.0                                                                                                                            |
| strict_readiness                 | 1                                                                                                                              |
| exit_timeout                     | 30                                                                                                                             |
| cache_enabled                    | 0                                                                                                                              |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+

I0224 03:28:28.760047 3081 server.cc:305] "Waiting for in-flight requests to complete."
I0224 03:28:28.760062 3081 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0224 03:28:28.760969 3081 server.cc:336] "All models are stopped, unloading models"
I0224 03:28:28.760980 3081 server.cc:345] "Timeout 30: Found 4 live models and 0 in-flight non-inference requests"
I0224 03:28:29.761084 3081 server.cc:345] "Timeout 29: Found 4 live models and 0 in-flight non-inference requests"
[02/24/2025-03:28:29] [TRT-LLM] [I] Cleaning up...
Cleaning up...
Cleaning up...
Cleaning up...
I0224 03:28:30.152836 3081 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm_bls' version 1"
I0224 03:28:30.761212 3081 server.cc:345] "Timeout 28: Found 3 live models and 0 in-flight non-inference requests"
I0224 03:28:31.033774 3081 model_lifecycle.cc:636] "successfully unloaded 'preprocessing' version 1"
I0224 03:28:31.091577 3081 model_lifecycle.cc:636] "successfully unloaded 'postprocessing' version 1"
I0224 03:28:31.406108 3081 model_lifecycle.cc:636] "successfully unloaded 'multimodal_encoders' version 1"
I0224 03:28:31.761284 3081 server.cc:345] "Timeout 27: Found 0 live models and 0 in-flight non-inference requests"
error: creating server: Internal - failed to load all models
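The assertion here is the mirror image of the one in the original report: the tensorrt_llm model in this Triton setup is decoder-only, so crossKvCacheFraction must not be set, whereas the Whisper encoder-decoder engines require it. A minimal sketch of the rule both assertions enforce, using the executor bindings; the import path and keyword names are assumed from TensorRT-LLM 0.14+ and should be checked against the installed version. With the Triton backend the value typically comes from the cross_kv_cache_fraction field in the tensorrt_llm model's config.pbtxt, which should be left empty for decoder-only engines:

import tensorrt_llm.bindings.executor as trtllm

# Hedged sketch (not the backend's actual code): how the KV cache config should
# differ between the two model types.
def make_kv_cache_config(is_encoder_decoder: bool) -> trtllm.KvCacheConfig:
    if is_encoder_decoder:
        # e.g. Whisper: part of the paged KV cache is reserved for cross attention,
        # otherwise "Must set crossKvCacheFraction for encoder-decoder model" is raised.
        return trtllm.KvCacheConfig(free_gpu_memory_fraction=0.9,
                                    cross_kv_cache_fraction=0.5)
    # Decoder-only engines must leave it unset, otherwise
    # "Do not set crossKvCacheFraction for decoder-only model" is raised.
    return trtllm.KvCacheConfig(free_gpu_memory_fraction=0.9)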

@frederikdangel

@HPUedCSLearner
Did you find a solution?
