
🐛 [Bug] Loading Torch-TensorRT models (.ts) on multiple GPUs (in TorchServe) #1888


Open
emilwallner opened this issue May 5, 2023 · 12 comments
Labels
bug Something isn't working

@emilwallner

Bug Description

Everything works well when I'm using 1 GPU, but as soon as I try to load a model on 4 separate GPUs, I get this error:

MODEL_LOG - RuntimeError: [Error thrown at core/runtime/TRTEngine.cpp:42] Expected most_compatible_device to be true but got false
MODEL_LOG - No compatible device was found for instantiating TensorRT engine

To Reproduce

Steps to reproduce the behavior:

Create a (.ts) model and load it on 4 different GPUs. I don't know whether this is specific to TorchServe or a general issue.

Here's the simple version (TorchServe Handler):

def initialize(self, ctx):
    properties = ctx.system_properties
    # TorchServe passes the worker's assigned GPU index via system_properties
    self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
    self.model = torch.jit.load('model.ts')

I'm not sure whether it relates to this issue. From what I can tell, it seems I need to restrict the CUDA context; however, the GPU is already assigned in the handler. I tried the following, but it still gives me the same problem.

def initialize(self, ctx):
    properties = ctx.system_properties
    self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
    # Pin both the PyTorch and Torch-TensorRT runtimes to this worker's GPU
    torch.cuda.set_device(self.device)
    torch_tensorrt.set_device(int(properties.get("gpu_id")))

    # Load the compiled model inside that GPU's CUDA context
    with torch.cuda.device(int(properties.get("gpu_id"))):
        self.model = torch.jit.load('model.ts')
        self.model.to(self.device)
        self.model.eval()

I also tried mapping the model straight to the GPU on load, but with the same problem.
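For reference, this is roughly what I mean by mapping on load (a minimal sketch; map_location is the standard torch.jit.load argument, the rest matches the handler above):

def initialize(self, ctx):
    properties = ctx.system_properties
    self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
    # Map the serialized TorchScript module directly onto this worker's GPU
    self.model = torch.jit.load('model.ts', map_location=self.device)
    self.model.eval()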

Expected behavior

The .ts model should load onto the specified GPU ID without any issues.

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

Official PyTorch image: nvcr.io/nvidia/pytorch:22.12-py3
GPUs: 4x NVIDIA A10G
Ubuntu 20.04 including Python 3.8
PyTorch 1.14.0a0+410ce96
NVIDIA CUDA® 11.8.0
NVIDIA cuBLAS 11.11.3.6
NVIDIA cuDNN 8.7.0.84
NVIDIA NCCL 2.15.5 (optimized for NVIDIA NVLink®)
NVIDIA RAPIDS™ 22.10.01 (for x86, only these libraries are included: cudf, xgboost, rmm, cuml, and cugraph)
Apex
rdma-core 36.0
NVIDIA HPC-X 2.13
OpenMPI 4.1.4+
GDRCopy 2.3
TensorBoard 2.9.0
Nsight Compute 2022.3.0.0
Nsight Systems 2022.4.2.1
NVIDIA TensorRT™ 8.5.1
Torch-TensorRT 1.1.0a0
NVIDIA DALI® 1.20.0
MAGMA 2.6.2
JupyterLab 2.3.2 including Jupyter-TensorBoard
TransformerEngine 0.3.0

Additional context

@emilwallner emilwallner added the bug Something isn't working label May 5, 2023
@narendasan
Collaborator

@gs-olive can you try to replicate this?

@gs-olive
Collaborator

gs-olive commented May 9, 2023

Hello - I tried the following minimal example to reproduce the error:

  • Compile resnet18 on GPU 0
  • Load two instances of the same saved model (one on GPU 0, another on GPU 1 which is the same type)
  • Run inference with both

While I was unable to reproduce the exact error as described, I did notice that the compiled model would only return results stored on GPU 0 (the GPU index it was compiled with), and not on other GPUs of the same type at other indices. This is an issue on our end, which I am looking into. As a temporary workaround, it might make sense to recompile the model for each unique GPU ID, saving the results as "model_gpu0.ts", "model_gpu1.ts", ..., and to see if this resolves the issue.
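A rough sketch of that workaround (assuming the TorchScript frontend and one torch_tensorrt.compile call per GPU; the input shape, precision, and file names are placeholders):

import torch
import torch_tensorrt

# Original (non-TRT) TorchScript model; the filename here is hypothetical
model = torch.jit.load("model_fp32.ts").eval()

for gpu_id in range(torch.cuda.device_count()):
    with torch.cuda.device(gpu_id):
        model_gpu = model.to(f"cuda:{gpu_id}")
        trt_model = torch_tensorrt.compile(
            model_gpu,
            inputs=[torch_tensorrt.Input((1, 3, 224, 224))],  # placeholder shape
            enabled_precisions={torch.float},
            device=torch_tensorrt.Device(gpu_id=gpu_id),
        )
        torch.jit.save(trt_model, f"model_gpu{gpu_id}.ts")

In the TorchServe handler, each worker would then load the file matching its properties.get("gpu_id").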

I will also continue trying to reproduce the Expected most_compatible_device to be true but got false error.

@emilwallner
Author

Very much appreciate you looking into this and thanks for the suggested workaround! 🙌

@NothingToSay99

NothingToSay99 commented May 17, 2023

I ran into the same issue!
With the same environment, I created the model on an NVIDIA A100 and then loaded it on an NVIDIA 3090, and the error
"RuntimeError: [Error thrown at core/runtime/TRTEngine.cpp:42] Expected most_compatible_device to be true but got false
No compatible device was found for instantiating TensorRT engine"
came up.

Both the A100 and the 3090 are based on the Ampere architecture.

@emilwallner
Author

For further context, I used the same Docker image (nvcr.io/nvidia/pytorch:22.12-py3) to compile and run the model, but it was compiled on an Ampere RTX A6000 and run on an A10. As mentioned earlier, it worked well with one GPU, but not in a multi-GPU configuration.

@gs-olive
Collaborator

Thank you both for the follow-up. After corresponding with @narendasan on this, compiling the model on an A100 and instantiating it on a 3090 is an issue because of the difference in compute capability (the A100 has Compute Capability 8.0 and the 3090 has Compute Capability 8.6, source).

As of TensorRT 8.6, there is newly added support for Hardware Compatibility, which should resolve this issue once we add support for the feature in Torch-TensorRT. There is a feature request already filed for this: #1929.

@NothingToSay99

NothingToSay99 commented May 18, 2023

Thanks for your reply,
looking forward to your work!

@github-actions

This issue has not seen activity for 90 days. Remove the stale label or add a comment, or this issue will be closed in 10 days.

@gs-olive
Collaborator

Hello - as an update on this issue, we recently added #2325 to main, which addresses compiling the model on one GPU and loading it on a different GPU (or multiple GPUs) of the same kind. This PR was intended to fix cases where the model would always load to GPU 0. The feature adding hardware compatibility support (build on one GPU, run on a variety) is still planned for implementation in #1929.

@emilwallner
Author

Excellent, thanks for the hard work and update!

@gs-olive
Collaborator

Hello - we recently added #2445, which enables the hardware_compatibility feature for TRT engines generated with ir="torch_compile" or ir="dynamo". If you are able to test multi-GPU usage with hardware_compatible=True and ir="dynamo" (which also allows serialization via TorchScript), it would be much appreciated.
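A minimal sketch of what such a test might look like (the model, input shape, and file name are placeholders, and the exact serialization step may differ by release):

import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18().eval().cuda()
example_input = torch.randn(1, 3, 224, 224).cuda()  # placeholder input

# Compile through the dynamo path with hardware compatibility enabled
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[example_input],
    hardware_compatible=True,
)

# Serialize via TorchScript so the result can be loaded with torch.jit.load on other GPUs
trt_ts = torch.jit.trace(trt_model, example_input)
torch.jit.save(trt_ts, "model_hc.ts")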

@emilwallner
Author

Thanks @gs-olive!! I'm currently low on bandwidth, but I'll give this a spin for my next model!
