ChatRTX APIs allow developers to seamlessly integrate their applications with the TensorRT-LLM powered inference engine and use the various AI models supported by ChatRTX. This integration enables developers to incorporate advanced AI inference and RAG features into their applications. These APIs serve as the foundation for the ChatRTX application. Please refer to the API Examples for examples of using the APIs.
- NVIDIA NIM: The ChatRTX APIs enable the use of NVIDIA NIMs for LLMs on supported GPUs.
- TensorRT-LLM Inference Backend: The ChatRTX APIs enable the use of the TensorRT-LLM inference backend, allowing for efficient and optimized AI model performance.
- Download and Build TensorRT-LLM Checkpoints: With these APIs, you can download TensorRT-LLM checkpoints from NGC (NVIDIA GPU Cloud), build the TRT-LLM engine, and provide the necessary infrastructure to run inference and RAG with various AI models.
- Streaming and Non-Streaming Inference APIs: Supports both streaming and non-streaming inference, offering flexibility depending on the application's requirements (see the sketch after the supported-models table below).
- RAG with Llama Index: Provides a TRT-LLM connector for using TRT-LLM as the inference backend for RAG, and includes a basic RAG pipeline built with Llama Index and TRT-LLM. With the Llama Index TRT-LLM connector, developers are free to add their own high-level RAG features.
Models supported:

| Model | Supported GPUs |
| --- | --- |
| LlaMa 3.1 8B NIM | RTX 6000 Ada; RTX 4080, 4090, 5080, 5090 |
| RIVA Parakeet 0.6B NIM (for supporting voice input) | RTX 6000 Ada; RTX 4080, 4090, 5080, 5090 |
| CLIP (for images) | RTX 6000 Ada; RTX 3xxx, RTX 4xxx, RTX 5080, RTX 5090 |
| Whisper Medium (for supporting voice input) | RTX 6000 Ada; RTX 3xxx and RTX 4xxx series GPUs with at least 8 GB of GPU memory |
| Mistral 7B | RTX 6000 Ada; RTX 3xxx and RTX 4xxx series GPUs with at least 8 GB of GPU memory |
| ChatGLM3 6B | RTX 6000 Ada; RTX 3xxx and RTX 4xxx series GPUs with at least 8 GB of GPU memory |
| LLaMa 2 13B | RTX 6000 Ada; RTX 3xxx and RTX 4xxx series GPUs with at least 16 GB of GPU memory |
| Gemma 7B | RTX 6000 Ada; RTX 3xxx and RTX 4xxx series GPUs with at least 16 GB of GPU memory |
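For illustration, here is a minimal sketch of the difference between non-streaming and streaming inference, run against a locally hosted Llama 3.1 NIM. It assumes the NIM exposes the standard OpenAI-compatible API on localhost port 8000 and serves the model id `meta/llama-3.1-8b-instruct`; both are assumptions, so check your NIM deployment for the actual values, and see nim_inference.py later in this section for the SDK's own usage.

```python
# Sketch: non-streaming vs. streaming chat completions against a local
# Llama 3.1 NIM. The base_url and model id are assumptions -- verify them
# against your NIM deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
model = "meta/llama-3.1-8b-instruct"  # assumed model id
prompt = [{"role": "user", "content": "What is TensorRT-LLM?"}]

# Non-streaming: block until the full answer is generated.
resp = client.chat.completions.create(model=model, messages=prompt)
print(resp.choices[0].message.content)

# Streaming: consume tokens as they are produced.
stream = client.chat.completions.create(model=model, messages=prompt, stream=True)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```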
If you are using the ChatRTX installer, the steps below can be skipped, as setup is done by the installer. If you are not using the installer, manually install the components using the following steps:
- Run NIM Setup from here.
- Install Python 3.10.11.
- Download and install Microsoft MPI. You will be prompted to choose between an exe, which installs the MPI executable, and an msi, which installs the MPI SDK. Download and install both.
- Download 'dependencies.zip' from here. Extract the zip, then download the wheels from the steps below into the same 'dependencies' folder alongside the extracted wheels.
- Skip this step if using an RTX 5xxx series GPU. Otherwise, download the TensorRT-LLM wheel ('tensorrt_llm-0.9.0-cp310-cp310-win_amd64.whl') from here.
- Download the ChatRTX API SDK ('ChatRTX-0.5.0-py3-none-any.whl') from here.
- Open a command prompt in the 'dependencies' folder containing the downloaded wheels and run the commands in the steps that follow.
- Skip this step if using an RTX 5xxx series GPU. For all other GPUs, run the following commands:

  ```
  python -m pip install tensorrt-bindings==9.3.0.post12.dev1 tensorrt-libs==9.3.0.post12.dev1 --no-index --find-links .
  python -m pip install tensorrt==9.3.0.post12.dev1 --no-index --find-links .
  python -m pip install tensorrt_llm-0.9.0-cp310-cp310-win_amd64.whl --no-index --find-links .
  ```
- Then run the following commands:

  ```
  python -m pip install ngcsdk-3.57.0-py3-none-any.whl --no-index --find-links .
  python -m pip install ChatRTX-0.5.0-py3-none-any.whl --no-index --find-links .
  python -m pip install grpcio==1.67.1 --no-index --find-links .
  ```
- Only if using an RTX 5xxx series GPU, run the following commands:

  ```
  python -m pip install torch-2.6.0+cu128.nv-cp310-cp310-win_amd64.whl --no-index --find-links .
  python -m pip install torchvision-0.20.0a0+cu128.nv-cp310-cp310-win_amd64.whl --no-index --find-links .
  ```
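To sanity-check the installation from the same command prompt, you can confirm the SDK wheel is registered with pip and, on GPUs where the TensorRT-LLM wheel was installed, run the same version check used in the troubleshooting step at the end of this section:

```
python -m pip show ChatRTX
python -c "import tensorrt_llm; print(tensorrt_llm._utils.trt_version())"
```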
Note: This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
There are example files in the ./example directory that demonstrate how to use the APIs for different models and features:
- nim_inference.py: Demonstrates how to set up and run NIMs like Llama 3.1 on Windows.
- nim_rag.py: Demonstrates how to enable the RAG pipeline with the NIM LLM connector using the Llama Index framework.
- inference.py: Demonstrates how to set up and run an inference pipeline for LLM models like Llama, Mistral, Gemma, and ChatGLM using TRT-LLM natively on Windows.
- inference_streaming.py: Shows how to use the APIs to enable the streaming feature for inference.
- rag.py: Demonstrates how to enable the RAG pipeline with the TRT-LLM connector using the Llama Index framework (a rough sketch follows this list).
- clip.py: Provides examples of how to use the CLIP model with the provided APIs.
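For orientation, the sketch below shows the general shape of such a Llama Index RAG pipeline. The TRT-LLM connector import path and arguments are assumptions, as are the document folder, engine path, and embedding model; also note that newer Llama Index releases configure models via Settings as shown here, while older ones use ServiceContext. Treat rag.py as the authoritative reference.

```python
# Sketch of a basic RAG pipeline with Llama Index and a TRT-LLM connector.
# The TrtLlmAPI import path and its arguments are assumptions; see
# ./example/rag.py for the actual usage shipped with the SDK.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from ChatRTX.trt_llama_api import TrtLlmAPI  # assumed import path

# Route all LLM calls through the TRT-LLM engine; embed documents locally.
Settings.llm = TrtLlmAPI(model_path="C:\\chatrtx_models\\mistral_7b")  # hypothetical path
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Index a folder of documents, then answer a question grounded in them.
documents = SimpleDirectoryReader("./my_docs").load_data()  # hypothetical folder
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize the key points of these documents."))
```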
- After installation, if you encounter the error "No module named 'tensorrt_bindings'" or "No module named 'tensorrt'", use the commands below to reinstall TensorRT:

  ```
  python -m pip uninstall -y tensorrt
  python -m pip install --pre --extra-index-url https://pypi.nvidia.com/ tensorrt==9.3.0.post12.dev1 --no-cache-dir
  ```

  Then check:

  ```
  python -c "import tensorrt_llm; print(tensorrt_llm._utils.trt_version())"
  ```
- NVIDIA NIMs consume several gigabytes of disk space. To reclaim this storage space, follow the instructions to delete NIMs.