
NVIDIA Triton Inference Server

The provided texts come from the official website.

NVIDIA Triton™, an open-source inference serving software, standardizes AI model deployment and execution and delivers fast and scalable AI in production. Triton is part of NVIDIA AI Enterprise, an NVIDIA software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

How to Put AI Models into Production​

NVIDIA Triton, also known as NVIDIA Triton Inference Server, streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained ML or DL models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the right framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-prem, edge, and embedded devices.

Achieve High-Throughput Inference​

Triton executes multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, Triton automatically creates an instance of each model on each GPU to increase utilization.

It also optimizes serving for real-time inferencing under strict latency constraints with dynamic batching, supports batch inferencing to maximize GPU and CPU utilization, and includes built-in support for audio and video streaming input. Triton supports model ensemble for use cases that require a pipeline of multiple models with pre- and postprocessing to perform end-to-end inference, such as conversational AI.
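Both of these features are enabled per model in its config.pbtxt. A minimal sketch, with example values only (see the Triton model configuration documentation for the full set of options):

# Run two copies of this model on every available GPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

# Let Triton group individual requests into server-side batches
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}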

Models can be updated live in production without restarting Triton or the application. Triton enables multi-GPU, multi-node inference on very large models that cannot fit in a single GPU’s memory.

Triton supports a set of well-known backends, such as TensorRT, ONNX Runtime, TensorFlow, PyTorch, OpenVINO, Python, and FIL, but it is also extensible and lets you develop your own backend.
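The backend used for a given model is selected in that model's config.pbtxt. A minimal sketch, assuming the ONNX Runtime backend (the example later on this page uses platform: "tensorrt_plan" for a TensorRT engine instead):

# config.pbtxt: serve this model with the ONNX Runtime backend
backend: "onnxruntime"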


How to run it​

NVIDIA Triton™ allows multiple models and/or multiple instances of the same model to execute in parallel on the same system. The system may have zero, one, or many GPUs. Don't forget to mount the data volume with your models at the start.
The amount of CPU and RAM depends on the type and quantity of the chosen GPUs.
As a rule of thumb, RAM ≥ sum(vRAM) + 2 GB. For example, a single A5000 with 24 GB of vRAM calls for at least 26 GB of RAM, as in the example instance below. Remember, this is only a recommendation; you can always start small and grow with your problem.

For simple inference, you probably won't need more than one A5000 GPU.

Warning

The Triton Inference Server image is only available to nvcr.io users. To gain access, you must first create an access key for the NVIDIA repository and save it in CGC.

$ cgc compute create --name <name> -c <cpu_cores> -m <RAM GiB> -g <gpu_count> -gt <gpu_type> nvidia-triton --repository-secret <secret_name>
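For example, assuming the access key was saved in CGC under the (hypothetical) name nvcr-access-key, a single-A5000 instance matching the example later on this page could be created like this:

$ cgc compute create --name triton01 -c 2 -m 26 -g 1 -gt A5000 nvidia-triton --repository-secret nvcr-access-key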

Next, you have to mount your model repository. The volume should be mounted with the full-path flag set to the /models dir.

$ cgc volume mount <model_repo_name> -t <triton_name> -fp /models

Model repository​

The Triton Inference Server expects models to be provided in a specific layout. Your model-repository volume should follow the structure below (a shell sketch for creating it is shown after the list of components):

models-repository/
└── yolov8n
    ├── 1
    │   └── model.plan
    └── config.pbtxt

Where

  • yolov8n - the name of your model
  • 1 - the model version
  • model.plan - an engine created with TensorRT. Check out the supported backends here. For a guide on how to prepare an engine for your model, visit our Use cases section.
  • config.pbtxt - the configuration of your model. See an example configuration here
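
A minimal shell sketch for creating this layout, assuming model.plan and config.pbtxt already exist locally (the paths are illustrative):

$ mkdir -p models-repository/yolov8n/1
$ cp model.plan models-repository/yolov8n/1/
$ cp config.pbtxt models-repository/yolov8n/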
Caution

When preparing the model-repository volume using a Jupyter notebook, ensure the absence of additional directories such as .Trash0 and ipynb_checkpoints. Remove these directories before mounting the repository into the Triton Inference Server to prevent potential issues.

Example​

To see extended instructions, please visit our Use cases section.

Create model repository​

First, you need to create and prepare the model repository. Start by creating a new volume.

$ cgc volume create -s 10 models-repository

Then put your models on your volume using filebrowser or one of our apps. You can mount this volume to as many apps as you want. To see the whole process, please visit here.

Config file​

name: "yolov8n"
platform: "tensorrt_plan"
max_batch_size: 1
input [
{
name: "images"
data_type: TYPE_FP32
dims: [ 3, 928, 928 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ 6, 17661 ]
}
]

Run an instance​

$ cgc compute create --name triton01 -c 2 -m 26 -g 1 -gt A5000 nvidia-triton

Next, mount your model repository created in the previous step. The volume should be mounted with the full-path flag set to the /models dir.

$ cgc volume mount models-repository -t triton01 -fp /models
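
Triton serves its HTTP API on port 8000 by default. Once the volume is mounted and the server has loaded your models, you can check readiness with the standard health endpoints (replace <triton_host> with however you reach the instance in CGC):

$ curl -s -o /dev/null -w "%{http_code}\n" http://<triton_host>:8000/v2/health/ready
$ curl -s -o /dev/null -w "%{http_code}\n" http://<triton_host>:8000/v2/models/yolov8n/ready

A 200 response means the server (or the model) is ready to accept inference requests.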

Run inference​

First, install tritonclient with pip

!pip install tritonclient[http]

Then run inference with the installed client:

import numpy as np
import tritonclient.http as httpclient

# Initialize client (replace the URL with the address of your Triton instance)
triton_client = httpclient.InferenceServerClient(url="<triton_host>:8000", verbose=False, ssl=False)

# Preprocessed input as a NumPy array matching the model configuration
# (here: batch of 1, FP32, 3 x 928 x 928 for the example yolov8n model)
processed_image = np.zeros((1, 3, 928, 928), dtype=np.float32)

# Prepare space for inputs and outputs
inputs = [
    httpclient.InferInput("images", list(processed_image.shape), "FP32")
]
outputs = [httpclient.InferRequestedOutput("output0")]

# Insert inputs
inputs[0].set_data_from_numpy(processed_image)

# Inference
results = triton_client.infer(
    model_name="yolov8n", inputs=inputs, outputs=outputs
)
Info

You need pre- and postprocessing functions to match your model's inputs and outputs.
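
For the example yolov8n configuration above, the raw detections can be pulled out of the response like this (a minimal sketch; confidence filtering, NMS, and box decoding are model-specific and belong in your postprocessing code):

# Extract the raw output tensor declared in config.pbtxt
output = results.as_numpy("output0")  # shape (1, 6, 17661) for the example config
print(output.shape)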