NVIDIA Triton Inference Server
The provided texts come from the official website.
NVIDIA Triton™, an open-source inference serving software, standardizes AI model deployment and execution and delivers fast and scalable AI in production. Triton is part of NVIDIA AI Enterprise, an NVIDIA software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.
How to Put AI Models into Production
NVIDIA Triton, also known as NVIDIA Triton Inference Server, streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained ML or DL models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the right framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-prem, edge, and embedded devices.
Achieve High-Throughput Inference
Triton executes multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, Triton automatically creates an instance of each model on each GPU to increase utilization.
It also optimizes serving for real-time inferencing under strict latency constraints with dynamic batching, supports batch inferencing to maximize GPU and CPU utilization, and includes built-in support for audio and video streaming input. Triton supports model ensemble for use cases that require a pipeline of multiple models with pre- and postprocessing to perform end-to-end inference, such as conversational AI.
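For illustration, dynamic batching is enabled per model in its config.pbtxt (a full config example appears later in this page); a minimal sketch, with example values rather than recommendations, might look like this:
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}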
Models can be updated live in production without restarting Triton or the application. Triton enables multi-GPU, multi-node inference on very large models that cannot fit in a single GPUβs memory.
Triton supports a set of well-known backends, such as TensorRT, ONNX Runtime, TensorFlow, PyTorch, OpenVINO, Python, and FIL, but it is also extensible and allows you to develop your own backend.
How to run it
NVIDIA Triton™ allows multiple models and/or multiple instances of the same model to execute in parallel on the same system. The system may have zero, one, or many GPUs.
Don't forget to mount the data volume with your models at the start.
The amount of CPU and RAM depends on the type and quantity of the chosen GPU.
As a rule of thumb, provision at least RAM ≥ sum(vRAM) + 2 GB; for example, a single A5000 with 24 GB of vRAM calls for at least 26 GB of RAM, which is what the example below uses.
Remember, this is only a recommendation: you can always start small and grow with your problem.
For simple inference, you probably won't need more than one A5000 GPU.
The Triton Inference Server image is only available to nvcr.io users. To gain access, you must first create an access key for the NVIDIA repository and save it in CGC.
$ cgc compute create --name <name> -c <cpu_cores> -m <RAM GiB> -g <gpu_count> -gt <gpu_type> nvidia-triton --repository-secret <secret_name>
Next, you have to mount your model repository. The volume should be mounted with the full-path flag set to the /models dir.
$ cgc volume mount <model_repo_name> -t <triton_name> -fp /models
Model repository
The Triton Inference Server expects models to be provided in a specific layout. Your model-repository volume should follow this structure:
models-repository/
└── yolov8n
    ├── 1
    │   └── model.plan
    └── config.pbtxt
Where:
- yolov8n - the name of your model
- 1 - the model version
- model.plan - the engine created with TensorRT. Check out supported backends here. For a guide on how to prepare an engine for your model, visit our Use cases section (a rough sketch follows below).
- config.pbtxt - the configuration of your model. See example configuration here.
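As a rough sketch only (the Use cases section has the full guide), a TensorRT engine can be built from an ONNX export of your model with the trtexec tool that ships with TensorRT; the file names here are placeholders:
$ trtexec --onnx=yolov8n.onnx --saveEngine=model.plan
Keep in mind that a TensorRT engine is tied to the GPU it was built on, so build it on the same GPU type you will deploy to.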
When preparing the model-repository volume using a Jupyter notebook, make sure there are no additional directories such as .Trash-0 and .ipynb_checkpoints. Remove these directories before mounting the repository into the Triton Inference Server to prevent potential issues.
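For example, from a terminal opened in the directory where the volume is mounted, a cleanup along these lines should do:
$ rm -rf .Trash-0
$ find . -type d -name .ipynb_checkpoints -exec rm -rf {} +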
Example
To see extended instructions, please visit our Use cases section.
Create model repository
First, you need to create and prepare the model repository. Start by creating a new volume.
$ cgc volume create -s 10 models-repository
Then put your models on your volume using filebrowser or one of our apps. You can mount this volume to as many apps as you want. To see the whole process, please visit here.
Config file
name: "yolov8n"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 928, 928 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 6, 17661 ]
  }
]
Run an instance
$ cgc compute create --name triton01 -c 2 -m 26 -g 1 -gt A5000 nvidia-triton
Next, mount the model repository created in the previous step. The volume should be mounted with the full-path flag set to the /models dir.
$ cgc volume mount models-repository -t triton01 -fp /models
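Before sending requests, you can check that the server is up and the model has loaded from /models. A small sketch using the HTTP client (replace <triton_url> with the address of your triton01 instance; Triton serves HTTP on port 8000 by default):
import tritonclient.http as httpclient

# Connect to the Triton instance started above
client = httpclient.InferenceServerClient(url="<triton_url>:8000")

# Both should print True once the server is running and yolov8n has loaded
print(client.is_server_ready())
print(client.is_model_ready("yolov8n"))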
Run inference
First, install tritonclient with pip:
!pip install tritonclient[http]
Then run inference with the installed client:
import tritonclient.http as httpclient

# Initialize client (replace <triton_url> with the address of your Triton instance)
triton_client = httpclient.InferenceServerClient(url="<triton_url>:8000", verbose=False, ssl=False)

# Prepare space for inputs and outputs (names and data type must match config.pbtxt)
inputs = [
    httpclient.InferInput("images", [*processed_image.shape], "FP32")
]
outputs = [httpclient.InferRequestedOutput("output0")]

# Insert inputs (processed_image is a numpy array produced by your preprocessing function)
inputs[0].set_data_from_numpy(processed_image)

# Inference
results = triton_client.infer(
    model_name="yolov8n", inputs=inputs, outputs=outputs
)
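The raw output tensor can then be read back from the response by the name declared in config.pbtxt, for example:
# Extract the raw output as a numpy array
detections = results.as_numpy("output0")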
You need pre- and postprocessing functions that match your model's inputs and outputs.
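As a minimal sketch of the preprocessing side, assuming a YOLOv8-style detector that expects RGB values scaled to [0, 1] in the [1, 3, 928, 928] layout from the config above (verify this against how your own engine was exported):
import cv2
import numpy as np

def preprocess(image_bgr: np.ndarray) -> np.ndarray:
    # Resize to the network input resolution from config.pbtxt
    resized = cv2.resize(image_bgr, (928, 928))
    # BGR -> RGB, HWC -> CHW, scale to [0, 1]
    chw = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB).transpose(2, 0, 1).astype(np.float32) / 255.0
    # Add the batch dimension expected by Triton
    return np.expand_dims(chw, axis=0)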