---

comments: true
description: Learn how to integrate Ultralytics YOLOv8 with NVIDIA Triton Inference Server for scalable, high-performance AI model deployment.
keywords: Triton Inference Server, YOLOv8, Ultralytics, NVIDIA, deep learning, AI model deployment, ONNX, scalable inference
---


# Triton Inference Server with Ultralytics YOLOv8

The [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimized for NVIDIA GPUs. Triton simplifies the deployment of AI models at scale in production. Integrating Ultralytics YOLOv8 with Triton Inference Server allows you to deploy scalable, high-performance deep learning inference workloads. This guide provides steps to set up and test the integration.

<p align="center">
  <br>
  <iframe loading="lazy" width="720" height="405" src="https://www.youtube.com/embed/NQDtfSi5QF4"
    title="Getting Started with NVIDIA Triton Inference Server" frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    allowfullscreen>
  </iframe>
  <br>
  <strong>Watch:</strong> Getting Started with NVIDIA Triton Inference Server.
</p>

## What is Triton Inference Server?

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and many others. Its primary use cases are:

- Serving multiple models from a single server instance (see the short client sketch after this list).
- Dynamic model loading and unloading without server restart.
- Ensemble inference, allowing multiple models to be chained together to produce a combined result.
- Model versioning for A/B testing and rolling updates.
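
The sketch below shows how these capabilities can be queried from Python with the `tritonclient` HTTP API. It assumes a Triton server is already running and reachable at `localhost:8000`; adjust the URL to match your deployment.

```python
from tritonclient.http import InferenceServerClient

# Connect to a Triton server assumed to be listening on localhost:8000
client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

# Check that the server process is up and able to accept inference requests
print("Server live: ", client.is_server_live())
print("Server ready:", client.is_server_ready())

# List every model known to the model repository, together with its state
for model in client.get_model_repository_index():
    print(model["name"], model.get("state", "UNKNOWN"))
```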

## Prerequisites

Ensure you have the following prerequisites before proceeding:

- Docker installed on your machine.
- Install `tritonclient`:
    ```bash
    pip install tritonclient[all]
    ```


## Exporting YOLOv8 to ONNX Format

Before deploying the model on Triton, it must be exported to the ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the `export` function from the `YOLO` class:

```python
from ultralytics import YOLO

# Load a model
model = YOLO("yolov8n.pt")  # load an official model

# Export the model to ONNX format with dynamic input shapes
onnx_file = model.export(format="onnx", dynamic=True)
```
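
Optionally, you can sanity-check the exported file before placing it in the model repository. This is a minimal sketch assuming the `onnx` package is available in your environment:

```python
import onnx

# Load the exported model and run ONNX's built-in structural checks
onnx_model = onnx.load(onnx_file)
onnx.checker.check_model(onnx_model)

# Print the input signature to confirm the dynamic axes requested at export time
for inp in onnx_model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)
```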

## Setting Up Triton Model Repository

The Triton Model Repository is a storage location where Triton can access and load models.

1. Create the necessary directory structure:

    ```python
    from pathlib import Path

    # Define paths
    model_name = "yolo"
    triton_repo_path = Path("tmp") / "triton_repo"
    triton_model_path = triton_repo_path / model_name

    # Create directories
    (triton_model_path / "1").mkdir(parents=True, exist_ok=True)
    ```


2. Move the exported ONNX model to the Triton repository and create a placeholder config file (an explicit configuration is sketched after these steps):

    ```python
    from pathlib import Path

    # Move ONNX model to Triton Model path
    Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")

    # Create config file
    (triton_model_path / "config.pbtxt").touch()
    ```
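
The empty `config.pbtxt` works because Triton can auto-complete the model configuration from the ONNX file itself. If you prefer to set the configuration explicitly, the sketch below writes a minimal one; the field values are illustrative assumptions and should be matched to your exported model.

```python
# Illustrative only: replace the empty config file with an explicit Triton config.
# The values below are assumptions; adjust them to match your exported model.
(triton_model_path / "config.pbtxt").write_text(
    'name: "yolo"\n'  # must match the model directory name
    'platform: "onnxruntime_onnx"\n'  # serve the model with the ONNX Runtime backend
    "max_batch_size: 0\n"  # batching is handled by the model's own dynamic axes
)
```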


## Running Triton Inference Server

Run the Triton Inference Server using Docker:

```python
import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:23.09-py3"  # 6.4 GB

# Pull the image
subprocess.call(f"docker pull {tag}", shell=True)

# Run the Triton server and capture the container ID
container_id = (
    subprocess.check_output(
        f"docker run -d --rm -v {triton_repo_path}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

# Create an HTTP client for the Triton server
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

# Poll once per second (up to 10 seconds) until the model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
```

Then run inference using the Triton Server model:

```python
from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
```
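
The call returns standard Ultralytics `Results` objects, so post-processing works the same as with a locally loaded model. A brief sketch (the image path above is a placeholder):

```python
# Iterate over the returned Results objects and print each detection
for result in results:
    print(f"{len(result.boxes)} objects detected")
    for box in result.boxes:
        # Bounding box in xyxy format, confidence score, and class index
        print(box.xyxy.tolist(), float(box.conf), int(box.cls))
```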

Clean up the container when you are finished:

```python
# Kill and remove the container at the end of the test
subprocess.call(f"docker kill {container_id}", shell=True)
```
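
If you also want to discard the temporary model repository created earlier, an optional sketch:

```python
import shutil

# Remove the temporary Triton model repository created in this guide
shutil.rmtree(triton_repo_path, ignore_errors=True)
```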

---

By following the above steps, you can deploy and run Ultralytics YOLOv8 models efficiently on Triton Inference Server, providing a scalable and high-performance solution for deep learning inference tasks. If you face any issues or have further queries, refer to the [official Triton documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html) or reach out to the Ultralytics community for support.

## FAQ

### How do I set up Ultralytics YOLOv8 with NVIDIA Triton Inference Server?

Setting up [Ultralytics YOLOv8](https://docs.ultralytics.com/models/yolov8) with [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) involves a few key steps:

1. **Export YOLOv8 to ONNX format**:

    ```python
    from ultralytics import YOLO

    # Load a model
    model = YOLO("yolov8n.pt")  # load an official model

    # Export the model to ONNX format
    onnx_file = model.export(format="onnx", dynamic=True)
    ```


2. **Set up Triton Model Repository**:

    ```python
    from pathlib import Path

    # Define paths
    model_name = "yolo"
    triton_repo_path = Path("tmp") / "triton_repo"
    triton_model_path = triton_repo_path / model_name

    # Create directories, move the ONNX model, and create an empty config file
    (triton_model_path / "1").mkdir(parents=True, exist_ok=True)
    Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
    (triton_model_path / "config.pbtxt").touch()
    ```


3. **Run the Triton Server**:

    ```python
    import contextlib
    import subprocess
    import time

    from tritonclient.http import InferenceServerClient

    # Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
    tag = "nvcr.io/nvidia/tritonserver:23.09-py3"

    # Pull the image
    subprocess.call(f"docker pull {tag}", shell=True)

    # Run the Triton server and capture the container ID
    container_id = (
        subprocess.check_output(
            f"docker run -d --rm -v {triton_repo_path}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
            shell=True,
        )
        .decode("utf-8")
        .strip()
    )

    # Create an HTTP client and poll until the model is ready
    triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

    for _ in range(10):
        with contextlib.suppress(Exception):
            assert triton_client.is_model_ready(model_name)
            break
        time.sleep(1)
    ```


This setup can help you efficiently deploy YOLOv8 models at scale on Triton Inference Server for high-performance AI model inference.

### What benefits does using Ultralytics YOLOv8 with NVIDIA Triton Inference Server offer?

Integrating [Ultralytics YOLOv8](../models/yolov8.md) with [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) provides several advantages:

- **Scalable AI Inference**: Triton allows serving multiple models from a single server instance, supporting dynamic model loading and unloading, making it highly scalable for diverse AI workloads.
- **High Performance**: Optimized for NVIDIA GPUs, Triton Inference Server ensures high-speed inference operations, perfect for real-time applications such as object detection.
- **Ensemble and Model Versioning**: Triton's ensemble mode enables combining multiple models to improve results, and its model versioning supports A/B testing and rolling updates.

For detailed instructions on setting up and running YOLOv8 with Triton, you can refer to the [setup guide](#setting-up-triton-model-repository).

### Why should I export my YOLOv8 model to ONNX format before using Triton Inference Server?

Using ONNX (Open Neural Network Exchange) format for your [Ultralytics YOLOv8](../models/yolov8.md) model before deploying it on [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) offers several key benefits:

- **Interoperability**: The ONNX format allows models to be transferred between different deep learning frameworks (such as PyTorch and TensorFlow), ensuring broader compatibility.
- **Optimization**: Many deployment environments, including Triton, optimize for ONNX, enabling faster inference and better performance.
- **Ease of Deployment**: ONNX is widely supported across frameworks and platforms, simplifying the deployment process in various operating systems and hardware configurations.

To export your model, use:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
onnx_file = model.export(format="onnx", dynamic=True)
```

You can follow the steps in the [exporting guide](../modes/export.md) to complete the process.

### Can I run inference using the Ultralytics YOLOv8 model on Triton Inference Server?

Yes, you can run inference using the [Ultralytics YOLOv8](../models/yolov8.md) model on [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server). Once your model is set up in the Triton Model Repository and the server is running, you can load and run inference on your model as follows:

```python
from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
```

For an in-depth guide on setting up and running Triton Server with YOLOv8, refer to the [Running Triton Inference Server](#running-triton-inference-server) section.

### How does Ultralytics YOLOv8 compare to TensorFlow and PyTorch models for deployment?

[Ultralytics YOLOv8](https://docs.ultralytics.com/models/yolov8) offers several unique advantages compared to TensorFlow and PyTorch models for deployment:

- **Real-time Performance**: Optimized for real-time object detection tasks, YOLOv8 provides state-of-the-art accuracy and speed, making it ideal for applications requiring live video analytics.
- **Ease of Use**: YOLOv8 integrates seamlessly with Triton Inference Server and supports diverse export formats (ONNX, TensorRT, CoreML), making it flexible for various deployment scenarios.
- **Advanced Features**: Deployed behind Triton, YOLOv8 benefits from features like dynamic model loading, model versioning, and ensemble inference, which are crucial for scalable and reliable AI deployments.

For more details, compare the deployment options in the [model deployment guide](../modes/export.md).