Emile Courthoud committed
Commit 0c01441 · 1 Parent(s): 3314500

feat(demo): add nebullvm inference accelerator (#1545)

README.md CHANGED
@@ -210,12 +210,12 @@ python -m yolox.tools.eval -n yolox-s -c yolox_s.pth -b 1 -d 1 --conf 0.001 --f
 ## Deployment


-1. [MegEngine in C++ and Python](./demo/MegEngine)
-2. [ONNX export and an ONNXRuntime](./demo/ONNXRuntime)
-3. [TensorRT in C++ and Python](./demo/TensorRT)
-4. [ncnn in C++ and Java](./demo/ncnn)
-5. [OpenVINO in C++ and Python](./demo/OpenVINO)
-
+1. [MegEngine in C++ and Python](./demo/MegEngine)
+2. [ONNX export and an ONNXRuntime](./demo/ONNXRuntime)
+3. [TensorRT in C++ and Python](./demo/TensorRT)
+4. [ncnn in C++ and Java](./demo/ncnn)
+5. [OpenVINO in C++ and Python](./demo/OpenVINO)
+6. [Accelerate YOLOX inference with nebullvm in Python](./demo/nebullvm)

 ## Third-party resources
 * YOLOX for streaming perception: [StreamYOLO (CVPR 2022 Oral)](https://github.com/yancie-yjr/StreamYOLO)
demo/nebullvm/README.md ADDED
@@ -0,0 +1,95 @@
# **Accelerate YOLOX inference with nebullvm in Python**

This document shows how to accelerate YOLOX inference with nebullvm.

[nebullvm](https://github.com/nebuly-ai/nebullvm) is an open-source library designed to accelerate AI inference of deep learning models in a few lines of code. nebullvm leverages state-of-the-art model optimization techniques, such as deep learning compilers (TensorRT, OpenVINO, ONNX Runtime, TVM, TF Lite, DeepSparse, etc.) and various quantization and compression strategies, to achieve the maximum physically possible acceleration on the user's hardware.

## Benchmarks
Below are the results of the nebullvm optimization on YOLOX without loss of accuracy.
For each model-hardware pairing, latency was measured as the average over 100 predictions. The tests were run on an Nvidia Tesla T4 (g4dn.xlarge) and Intel Xeon Scalable CPUs (m6i.24xlarge and c6i.12xlarge) on AWS.

| Model   | Hardware     | Unoptimized (ms) | nebullvm optimized (ms) | Speedup |
|---------|--------------|------------------|-------------------------|---------|
| YOLOX-s | g4dn.xlarge  | 13.6             | 9.0                     | 1.5x    |
| YOLOX-s | m6i.24xlarge | 32.7             | 8.8                     | 3.7x    |
| YOLOX-s | c6i.12xlarge | 34.4             | 12.4                    | 2.8x    |
| YOLOX-m | g4dn.xlarge  | 24.2             | 22.4                    | 1.1x    |
| YOLOX-m | m6i.24xlarge | 55.1             | 36.0                    | 1.5x    |
| YOLOX-m | c6i.12xlarge | 62.5             | 26.9                    | 2.3x    |
| YOLOX-l | g4dn.xlarge  | 84.4             | 80.5                    | 1.0x    |
| YOLOX-l | m6i.24xlarge | 88.0             | 33.7                    | 2.6x    |
| YOLOX-l | c6i.12xlarge | 102.8            | 54.2                    | 1.9x    |
| YOLOX-x | g4dn.xlarge  | 87.3             | 34.0                    | 2.6x    |
| YOLOX-x | m6i.24xlarge | 134.5            | 56.6                    | 2.4x    |
| YOLOX-x | c6i.12xlarge | 162.0            | 95.4                    | 1.7x    |

## Steps to accelerate YOLOX with nebullvm
1. Download a YOLOX model from the original [README](https://github.com/Megvii-BaseDetection/YOLOX).
2. Optimize YOLOX with nebullvm.
3. Perform inference and compare the latency of the optimized model with that of the original model.

[Here](nebullvm_optimization.py) you can find the full demo in Python.

First, let's install nebullvm. The simplest way is with pip:
```
pip install nebullvm
```
Now, let's download one of the YOLOX models and optimize it with nebullvm.

```python
import torch

from nebullvm.api.functions import optimize_model
from yolox.exp import get_exp

# Build the YOLOX model
exp = get_exp(None, 'yolox-s')  # select the model name
model = exp.get_model()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# 100 batches of dummy input in nebullvm's ((inputs, ...), label) format
input_data = [((torch.randn(1, 3, 640, 640).to(device), ), 0) for i in range(100)]

# Run nebullvm optimization without performance loss
optimized_model = optimize_model(model, input_data=input_data, optimization_time="constrained")
```
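Note that `exp.get_model()` builds the network with randomly initialized weights. To optimize the actual detector, first load the pretrained checkpoint downloaded in step 1; a minimal sketch, assuming the weights were saved locally as `yolox_s.pth`:

```python
# Load pretrained weights (YOLOX checkpoints store the state dict under the "model" key)
ckpt = torch.load("yolox_s.pth", map_location="cpu")
model.load_state_dict(ckpt["model"])
```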
Find [here](nebullvm_optimization.py) the complete Python script with more details.

In this example, we optimized YOLOX without any loss of accuracy. To further speed up the model with more aggressive optimization techniques, proceed as follows (see the sketch after this list):
- Set *optimization_time="unconstrained"*. With the unconstrained option, nebullvm will also test time-consuming techniques such as pruning and quantization-aware training (QAT).
- Set the *metric_drop_ths* parameter to a value greater than zero (by default, *metric_drop_ths=0*). This allows nebullvm to test optimization techniques that trade off a certain metric for speed. For example, to test the maximum acceleration with at most a 3% loss of accuracy, set *metric_drop_ths=0.03* and *metric="accuracy"*.
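As a rough sketch, reusing `model` and `input_data` from the snippet above, such an aggressive run could look as follows (the 3% threshold is purely illustrative):

```python
# Allow time-consuming techniques (e.g. pruning, QAT) and trade up to 3% accuracy for speed
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.03,
    metric="accuracy",
)
```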
For more information about the nebullvm API, see the [nebullvm documentation](https://github.com/nebuly-ai/nebullvm).

Let's now compare the latency of the optimized model with that of the original model.
Note that before measuring the latency of the optimized model, it is necessary to perform some warm-up runs, as some optimizers fine-tune certain internal parameters during the first few inferences after optimization.

```python
import time

# Dummy image for benchmarking
img = torch.randn(1, 3, 640, 640).to(device)

# Check performance
warmup_iters = 30
num_iters = 100

# Unoptimized model performance
with torch.no_grad():
    for i in range(warmup_iters):
        o = model(img)

    start = time.time()
    for i in range(num_iters):
        o = model(img)
    stop = time.time()
print(f"Average inference time of unoptimized YOLOX: {(stop - start)/num_iters*1000} ms")

# Optimized model performance
with torch.no_grad():
    for i in range(warmup_iters):
        res = optimized_model(img)

    start = time.time()
    for i in range(num_iters):
        res = optimized_model(img)
    stop = time.time()
print(f"Average inference time of YOLOX optimized with nebullvm: {(stop - start)/num_iters*1000} ms")
```
Find [here](nebullvm_optimization.py) the complete Python script with more details.
demo/nebullvm/nebullvm_optimization.py ADDED
@@ -0,0 +1,51 @@
import torch
import time
from nebullvm.api.functions import optimize_model  # Install DL compilers
from yolox.exp import get_exp

# Get the YOLOX model
exp = get_exp(None, 'yolox-s')  # select the model name
model = exp.get_model()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Create dummy data for the optimizer
input_data = [((torch.randn(1, 3, 640, 640).to(device), ), 0) for i in range(100)]

# ---------- Optimization ----------
optimized_model = optimize_model(model, input_data=input_data, optimization_time="constrained")  # optimization without performance loss


# ---------- Benchmarks ----------
# Create a dummy image to test the latency of the optimized model
img = torch.randn(1, 3, 640, 640).to(device)

# Check performance
warmup_iters = 30
num_iters = 100

# Unoptimized model performance
with torch.no_grad():
    for i in range(warmup_iters):
        o = model(img)

    start = time.time()
    for i in range(num_iters):
        o = model(img)
    stop = time.time()
print(f"Average inference time of unoptimized YOLOX: {(stop - start)/num_iters*1000} ms")

# Optimized model performance
with torch.no_grad():
    for i in range(warmup_iters):
        res = optimized_model(img)

    start = time.time()
    for i in range(num_iters):
        res = optimized_model(img)
    stop = time.time()
print(f"Average inference time of YOLOX optimized with nebullvm: {(stop - start)/num_iters*1000} ms")