happyme531 committed on
Commit
50704de
·
verified ·
1 Parent(s): fe8d810

Upload 12 files

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ sam2.1_hiera_large_encoder.rknn filter=lfs diff=lfs merge=lfs -text
37
+ sam2.1_hiera_small_encoder.rknn filter=lfs diff=lfs merge=lfs -text
38
+ sam2.1_hiera_tiny_encoder.rknn filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,158 @@
1
- ---
2
- license: agpl-3.0
3
- ---
1
+ # Segment Anything 2.1 RKNN2
2
+
3
+ ## (English README see below)
4
+
5
+ 在RK3588上运行强大的Segment Anything 2.1图像分割模型!
6
+
7
+ - 推理速度(RK3588):
8
+ - Encoder(Tiny)(单NPU核): 3s
9
+ - Encoder(Small)(单NPU核): 3.5s
10
+ - Encoder(Large)(单NPU核): 12s
11
+ - Decoder(CPU): 0.1s
12
+
13
+ - 内存占用(RK3588):
14
+ - Encoder(Tiny): 0.95GB
15
+ - Encoder(Small): 1.1GB
16
+ - Encoder(Large): 4.1GB
17
+ - Decoder: 非常小, 可以忽略不计
18
+
19
+ ## 使用方法
20
+
21
+ 1. 克隆或者下载此仓库到本地. 模型较大, 请确保有足够的磁盘空间.
22
+
23
+ 2. 安装依赖
24
+
25
+ ```bash
26
+ pip install "numpy<2" pillow matplotlib opencv-python onnxruntime rknn-toolkit-lite2
27
+ ```
28
+
29
+ 3. 运行
30
+
31
+ ```bash
32
+ python test_rknn.py
33
+ ```
34
+
35
+ 你可以修改`test_rknn.py`中这一部分
36
+ ```python
37
+ def main():
38
+ # 1. 加载原始图片
39
+ path = "dog.jpg"
40
+ orig_image, input_image, (scale, offset_x, offset_y) = load_image(path)
41
+ decoder_path = "sam2.1_hiera_small_decoder.onnx"
42
+ encoder_path = "sam2.1_hiera_small_encoder.rknn"
43
+ ...
44
+ ```
45
+
46
+ 来测试不同的模型和图片. 注意, 和SAM1不同, 这里的encoder和decoder必须使用同一个版本的模型.
47
+
48
+
49
+ ## 模型转换
50
+
51
+ 1. 安装依赖
52
+
53
+ ```bash
54
+ pip install "numpy<2" onnxslim onnxruntime rknn-toolkit2 sam2
55
+ ```
56
+
57
+ 2. 下载SAM2.1的pt模型文件. 可以从[这里](https://github.com/facebookresearch/sam2?tab=readme-ov-file#model-description)下载.
58
+
59
+ 3. 转换pt模型到onnx模型. 以Tiny模型为例:
60
+
61
+ ```bash
62
+ python ./export_onnx.py --model_type sam2.1_hiera_tiny --checkpoint ./sam2.1_hiera_tiny.pt --output_encoder ./sam2.1_hiera_tiny_encoder.onnx --output_decoder sam2.1_hiera_tiny_decoder.onnx
63
+ ```
64
+
65
+ 4. 将onnx模型转换为rknn模型. 以Tiny模型为例:
66
+
67
+ ```bash
68
+ python ./convert_rknn.py sam2.1_hiera_tiny
69
+ ```
70
+ 如果在常量折叠时报错, 请尝试更新onnxruntime到最新版本.
71
+
72
+ ## 已知问题
73
+
74
+ - 只实现了图片分割, 没有实现视频分割.
75
+ - 由于RKNN-Toolkit2的问题, decoder模型在转换时会报错, 暂时需要使用CPU onnxruntime运行, 会略微增加CPU占用.
76
+
77
+ ## 参考
78
+
79
+ - [samexporter/export_sam21_cvat.py](https://github.com/hashJoe/samexporter/blob/cvat/samexporter/export_sam21_cvat.py)
80
+ - [SAM 2](https://github.com/facebookresearch/sam2)
81
+
82
+ ## English README
83
+
84
+ Run the powerful Segment Anything 2.1 image segmentation model on RK3588!
85
+
86
+ - Inference Speed (RK3588):
87
+ - Encoder(Tiny)(Single NPU Core): 3s
88
+ - Encoder(Small)(Single NPU Core): 3.5s
89
+ - Encoder(Large)(Single NPU Core): 12s
90
+ - Decoder(CPU): 0.1s
91
+
92
+ - Memory Usage (RK3588):
93
+ - Encoder(Tiny): 0.95GB
94
+ - Encoder(Small): 1.1GB
95
+ - Encoder(Large): 4.1GB
96
+ - Decoder: Negligible
97
+
98
+ ## Usage
99
+
100
+ 1. Clone or download this repository. The models are large, so make sure you have enough disk space.
101
+
102
+ 2. Install dependencies
103
+
104
+ ```bash
105
+ pip install "numpy<2" pillow matplotlib opencv-python onnxruntime rknn-toolkit-lite2
106
+ ```
107
+
108
+ 3. Run
109
+
110
+ ```bash
111
+ python test_rknn.py
112
+ ```
113
+
114
+ You can modify this part in `test_rknn.py`
115
+ ```python
116
+ def main():
117
+ # 1. Load original image
118
+ path = "dog.jpg"
119
+ orig_image, input_image, (scale, offset_x, offset_y) = load_image(path)
120
+ decoder_path = "sam2.1_hiera_small_decoder.onnx"
121
+ encoder_path = "sam2.1_hiera_small_encoder.rknn"
122
+ ...
123
+ ```
124
+
125
+ to test different models and images. Note that, unlike SAM1, the encoder and decoder here must come from the same model version (for example, both Small).
126
+
127
+ ## Model Conversion
128
+
129
+ 1. Install dependencies
130
+
131
+ ```bash
132
+ pip install "numpy<2" onnxslim onnxruntime rknn-toolkit2 sam2
133
+ ```
134
+
135
+ 2. Download the SAM2.1 `.pt` checkpoint files. You can download them from [here](https://github.com/facebookresearch/sam2?tab=readme-ov-file#model-description).
136
+
137
+ 3. Convert the `.pt` checkpoint to ONNX models, using the Tiny model as an example:
138
+
139
+ ```bash
140
+ python ./export_onnx.py --model_type sam2.1_hiera_tiny --checkpoint ./sam2.1_hiera_tiny.pt --output_encoder ./sam2.1_hiera_tiny_encoder.onnx --output_decoder sam2.1_hiera_tiny_decoder.onnx
141
+ ```
142
+
143
+ 4. Convert the ONNX models to RKNN models, using the Tiny model as an example:
144
+
145
+ ```bash
146
+ python ./convert_rknn.py sam2.1_hiera_tiny
147
+ ```
148
+ If you encounter errors during constant folding, try updating onnxruntime to the latest version.
149
+
150
+ ## Known Issues
151
+
152
+ - Only image segmentation is implemented; video segmentation is not supported.
153
+ - Because of an issue in RKNN-Toolkit2, converting the decoder model fails. For now the decoder runs on the CPU with onnxruntime, which slightly increases CPU usage.
154
+
155
+ ## References
156
+
157
+ - [samexporter/export_sam21_cvat.py](https://github.com/hashJoe/samexporter/blob/cvat/samexporter/export_sam21_cvat.py)
158
+ - [SAM 2](https://github.com/facebookresearch/sam2)
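
The README's note that the encoder and decoder must come from the same model can be enforced by deriving both file names from a single model name. The sketch below is illustrative only: `model_files` and `KNOWN_MODELS` are hypothetical names that do not exist in this repository; only the file-name pattern is taken from the files uploaded in this commit.

```python
# Hypothetical helper (not part of the repository): derive a matching
# encoder/decoder pair from one model name so the two cannot drift apart.
KNOWN_MODELS = ("sam2.1_hiera_tiny", "sam2.1_hiera_small", "sam2.1_hiera_large")


def model_files(model_name: str) -> tuple[str, str]:
    """Return (encoder_path, decoder_path) for one of the uploaded models."""
    if model_name not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model_name!r}")
    # File-name pattern as used by the files in this commit:
    # the encoder is an .rknn file, the decoder an .onnx file.
    return f"{model_name}_encoder.rknn", f"{model_name}_decoder.onnx"


encoder_path, decoder_path = model_files("sam2.1_hiera_small")
print(encoder_path, decoder_path)
```

In `test_rknn.py`, the two hard-coded paths could then be replaced by a single call such as `model_files("sam2.1_hiera_tiny")`.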
convert_rknn.py ADDED
@@ -0,0 +1,98 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ import datetime
5
+ import argparse
6
+ from rknn.api import RKNN
7
+ from sys import exit
8
+ import os
9
+ import onnxslim
10
+
11
+ num_pointss = [1]
12
+ num_labelss = [1]
13
+
14
+ def convert_to_rknn(onnx_model, model_part, dataset="/home/zt/rk3588-nn/rknn_model_zoo/datasets/COCO/coco_subset_20.txt", quantize=False):
15
+ """Convert a single ONNX model to RKNN format."""
16
+ rknn_model = onnx_model.replace(".onnx",".rknn")
17
+ timedate_iso = datetime.datetime.now().isoformat()
18
+
19
+ print(f"\nConverting {onnx_model} to {rknn_model}")
20
+
21
+ input_shapes = None
22
+
23
+ if model_part == "encoder":
24
+ input_shapes = None
25
+ elif model_part == "decoder":
26
+ input_shapes = [
27
+ [
28
+ [1, 256, 64, 64], # image_embedding
29
+ [1, 32, 256, 256], # high_res_feats_0
30
+ [1, 64, 128, 128], # high_res_feats_1
31
+ [num_labels, num_points, 2], # point_coords
32
+ [num_labels, num_points], # point_labels
33
+ [num_labels, 1, 256, 256], # mask_input
34
+ [num_labels], # has_mask_input
35
+ ]
36
+ for num_labels in num_labelss
37
+ for num_points in num_pointss
38
+ ]
39
+
40
+ rknn = RKNN(verbose=True)
41
+ rknn.config(
42
+ dynamic_input=input_shapes,
43
+ std_values=[[255,255,255]] if model_part == "encoder" else None,
44
+ quantized_dtype='w8a8',
45
+ quantized_algorithm='normal',
46
+ quantized_method='channel',
47
+ quantized_hybrid_level=0,
48
+ target_platform='rk3588',
49
+ quant_img_RGB2BGR = False,
50
+ float_dtype='float16',
51
+ optimization_level=3,
52
+ custom_string=f"converted at {timedate_iso}",
53
+ remove_weight=False,
54
+ compress_weight=False,
55
+ inputs_yuv_fmt=None,
56
+ single_core_mode=False,
57
+ model_pruning=False,
58
+ op_target=None,
59
+ quantize_weight=False,
60
+ remove_reshape=False,
61
+ sparse_infer=False,
62
+ enable_flash_attention=False,
63
+ )
64
+
65
+ ret = rknn.load_onnx(model=onnx_model)
66
+ ret = rknn.build(do_quantization=quantize, dataset=dataset, rknn_batch_size=None)
67
+ ret = rknn.export_rknn(rknn_model)
68
+ print(f"Finished converting {rknn_model}\n")
69
+
70
+ def main():
71
+ parser = argparse.ArgumentParser(description='Convert SAM models from ONNX to RKNN format')
72
+ parser.add_argument('model_name', type=str, help='Model name, e.g. sam2.1_hiera_tiny')
73
+ args = parser.parse_args()
74
+
75
+ # Build the encoder and decoder file names
76
+ encoder_onnx = f"{args.model_name}_encoder.onnx"
77
+ decoder_onnx = f"{args.model_name}_decoder.onnx"
78
+
79
+ # Check that the files exist
80
+ for model in [encoder_onnx, decoder_onnx]:
81
+ if not os.path.exists(model):
82
+ print(f"Error: file not found: {model}")
83
+ exit(1)
84
+
85
+ # Convert the encoder and decoder
86
+ # The encoder must be run through onnxslim first
87
+ print("Converting encoder...")
88
+ onnxslim.slim(encoder_onnx, output_model="encoder_slim.onnx", skip_fusion_patterns=["EliminationSlice"])
89
+ convert_to_rknn("encoder_slim.onnx", model_part="encoder")
90
+ os.rename("encoder_slim.rknn", encoder_onnx.replace(".onnx", ".rknn"))
91
+ os.remove("encoder_slim.onnx")
92
+
93
+ # convert_to_rknn(decoder_onnx, model_part="decoder") # broken: decoder conversion currently fails
94
+
95
+ print("All model conversions finished!")
96
+
97
+ if __name__ == "__main__":
98
+ main()
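
The nested list comprehension in `convert_to_rknn` builds one set of decoder input shapes per combination of label count and point count. As a minimal standalone sketch of that construction (the function name `decoder_input_shapes` is hypothetical; the fixed feature-map shapes are copied from the script above):

```python
# Sketch of the shape sets that convert_rknn.py builds for the decoder.
# `decoder_input_shapes` is an illustrative name, not part of the script.

def decoder_input_shapes(num_labels_options, num_points_options):
    """Return one list of seven input shapes per (num_labels, num_points) pair."""
    return [
        [
            [1, 256, 64, 64],             # image_embedding
            [1, 32, 256, 256],            # high_res_feats_0
            [1, 64, 128, 128],            # high_res_feats_1
            [num_labels, num_points, 2],  # point_coords
            [num_labels, num_points],     # point_labels
            [num_labels, 1, 256, 256],    # mask_input
            [num_labels],                 # has_mask_input
        ]
        for num_labels in num_labels_options
        for num_points in num_points_options
    ]


shapes = decoder_input_shapes([1], [1, 2])
print(len(shapes))    # 2 shape sets: (1 label, 1 point) and (1 label, 2 points)
print(shapes[1][3])   # [1, 2, 2]
```

With the script's defaults (`num_labelss = [1]`, `num_pointss = [1]`) this yields a single shape set, so the decoder would be built for exactly one point prompt.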
dog.jpg ADDED
export_onnx.py ADDED
@@ -0,0 +1,278 @@
1
+ from typing import Any
2
+ import argparse
3
+ import pathlib
4
+
5
+ import torch
6
+ from torch import nn
7
+ from sam2.build_sam import build_sam2
8
+ from sam2.modeling.sam2_base import SAM2Base
9
+
10
+
11
+ class SAM2ImageEncoder(nn.Module):
12
+ def __init__(self, sam_model: SAM2Base) -> None:
13
+ super().__init__()
14
+ self.model = sam_model
15
+ self.image_encoder = sam_model.image_encoder
16
+ self.no_mem_embed = sam_model.no_mem_embed
17
+
18
+ def forward(self, x: torch.Tensor) -> tuple[Any, Any, Any]:
19
+ backbone_out = self.image_encoder(x)
20
+ backbone_out["backbone_fpn"][0] = self.model.sam_mask_decoder.conv_s0(
21
+ backbone_out["backbone_fpn"][0]
22
+ )
23
+ backbone_out["backbone_fpn"][1] = self.model.sam_mask_decoder.conv_s1(
24
+ backbone_out["backbone_fpn"][1]
25
+ )
26
+
27
+ feature_maps = backbone_out["backbone_fpn"][
28
+ -self.model.num_feature_levels :
29
+ ]
30
+ vision_pos_embeds = backbone_out["vision_pos_enc"][
31
+ -self.model.num_feature_levels :
32
+ ]
33
+
34
+ feat_sizes = [(x.shape[-2], x.shape[-1]) for x in vision_pos_embeds]
35
+
36
+ # flatten NxCxHxW to HWxNxC
37
+ vision_feats = [x.flatten(2).permute(2, 0, 1) for x in feature_maps]
38
+ vision_feats[-1] = vision_feats[-1] + self.no_mem_embed
39
+
40
+ feats = [
41
+ feat.permute(1, 2, 0).reshape(1, -1, *feat_size)
42
+ for feat, feat_size in zip(vision_feats[::-1], feat_sizes[::-1])
43
+ ][::-1]
44
+
45
+ return feats[0], feats[1], feats[2]
46
+
47
+
48
+ class SAM2ImageDecoder(nn.Module):
49
+ def __init__(self, sam_model: SAM2Base, multimask_output: bool) -> None:
50
+ super().__init__()
51
+ self.mask_decoder = sam_model.sam_mask_decoder
52
+ self.prompt_encoder = sam_model.sam_prompt_encoder
53
+ self.model = sam_model
54
+ self.img_size = sam_model.image_size
55
+ self.multimask_output = multimask_output
56
+
57
+ @torch.no_grad()
58
+ def forward(
59
+ self,
60
+ image_embed: torch.Tensor,
61
+ high_res_feats_0: torch.Tensor,
62
+ high_res_feats_1: torch.Tensor,
63
+ point_coords: torch.Tensor,
64
+ point_labels: torch.Tensor,
65
+ orig_im_size: torch.Tensor,
66
+ mask_input: torch.Tensor,
67
+ has_mask_input: torch.Tensor,
68
+ ):
69
+ sparse_embedding = self._embed_points(point_coords, point_labels)
70
+ self.sparse_embedding = sparse_embedding
71
+ dense_embedding = self._embed_masks(mask_input, has_mask_input)
72
+
73
+ high_res_feats = [high_res_feats_0, high_res_feats_1]
74
+ image_embed = image_embed
75
+
76
+ masks, iou_predictions, _, _ = self.mask_decoder.predict_masks(
77
+ image_embeddings=image_embed,
78
+ image_pe=self.prompt_encoder.get_dense_pe(),
79
+ sparse_prompt_embeddings=sparse_embedding,
80
+ dense_prompt_embeddings=dense_embedding,
81
+ repeat_image=False,
82
+ high_res_features=high_res_feats,
83
+ )
84
+
85
+ if self.multimask_output:
86
+ masks = masks[:, 1:, :, :]
87
+ iou_predictions = iou_predictions[:, 1:]
88
+ else:
89
+ masks, iou_predictions = (
90
+ self.mask_decoder._dynamic_multimask_via_stability(
91
+ masks, iou_predictions
92
+ )
93
+ )
94
+
95
+ masks = torch.clamp(masks, -32.0, 32.0)
96
+
97
+ return masks, iou_predictions
98
+
99
+ def _embed_points(
100
+ self, point_coords: torch.Tensor, point_labels: torch.Tensor
101
+ ) -> torch.Tensor:
102
+
103
+ point_coords = point_coords + 0.5
104
+
105
+ padding_point = torch.zeros(
106
+ (point_coords.shape[0], 1, 2), device=point_coords.device
107
+ )
108
+ padding_label = -torch.ones(
109
+ (point_labels.shape[0], 1), device=point_labels.device
110
+ )
111
+ point_coords = torch.cat([point_coords, padding_point], dim=1)
112
+ point_labels = torch.cat([point_labels, padding_label], dim=1)
113
+
114
+ point_coords[:, :, 0] = point_coords[:, :, 0] / self.model.image_size
115
+ point_coords[:, :, 1] = point_coords[:, :, 1] / self.model.image_size
116
+
117
+ point_embedding = self.prompt_encoder.pe_layer._pe_encoding(
118
+ point_coords
119
+ )
120
+ point_labels = point_labels.unsqueeze(-1).expand_as(point_embedding)
121
+
122
+ point_embedding = point_embedding * (point_labels != -1)
123
+ point_embedding = (
124
+ point_embedding
125
+ + self.prompt_encoder.not_a_point_embed.weight
126
+ * (point_labels == -1)
127
+ )
128
+
129
+ for i in range(self.prompt_encoder.num_point_embeddings):
130
+ point_embedding = (
131
+ point_embedding
132
+ + self.prompt_encoder.point_embeddings[i].weight
133
+ * (point_labels == i)
134
+ )
135
+
136
+ return point_embedding
137
+
138
+ def _embed_masks(
139
+ self, input_mask: torch.Tensor, has_mask_input: torch.Tensor
140
+ ) -> torch.Tensor:
141
+ mask_embedding = has_mask_input * self.prompt_encoder.mask_downscaling(
142
+ input_mask
143
+ )
144
+ mask_embedding = mask_embedding + (
145
+ 1 - has_mask_input
146
+ ) * self.prompt_encoder.no_mask_embed.weight.reshape(1, -1, 1, 1)
147
+ return mask_embedding
148
+
149
+
150
+ if __name__ == "__main__":
151
+ parser = argparse.ArgumentParser(
152
+ description="Export the SAM 2.1 image encoder and the prompt encoder plus mask decoder to ONNX models."
153
+ )
154
+ parser.add_argument(
155
+ "--checkpoint",
156
+ type=str,
157
+ required=True,
158
+ help="The path to the SAM model checkpoint.",
159
+ )
160
+
161
+ parser.add_argument(
162
+ "--output_encoder",
163
+ type=str,
164
+ required=True,
165
+ help="The filename to save the encoder ONNX model to.",
166
+ )
167
+
168
+ parser.add_argument(
169
+ "--output_decoder",
170
+ type=str,
171
+ required=True,
172
+ help="The filename to save the decoder ONNX model to.",
173
+ )
174
+
175
+ parser.add_argument(
176
+ "--model_type",
177
+ type=str,
178
+ required=True,
179
+ help="In the form of sam2.1_hiera_{tiny, small, base_plus, large}.",
180
+ )
181
+
182
+ parser.add_argument(
183
+ "--opset",
184
+ type=int,
185
+ default=17,
186
+ help="The ONNX opset version to use. Must be >=11",
187
+ )
188
+
189
+ args = parser.parse_args()
190
+
191
+ input_size = (1024, 1024)
192
+ multimask_output = False
193
+ model_type = args.model_type
194
+ if model_type == "sam2.1_hiera_tiny":
195
+ model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"
196
+ elif model_type == "sam2.1_hiera_small":
197
+ model_cfg = "configs/sam2.1/sam2.1_hiera_s.yaml"
198
+ elif model_type == "sam2.1_hiera_base_plus":
199
+ model_cfg = "configs/sam2.1/sam2.1_hiera_b+.yaml"
200
+ elif model_type == "sam2.1_hiera_large":
201
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
202
+ else:
203
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
204
+
205
+ sam2_model = build_sam2(model_cfg, args.checkpoint, device="cpu")
206
+ img = torch.randn(1, 3, input_size[0], input_size[1]).cpu()
207
+ sam2_encoder = SAM2ImageEncoder(sam2_model).cpu()
208
+ high_res_feats_0, high_res_feats_1, image_embed = sam2_encoder(img)
209
+
210
+ pathlib.Path(args.output_encoder).parent.mkdir(parents=True, exist_ok=True)
211
+ torch.onnx.export(
212
+ sam2_encoder,
213
+ img,
214
+ args.output_encoder,
215
+ export_params=True,
216
+ opset_version=args.opset,
217
+ do_constant_folding=True,
218
+ input_names=["image"],
219
+ output_names=["high_res_feats_0", "high_res_feats_1", "image_embed"],
220
+ )
221
+ print("Saved encoder to", args.output_encoder)
222
+
223
+ sam2_decoder = SAM2ImageDecoder(
224
+ sam2_model, multimask_output=multimask_output
225
+ ).cpu()
226
+
227
+ embed_dim = sam2_model.sam_prompt_encoder.embed_dim
228
+ embed_size = (
229
+ sam2_model.image_size // sam2_model.backbone_stride,
230
+ sam2_model.image_size // sam2_model.backbone_stride,
231
+ )
232
+ mask_input_size = [4 * x for x in embed_size]
233
+ print(embed_dim, embed_size, mask_input_size)
234
+
235
+ point_coords = torch.randint(
236
+ low=0, high=input_size[1], size=(1, 5, 2), dtype=torch.float
237
+ )
238
+ point_labels = torch.randint(low=0, high=1, size=(1, 5), dtype=torch.float)
239
+ mask_input = torch.randn(1, 1, *mask_input_size, dtype=torch.float)
240
+ has_mask_input = torch.tensor([1], dtype=torch.float)
241
+ orig_im_size = torch.tensor([input_size[0], input_size[1]], dtype=torch.int)
242
+
243
+ pathlib.Path(args.output_decoder).parent.mkdir(parents=True, exist_ok=True)
244
+ torch.onnx.export(
245
+ sam2_decoder,
246
+ (
247
+ image_embed,
248
+ high_res_feats_0,
249
+ high_res_feats_1,
250
+ point_coords,
251
+ point_labels,
252
+ orig_im_size,
253
+ mask_input,
254
+ has_mask_input,
255
+ ),
256
+ args.output_decoder,
257
+ export_params=True,
258
+ opset_version=args.opset,
259
+ do_constant_folding=True,
260
+ input_names=[
261
+ "image_embed",
262
+ "high_res_feats_0",
263
+ "high_res_feats_1",
264
+ "point_coords",
265
+ "point_labels",
266
+ "orig_im_size",
267
+ "mask_input",
268
+ "has_mask_input",
269
+ ],
270
+ output_names=["masks", "iou_predictions"],
271
+ dynamic_axes={
272
+ "point_coords": {0: "num_labels", 1: "num_points"},
273
+ "point_labels": {0: "num_labels", 1: "num_points"},
274
+ "mask_input": {0: "num_labels"},
275
+ "has_mask_input": {0: "num_labels"},
276
+ },
277
+ )
278
+ print("Saved decoder to", args.output_decoder)
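
The `if`/`elif` chain in `export_onnx.py` silently falls back to the Large config when `--model_type` is not recognized, which can pair a config with the wrong checkpoint. One possible alternative, shown here only as a sketch and not as the script's actual behavior, is a dictionary lookup that fails loudly. `MODEL_CONFIGS` and `config_for` are hypothetical names; the config paths are the ones used in the script.

```python
# Alternative sketch (not the script's code): map model types to config files
# with a dictionary and raise on unknown names instead of defaulting to Large.
MODEL_CONFIGS = {
    "sam2.1_hiera_tiny": "configs/sam2.1/sam2.1_hiera_t.yaml",
    "sam2.1_hiera_small": "configs/sam2.1/sam2.1_hiera_s.yaml",
    "sam2.1_hiera_base_plus": "configs/sam2.1/sam2.1_hiera_b+.yaml",
    "sam2.1_hiera_large": "configs/sam2.1/sam2.1_hiera_l.yaml",
}


def config_for(model_type: str) -> str:
    """Return the config path for a model type, or raise ValueError."""
    try:
        return MODEL_CONFIGS[model_type]
    except KeyError:
        valid = ", ".join(sorted(MODEL_CONFIGS))
        raise ValueError(
            f"unknown model type {model_type!r}; expected one of: {valid}"
        ) from None


print(config_for("sam2.1_hiera_tiny"))
```

Whether a hard failure is preferable to the existing fallback is a design choice; the fallback keeps old command lines working, while the strict lookup catches typos such as `sam2_hiera_tiny`.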
sam2.1_hiera_large_decoder.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c039b2455b4e92dfeb8cb8e4d10a98a92a79ec1550a7119c997bad4352811554
3
+ size 16526061
sam2.1_hiera_large_encoder.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0ce5ae036eb273f4e017481c8cb744e50c84a93e81e2f6a84ff4b89a118e756a
3
+ size 1419024037
sam2.1_hiera_small_decoder.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4e7ba7a80bfae89c1a660d3b64291fa4f5a2de15022a4e8eab933218d4f34582
3
+ size 16526003
sam2.1_hiera_small_encoder.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d8b9efce9e5d12900a508dc1b79dfbd389057136a6d2ab4cb66654961f3106ef
3
+ size 374531749
sam2.1_hiera_tiny_decoder.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f594db10b3c7b4d9de7f8854693ea6f7a880e5e228ad08d7823393233e65f4fa
3
+ size 16525993
sam2.1_hiera_tiny_encoder.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c3750eef90b87ab63cfefbf4f89858072a4891818c315d96dddeea172119cba1
3
+ size 339018597
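
The model files above are stored as Git LFS pointers: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper, and the sample text is the Tiny encoder pointer from this commit.

```python
# Minimal parser for the three-line Git LFS pointer text shown in this diff.
# `parse_lfs_pointer` is an illustrative helper, not part of the repository.

def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:c3750eef90b87ab63cfefbf4f89858072a4891818c315d96dddeea172119cba1
size 339018597
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], info["size_bytes"])
```

The `size` field is a quick way to estimate how much disk space a clone with LFS objects will need before downloading.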
test_onnx.py ADDED
@@ -0,0 +1,195 @@
1
+ import os
2
+ os.chdir(os.path.dirname(os.path.abspath(__file__)))
3
+
4
+ import numpy as np
5
+ import torch
6
+ import onnxruntime
7
+ from PIL import Image
8
+ import requests
9
+ from io import BytesIO
10
+ import matplotlib.pyplot as plt
11
+ from sam2.build_sam import build_sam2
12
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
13
+
14
+
15
+ def load_image(url):
16
+ """Load and preprocess the image."""
17
+ response = requests.get(url)
18
+ image = Image.open(BytesIO(response.content)).convert("RGB")
19
+ print(f"Original image size: {image.size}")
20
+
21
+ # Compute the resized dimensions, preserving the aspect ratio
22
+ target_size = (1024, 1024)
23
+ w, h = image.size
24
+ scale = min(target_size[0] / w, target_size[1] / h)
25
+ new_w = int(w * scale)
26
+ new_h = int(h * scale)
27
+ print(f"Scale factor: {scale}")
28
+ print(f"Resized dimensions: {new_w}x{new_h}")
29
+
30
+ # Resize the image
31
+ resized_image = image.resize((new_w, new_h), Image.Resampling.LANCZOS)
32
+
33
+ # Create a 1024x1024 black background
34
+ processed_image = Image.new("RGB", target_size, (0, 0, 0))
35
+ # Paste the resized image at the center
36
+ paste_x = (target_size[0] - new_w) // 2
37
+ paste_y = (target_size[1] - new_h) // 2
38
+ print(f"Paste position: ({paste_x}, {paste_y})")
39
+ processed_image.paste(resized_image, (paste_x, paste_y))
40
+
41
+ # Save the processed image for inspection
42
+ processed_image.save("debug_processed_image.png")
43
+
44
+ # Convert to a numpy array and normalize to [0, 1]
45
+ img_np = np.array(processed_image).astype(np.float32) / 255.0
46
+ # Reorder the dimensions from HWC to CHW
47
+ img_np = img_np.transpose(2, 0, 1)
48
+ # Add the batch dimension
49
+ img_np = np.expand_dims(img_np, axis=0)
50
+
51
+ print(f"Final input tensor shape: {img_np.shape}")
52
+
53
+ return image, img_np, (scale, paste_x, paste_y)
54
+
55
+ def prepare_point_input(point_coords, point_labels, image_size=(1024, 1024)):
56
+ """Prepare the point-prompt input data."""
57
+ point_coords = np.array(point_coords, dtype=np.float32)
58
+ point_labels = np.array(point_labels, dtype=np.float32)
59
+
60
+ # Add the batch dimension
61
+ point_coords = np.expand_dims(point_coords, axis=0)
62
+ point_labels = np.expand_dims(point_labels, axis=0)
63
+
64
+ # Prepare the mask inputs
65
+ mask_input = np.zeros((1, 1, 256, 256), dtype=np.float32)
66
+ has_mask_input = np.zeros(1, dtype=np.float32)
67
+ orig_im_size = np.array(image_size, dtype=np.int32)
68
+
69
+ return point_coords, point_labels, mask_input, has_mask_input, orig_im_size
70
+
71
+ def main():
72
+ # 1. Load the original image
73
+ url = "https://raw.githubusercontent.com/facebookresearch/segment-anything/main/notebooks/images/dog.jpg"
74
+ orig_image, input_image, (scale, offset_x, offset_y) = load_image(url)
75
+
76
+ # 2. Prepare the input point: the click coordinates must be adjusted by scale and offset
77
+ input_point_orig = [[750, 400]]
78
+ input_point = [[
79
+ int(x * scale + offset_x),
80
+ int(y * scale + offset_y)
81
+ ] for x, y in input_point_orig]
82
+ print(f"Original point: {input_point_orig}")
83
+ print(f"Transformed point: {input_point}")
84
+ input_label = [1]
85
+
86
+ # 3. Run the PyTorch model
87
+ print("Running PyTorch model...")
88
+ checkpoint = "sam2.1_hiera_large.pt"
89
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
90
+ predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
91
+
92
+ with torch.inference_mode():
93
+ predictor.set_image(orig_image)
94
+ masks_pt, iou_scores_pt, low_res_masks_pt = predictor.predict(
95
+ point_coords=np.array(input_point_orig), # the predictor was given orig_image, so it expects original-image coordinates
96
+ point_labels=np.array(input_label),
97
+ multimask_output=True
98
+ )
99
+
100
+ # 4. Run the ONNX model
101
+ print("Running ONNX model...")
102
+ encoder_path = "sam2.1_hiera_tiny_encoder.s.onnx"
103
+ decoder_path = "sam2.1_hiera_tiny_decoder.onnx"
104
+
105
+ # Create the ONNX Runtime sessions
106
+ encoder_session = onnxruntime.InferenceSession(encoder_path)
107
+ decoder_session = onnxruntime.InferenceSession(decoder_path)
108
+
109
+ # Run the encoder
110
+ encoder_inputs = {'image': input_image}
111
+ high_res_feats_0, high_res_feats_1, image_embed = encoder_session.run(None, encoder_inputs)
112
+
113
+ # Prepare the decoder inputs
114
+ point_coords, point_labels, mask_input, has_mask_input, orig_im_size = prepare_point_input(
115
+ input_point, input_label, orig_image.size[::-1]
116
+ )
117
+
118
+ # Run the decoder
119
+ decoder_inputs = {
120
+ 'image_embed': image_embed,
121
+ 'high_res_feats_0': high_res_feats_0,
122
+ 'high_res_feats_1': high_res_feats_1,
123
+ 'point_coords': point_coords,
124
+ 'point_labels': point_labels,
125
+ # 'orig_im_size': orig_im_size,
126
+ 'mask_input': mask_input,
127
+ 'has_mask_input': has_mask_input,
128
+ }
129
+
130
+ low_res_masks, iou_predictions = decoder_session.run(None, decoder_inputs)
131
+
132
+ # Post-processing: scale low_res_masks to the original image size
133
+ w, h = orig_image.size
134
+
135
+ # 1. First scale the masks to 1024x1024
136
+ masks_1024 = torch.nn.functional.interpolate(
137
+ torch.from_numpy(low_res_masks),
138
+ size=(1024, 1024),
139
+ mode="bilinear",
140
+ align_corners=False
141
+ )
142
+
143
+ # 2. Remove the padding
144
+ new_h = int(h * scale)
145
+ new_w = int(w * scale)
146
+ start_h = (1024 - new_h) // 2
147
+ start_w = (1024 - new_w) // 2
148
+ masks_no_pad = masks_1024[..., start_h:start_h+new_h, start_w:start_w+new_w]
149
+
150
+ # 3. Scale to the original image size
151
+ masks_onnx = torch.nn.functional.interpolate(
152
+ masks_no_pad,
153
+ size=(h, w),
154
+ mode="bilinear",
155
+ align_corners=False
156
+ )
157
+
158
+ # 4. Binarize
159
+ masks_onnx = masks_onnx > 0.0
160
+ masks_onnx = masks_onnx.numpy()
161
+
162
+ # After running the ONNX model, print the output shapes
163
+ print(f"\nOutput shapes:")
164
+ print(f"PyTorch masks shape: {masks_pt.shape}")
165
+ print(f"ONNX masks shape: {masks_onnx.shape}")
166
+
167
+ # Visualization (the difference plot is disabled for now)
168
+ plt.figure(figsize=(10, 5))
169
+
170
+ # PyTorch result
171
+ plt.subplot(121)
172
+ plt.imshow(orig_image)
173
+ plt.imshow(masks_pt[0], alpha=0.5)
174
+ plt.plot(input_point_orig[0][0], input_point_orig[0][1], 'rx')
175
+ plt.title('PyTorch Output')
176
+ plt.axis('off')
177
+
178
+ # ONNX result
179
+ plt.subplot(122)
180
+ plt.imshow(orig_image)
181
+ plt.imshow(masks_onnx[0,0], alpha=0.5)
182
+ plt.plot(input_point_orig[0][0], input_point_orig[0][1], 'rx')
183
+ plt.title('ONNX Output')
184
+ plt.axis('off')
185
+
186
+ plt.tight_layout()
187
+ plt.show()
188
+
189
+ # 6. Print some statistics
190
+ print("\nStatistics:")
191
+ print(f"PyTorch IoU scores: {iou_scores_pt}")
192
+ print(f"ONNX IoU predictions: {iou_predictions}")
193
+
194
+ if __name__ == "__main__":
195
+ main()
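
`test_onnx.py` compares the PyTorch and ONNX masks only visually. A numeric check, such as the intersection-over-union of the two binary masks, may be a useful complement. The sketch below is a dependency-free illustration: `mask_iou` is a hypothetical helper, and the tiny masks are made-up data rather than real model output. It also accepts 2-D NumPy arrays, although `np.logical_and` and `np.logical_or` would be faster for real masks.

```python
# Hypothetical helper (not in test_onnx.py): IoU between two binary masks.

def mask_iou(mask_a, mask_b) -> float:
    """Intersection-over-union of two equally sized 2-D binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same height")
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        if len(row_a) != len(row_b):
            raise ValueError("masks must have the same width")
        for a, b in zip(row_a, row_b):
            a, b = bool(a), bool(b)
            intersection += a and b
            union += a or b
    # Two empty masks are treated as agreeing perfectly.
    return intersection / union if union else 1.0


# Made-up 2x3 masks standing in for the PyTorch and ONNX outputs.
pytorch_mask = [[0, 1, 1],
                [0, 1, 1]]
onnx_mask = [[0, 1, 1],
             [0, 0, 1]]
print(mask_iou(pytorch_mask, onnx_mask))  # 3 shared pixels / 4 in the union = 0.75
```

An IoU close to 1.0 between the two backends would suggest the export preserved the model's behavior for that prompt.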
test_rknn.py ADDED
@@ -0,0 +1,178 @@
1
+ import os
2
+ import time
3
+ os.chdir(os.path.dirname(os.path.abspath(__file__)))
4
+
5
+ import numpy as np
6
+ import onnxruntime
7
+ from rknnlite.api import RKNNLite
8
+ from PIL import Image
9
+ import matplotlib.pyplot as plt
10
+ import cv2
11
+
12
+
13
+ def load_image(path):
14
+ """Load and preprocess the image."""
15
+ image = Image.open(path).convert("RGB")
16
+ print(f"Original image size: {image.size}")
17
+
18
+ # Compute the resized dimensions, preserving the aspect ratio
19
+ target_size = (1024, 1024)
20
+ w, h = image.size
21
+ scale = min(target_size[0] / w, target_size[1] / h)
22
+ new_w = int(w * scale)
23
+ new_h = int(h * scale)
24
+ print(f"Scale factor: {scale}")
25
+ print(f"Resized dimensions: {new_w}x{new_h}")
26
+
27
+ # Resize the image
28
+ resized_image = image.resize((new_w, new_h), Image.Resampling.LANCZOS)
29
+
30
+ # Create a 1024x1024 black background
31
+ processed_image = Image.new("RGB", target_size, (0, 0, 0))
32
+ # Paste the resized image at the center
33
+ paste_x = (target_size[0] - new_w) // 2
34
+ paste_y = (target_size[1] - new_h) // 2
35
+ print(f"Paste position: ({paste_x}, {paste_y})")
36
+ processed_image.paste(resized_image, (paste_x, paste_y))
37
+
38
+ # Save the processed image for inspection
39
+ processed_image.save("debug_processed_image.png")
40
+
41
+ # Convert to a numpy array; normalization to [0, 1] is folded into the model
42
+ img_np = np.array(processed_image).astype(np.float32) # / 255.0
43
+ # Reorder the dimensions from HWC to CHW
44
+ img_np = img_np.transpose(2, 0, 1)
45
+ # Add the batch dimension
46
+ img_np = np.expand_dims(img_np, axis=0)
47
+
48
+ print(f"Final input tensor shape: {img_np.shape}")
49
+
50
+ return image, img_np, (scale, paste_x, paste_y)
51
+
52
+ def prepare_point_input(point_coords, point_labels, image_size=(1024, 1024)):
53
+ """Prepare the point-prompt input data."""
54
+ point_coords = np.array(point_coords, dtype=np.float32)
55
+ point_labels = np.array(point_labels, dtype=np.float32)
56
+
57
+ # Add the batch dimension
58
+ point_coords = np.expand_dims(point_coords, axis=0)
59
+ point_labels = np.expand_dims(point_labels, axis=0)
60
+
61
+ # Prepare the mask inputs
62
+ mask_input = np.zeros((1, 1, 256, 256), dtype=np.float32)
63
+ has_mask_input = np.zeros(1, dtype=np.float32)
64
+ orig_im_size = np.array(image_size, dtype=np.int32)
65
+
66
+ return point_coords, point_labels, mask_input, has_mask_input, orig_im_size
67
+
68
+ def main():
69
+ # 1. Load the original image
70
+ path = "dog.jpg"
71
+ orig_image, input_image, (scale, offset_x, offset_y) = load_image(path)
72
+ decoder_path = "sam2.1_hiera_small_decoder.onnx"
73
+ encoder_path = "sam2.1_hiera_small_encoder.rknn"
74
+
75
+ # 2. Prepare the input point
76
+ # input_point_orig = [[750, 400]]
77
+ input_point_orig = [[189, 394]]
78
+ input_point = [[
79
+ int(x * scale + offset_x),
80
+ int(y * scale + offset_y)
81
+ ] for x, y in input_point_orig]
82
+ input_label = [1]
83
+
84
+ # 3. Run the RKNN encoder
85
+ print("Running RKNN encoder...")
86
+ rknn_lite = RKNNLite(verbose=False)
87
+
88
+ ret = rknn_lite.load_rknn(encoder_path)
89
+ if ret != 0:
90
+ print('Load RKNN model failed')
91
+ exit(ret)
92
+
93
+ ret = rknn_lite.init_runtime()
94
+ if ret != 0:
95
+ print('Init runtime environment failed')
96
+ exit(ret)
97
+ start_time = time.time()
98
+ encoder_outputs = rknn_lite.inference(inputs=[input_image], data_format="nchw")
99
+ end_time = time.time()
100
+ print(f"RKNN encoder time: {end_time - start_time} seconds")
101
+ high_res_feats_0, high_res_feats_1, image_embed = encoder_outputs
102
+ rknn_lite.release()
103
+
104
+ # 4. Run the ONNX decoder
105
+ print("Running ONNX decoder...")
106
+ decoder_session = onnxruntime.InferenceSession(decoder_path)
107
+
108
+ point_coords, point_labels, mask_input, has_mask_input, orig_im_size = prepare_point_input(
109
+ input_point, input_label, orig_image.size[::-1]
110
+ )
111
+
112
+ decoder_inputs = {
113
+ 'image_embed': image_embed,
114
+ 'high_res_feats_0': high_res_feats_0,
115
+ 'high_res_feats_1': high_res_feats_1,
116
+ 'point_coords': point_coords,
117
+ 'point_labels': point_labels,
118
+ 'mask_input': mask_input,
119
+ 'has_mask_input': has_mask_input,
120
+ }
121
+ start_time = time.time()
122
+ low_res_masks, iou_predictions = decoder_session.run(None, decoder_inputs)
123
+ end_time = time.time()
124
+ print(f"ONNX decoder time: {end_time - start_time} seconds")
125
+ print(low_res_masks.shape)
126
+ # 5. Post-processing
127
+ w, h = orig_image.size
128
+ masks_rknn = []
129
+
130
+ # Process all 3 masks
131
+ for i in range(low_res_masks.shape[1]):
132
+ # Scale the mask to 1024x1024
133
+ masks_1024 = cv2.resize(
134
+ low_res_masks[0,i],
135
+ (1024, 1024),
136
+ interpolation=cv2.INTER_LINEAR
137
+ )
138
+
139
+ # Remove the padding
140
+ new_h = int(h * scale)
141
+ new_w = int(w * scale)
142
+ start_h = (1024 - new_h) // 2
143
+ start_w = (1024 - new_w) // 2
144
+ masks_no_pad = masks_1024[start_h:start_h+new_h, start_w:start_w+new_w]
145
+
146
+ # Scale to the original image size
147
+ mask = cv2.resize(
148
+ masks_no_pad,
149
+ (w, h),
150
+ interpolation=cv2.INTER_LINEAR
151
+ )
152
+
153
+ # Binarize
154
+ mask = mask > 0.0
155
+ masks_rknn.append(mask)
156
+
157
+ # 6. Visualize the results
158
+ plt.figure(figsize=(15, 5))
159
+
160
+ # Get the mask indices sorted by IoU score
161
+ sorted_indices = np.argsort(iou_predictions[0])[::-1] # descending order
162
+
163
+ for idx, mask_idx in enumerate(sorted_indices):
164
+ plt.subplot(1, 3, idx + 1)
165
+ plt.imshow(orig_image)
166
+ plt.imshow(masks_rknn[mask_idx], alpha=0.5)
167
+ plt.plot(input_point_orig[0][0], input_point_orig[0][1], 'rx')
168
+ plt.title(f'Mask {mask_idx+1}\nIoU: {iou_predictions[0][mask_idx]:.3f}')
169
+ plt.axis('off')
170
+
171
+ plt.tight_layout()
172
+ # plt.show()
173
+ plt.savefig("result.png")
174
+
175
+ print(f"\nIoU predictions: {iou_predictions}")
176
+
177
+ if __name__ == "__main__":
178
+ main()
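
Both test scripts rely on the same letterbox arithmetic: the image is scaled to fit a 1024x1024 square, centered on a black canvas, and click coordinates are shifted by the resulting scale and offsets. The following sketch isolates that arithmetic in plain Python. The function names are hypothetical, and the 2048x1024 image size is an assumed example rather than the size of `dog.jpg`.

```python
# Standalone sketch of the letterbox math used by load_image() and main().
# Function names are illustrative; they do not exist in the test scripts.

def letterbox_params(width: int, height: int, target: int = 1024):
    """Scale and centering offsets that fit an image into a target square."""
    scale = min(target / width, target / height)
    new_w = int(width * scale)
    new_h = int(height * scale)
    offset_x = (target - new_w) // 2
    offset_y = (target - new_h) // 2
    return scale, offset_x, offset_y


def to_model_coords(point, scale, offset_x, offset_y):
    """Map a click in original-image pixels to letterboxed model input pixels."""
    x, y = point
    return int(x * scale + offset_x), int(y * scale + offset_y)


def to_image_coords(point, scale, offset_x, offset_y):
    """Inverse mapping: letterboxed model pixels back to original-image pixels."""
    x, y = point
    return (x - offset_x) / scale, (y - offset_y) / scale


# Example with an assumed 2048x1024 image.
scale, off_x, off_y = letterbox_params(2048, 1024)
print(scale, off_x, off_y)                                 # 0.5 0 256
print(to_model_coords((1000, 400), scale, off_x, off_y))   # (500, 456)
```

The post-processing step in the scripts applies the inverse idea to masks: crop the padded region starting at the offsets, then resize back to the original width and height.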