happyme531 committed 1ec6d57 (parent: 95dfa6c): Update README.md

---
license: agpl-3.0
language:
- en
base_model: microsoft/Florence-2-base-ft
tags:
- rknn
---

# Florence-2-base-ft-ONNX-RKNN2

ONNX/RKNN2 deployment of the Florence-2 vision-language multimodal model!

- Inference speed (RKNN2): on an RK3588, a 512x512 image with the `<MORE_DETAILED_CAPTION>` instruction takes ~2 s in total
- Memory usage (RKNN2): about 1.9 GB

## Usage

1. Clone this project locally.

2. Install the dependencies:

```bash
pip install transformers onnxruntime pillow "numpy<2"
```

(`numpy<2` must be quoted so the shell does not treat `<` as a redirection.)

If you want to run inference with rknn, also install rknn-toolkit2-lite2.

3. Fix the project paths.
The tokenizer and preprocessing configuration still come from the original project, so point the corresponding paths in onnx/onnxrun.py or onnx/rknnrun.py at your local copy:

```python
AutoProcessor.from_pretrained(
    "path/to/Florence-2-base-ft-ONNX-RKNN2",
    trust_remote_code=True
)
```

4. Run:

```bash
python onnx/onnxrun.py  # or: python onnx/rknnrun.py
```

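The vision encoder consumes a fixed-size NCHW tensor. The real preprocessing comes from the AutoProcessor config shipped with the original project; the numpy sketch below only illustrates the fixed 512x512 shaping and a CLIP-style normalization, where the mean/std values and nearest-neighbor resize are assumptions for illustration, not read from this repo:

```python
import numpy as np

# Hypothetical stand-in for the AutoProcessor's image preprocessing:
# nearest-neighbor resize to a fixed size, scale to [0, 1], normalize,
# and lay out as NCHW. Mean/std are the common ImageNet values (assumed).
def preprocess(image: np.ndarray, size: int = 512) -> np.ndarray:
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    h, w, _ = image.shape
    rows = np.arange(size) * h // size   # nearest-neighbor row indices
    cols = np.arange(size) * w // size   # nearest-neighbor column indices
    resized = image[rows][:, cols]       # (size, size, 3)
    x = (resized.astype(np.float32) / 255.0 - mean) / std
    return x.transpose(2, 0, 1)[None]    # NCHW: (1, 3, size, size)

x = preprocess(np.zeros((480, 640, 3), dtype=np.uint8))
print(x.shape)  # (1, 3, 512, 512)
```

Whatever preprocessing you use, the output shape must match the `vision_size` the RKNN graph was converted with.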

## RKNN Model Conversion

Install rknn-toolkit2 v2.1.0 or newer first.

```bash
cd onnx
python convert.py all
```

Note: an RKNN model cannot handle arbitrary input lengths at runtime, so the input shapes must be fixed before conversion. You can change them by editing `vision_size`, `vision_tokens`, and `prompt_tokens` in convert.py.

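Because the shapes are frozen at conversion time, a runner has to pad (or truncate) the tokenized prompt to exactly `prompt_tokens` ids before feeding the model. A minimal numpy sketch of that padding, where the pad id of 1 (BART-style `<pad>`) and the length of 20 are assumptions to verify against the actual tokenizer config:

```python
import numpy as np

# Pad/truncate a token id list to the fixed length the RKNN graph was
# built with. pad_id=1 is an assumption (BART-style <pad>); check it
# against the tokenizer config of the original project.
def pad_prompt(token_ids, prompt_tokens=20, pad_id=1):
    ids = np.full((1, prompt_tokens), pad_id, dtype=np.int64)
    mask = np.zeros((1, prompt_tokens), dtype=np.int64)
    n = min(len(token_ids), prompt_tokens)
    ids[0, :n] = token_ids[:n]
    mask[0, :n] = 1
    return ids, mask

ids, mask = pad_prompt([0, 574, 2], prompt_tokens=8)
print(ids.shape, int(mask.sum()))  # (1, 8) 3
```

The attention mask lets the model ignore the padded positions, so padding does not change the result, only the fixed tensor size.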

## Known Issues (RKNN)

- Converting the vision encoder fails with a `Buffer overflow!` error when the input resolution is 640x640 or larger, so the current input is 512x512, which lowers inference quality.
- At the same resolution, inference accuracy is noticeably worse than with onnxruntime; the problem has been traced to the vision encoder. If you need higher accuracy, run that part with onnxruntime instead.
- Vision encoder inference takes ~1.5 s, the largest share of the total, but most of that time is spent in Transpose ops, so there may still be room for optimization.
- The decode phase cannot easily run on the NPU because the KV cache length changes at every step; onnxruntime should be sufficient for it, though.
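The last point can be seen in a toy decode loop: each generated token appends one slot to the key/value cache, so the cache tensor's sequence axis grows every step, which a converter that fixes tensor shapes at build time cannot express directly (head count and dimension below are illustrative, not Florence-2's real config):

```python
import numpy as np

# Toy decode loop showing why the KV cache defeats fixed-shape
# compilation: the sequence axis of the cache grows by one per step.
num_heads, head_dim = 8, 64
kv_cache = np.zeros((1, num_heads, 0, head_dim), dtype=np.float32)
for step in range(4):
    new_kv = np.ones((1, num_heads, 1, head_dim), dtype=np.float32)
    kv_cache = np.concatenate([kv_cache, new_kv], axis=2)
    # at this point the attention input shape differs from the last step
print(kv_cache.shape)  # (1, 8, 4, 64)
```

A common workaround is to pad the cache to a maximum length and mask the unused slots, at the cost of wasted compute; running the decoder with onnxruntime, as this project does, avoids the issue entirely.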

## References

- [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
- [onnx-community/Florence-2-base-ft](https://huggingface.co/onnx-community/Florence-2-base-ft)
- [florence2-webgpu](https://huggingface.co/spaces/Xenova/florence2-webgpu)