Florence-2-base-ft-ONNX-RKNN2
(For the English README, see below.)
Deploy the Florence-2 vision multi-modal large model with ONNX/RKNN2!
- Inference Speed (RKNN2): On RK3588, inferring a 768x768 image with the `<MORE_DETAILED_CAPTION>` instruction takes ~4 seconds in total.
- Memory Usage (RKNN2): About 2GB.
Usage
Clone the project locally (on your development board).
Install dependencies:
pip install transformers onnxruntime pillow "numpy<2" rknn-toolkit-lite2
- Run:
cd onnx
python ./run.py ./test.jpg "<MORE_DETAILED_CAPTION>"
You can modify the import at the top of `run.py` to switch between ONNX and RKNN inference.
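As a rough sketch of what that switch amounts to, the snippet below loads the vision encoder with either backend; the file names and input layout are assumptions for illustration, not the actual contents of `run.py`:

```python
import numpy as np

USE_RKNN = True  # set to False to fall back to onnxruntime

if USE_RKNN:
    # NPU path on RK3588 (rknn-toolkit-lite2)
    from rknnlite.api import RKNNLite
    vision = RKNNLite()
    vision.load_rknn("vision_encoder.rknn")                 # assumed file name
    vision.init_runtime()
    run_vision = lambda pixels: vision.inference(inputs=[pixels])[0]
else:
    # Portable CPU path (onnxruntime)
    import onnxruntime as ort
    session = ort.InferenceSession("vision_encoder.onnx")   # assumed file name
    input_name = session.get_inputs()[0].name
    run_vision = lambda pixels: session.run(None, {input_name: pixels})[0]

# Example: a 512x512 RGB image in NCHW float32 layout (shape is an assumption)
features = run_vision(np.zeros((1, 3, 512, 512), dtype=np.float32))
```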
RKNN Model Conversion
You need to install rknn-toolkit2 v2.3.2 or a higher version beforehand.
cd onnx
python convert.py all
Note: RKNN models do not support arbitrary-length inputs at runtime, so the input shapes must be decided in advance. You can change them by editing `vision_size`, `vision_tokens`, and `prompt_tokens` in `convert.py`.
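For reference, converting an ONNX model to RKNN with a fixed input shape looks roughly like the sketch below (using the rknn-toolkit2 API); the model file, input name, and the way `convert.py` actually uses `vision_size` are assumptions:

```python
from rknn.api import RKNN

vision_size = 512  # assumed value of the convert.py variable mentioned above

rknn = RKNN()
rknn.config(target_platform="rk3588")
# The input shape is frozen into the graph here; it cannot change at runtime.
rknn.load_onnx(model="vision_encoder.onnx",
               inputs=["pixel_values"],                            # assumed input name
               input_size_list=[[1, 3, vision_size, vision_size]])
rknn.build(do_quantization=False)
rknn.export_rknn("vision_encoder.rknn")
rknn.release()
```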
Known Issues (rknn)
- (Resolved) When converting the vision encoder, input resolutions of >=640x640 caused a "Buffer overflow!" error and the conversion failed, so a 512x512 input was used, which degraded inference quality.
- (Resolved, though a slight loss of precision still seems to remain) At the same resolution, inference precision was significantly lower than with onnxruntime; the problem was traced to the vision encoder. If you need higher precision, this part can be swapped for onnxruntime.
- (Optimized) Vision encoder inference took ~1.5 seconds, the largest share of the total time, and most of it was spent on Transpose, so there may still be room for optimization.
- In the decode stage, because the length of the kvcache keeps changing, it does not seem possible to simply run it on the NPU. onnxruntime should be sufficient here, though; see the sketch after this list.
- In theory, rkllm could be used to run the decoder, but this is not feasible for now because rkllm lacks support for fixed positional encoding. See: https://github.com/airockchip/rknn-llm/issues/296
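A minimal sketch of what decoding with onnxruntime and a growing kvcache could look like, assuming a decoder exported with past key/value inputs; the tensor names, shapes, and the single-layer cache are made-up simplifications, not the actual export:

```python
import numpy as np
import onnxruntime as ort

decoder = ort.InferenceSession("decoder_with_past.onnx")  # assumed file name

# Hypothetical single-layer cache: (batch, heads, seq_len, head_dim)
past_k = np.zeros((1, 12, 0, 64), dtype=np.float32)
past_v = np.zeros((1, 12, 0, 64), dtype=np.float32)
token = np.array([[2]], dtype=np.int64)  # placeholder start token id

for _ in range(32):
    logits, past_k, past_v = decoder.run(
        None,
        {"input_ids": token, "past_key": past_k, "past_value": past_v},  # assumed names
    )
    token = logits[:, -1:].argmax(-1).astype(np.int64)
    # The cache gains one position per step, so the decoder's input shapes
    # differ on every call. A fixed-shape RKNN graph cannot express this,
    # while onnxruntime handles the dynamic sequence axis natively.
```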
English README
Deploy Florence-2 vision multi-modal large model with ONNX/RKNN2!
- Inference Speed (RKNN2): On RK3588, inferring a 768x768 image with the `<MORE_DETAILED_CAPTION>` instruction takes ~4 seconds in total.
- Memory Usage (RKNN2): About 2GB.
Usage
Clone the project locally (on your development board).
Install dependencies:
pip install transformers onnxruntime pillow "numpy<2" rknn-toolkit-lite2
- Run:
cd onnx
python ./run.py ./test.jpg "<MORE_DETAILED_CAPTION>"
You can modify the import at the top of `run.py` to switch between ONNX and RKNN inference.
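As a rough illustration of the two backends behind that import switch (file names and input layout below are assumptions, not the repository's actual code):

```python
import numpy as np

pixels = np.zeros((1, 3, 512, 512), dtype=np.float32)  # example NCHW input

# ONNX backend (onnxruntime, runs on CPU):
import onnxruntime as ort
session = ort.InferenceSession("vision_encoder.onnx")        # assumed file name
onnx_out = session.run(None, {session.get_inputs()[0].name: pixels})

# RKNN backend (RK3588 NPU, via rknn-toolkit-lite2):
from rknnlite.api import RKNNLite
rknn = RKNNLite()
rknn.load_rknn("vision_encoder.rknn")                        # assumed file name
rknn.init_runtime()
rknn_out = rknn.inference(inputs=[pixels])
```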
RKNN Model Conversion
You need to install rknn-toolkit2 v2.3.2 or a higher version beforehand.
cd onnx
python convert.py all
Note: RKNN models do not support dynamic input shapes at runtime, so you need to define the input shapes in advance. You can modify `vision_size`, `vision_tokens`, and `prompt_tokens` in `convert.py` to change the input shapes.
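For illustration, fixing one of the text-side shapes with rknn-toolkit2 could look like this; the model file, input name, and the exact way `convert.py` applies `prompt_tokens` are assumptions:

```python
from rknn.api import RKNN

prompt_tokens = 32  # assumed fixed prompt length (the convert.py variable above)

rknn = RKNN()
rknn.config(target_platform="rk3588")
# The sequence length is baked into the graph; at runtime the prompt must be
# padded or truncated to exactly prompt_tokens tokens.
rknn.load_onnx(model="prompt_encoder.onnx",
               inputs=["input_ids"],               # assumed input name
               input_size_list=[[1, prompt_tokens]])
rknn.build(do_quantization=False)
rknn.export_rknn("prompt_encoder.rknn")
rknn.release()
```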
Known Issues (rknn)
- (Resolved) When converting the vision encoder, an input resolution of >=640x640 would cause a "Buffer overflow!" error and fail the conversion. Therefore, a 512x512 input was used, which degraded inference quality.
- (Resolved, though there might still be a slight loss of precision) Inference precision was significantly lower compared to onnxruntime at the same resolution. The issue was identified in the vision encoder part. For higher precision requirements, onnxruntime could be used for this part.
- (Optimized) Vision encoder inference took 1.5 seconds, accounting for the largest portion of time, with most of it spent on Transpose operations. There might be room for further optimization.
- In the decode stage, due to the constantly changing length of the kvcache, it seems that NPU inference cannot simply be used. However, onnxruntime should be sufficient; a minimal illustration follows this list.
- Theoretically, rkllm could be used for decoder inference, but it's not currently feasible due to the lack of support for fixed positional encoding in rkllm. See: https://github.com/airockchip/rknn-llm/issues/296
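To make the shape problem concrete, here is a minimal, self-contained illustration of how the cache grows each decoding step (layer count, head count, and head dimension are made-up numbers):

```python
import numpy as np

# Hypothetical kvcache for one decoder layer: (batch, heads, seq_len, head_dim)
cache_k = np.zeros((1, 12, 0, 64), dtype=np.float32)

for step in range(3):
    new_k = np.random.rand(1, 12, 1, 64).astype(np.float32)  # key for the new token
    cache_k = np.concatenate([cache_k, new_k], axis=2)        # seq_len grows by one
    print(step, cache_k.shape)  # (1, 12, 1, 64), then (1, 12, 2, 64), (1, 12, 3, 64)

# An RKNN graph is compiled for a single fixed seq_len, so it cannot accept these
# changing shapes, whereas an ONNX model with a dynamic sequence axis can.
```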