How can I use this for inference on my local machine, using Python?
I have been using Qwen2-VL-2B-Instruct for OCR work on my local machine, which has an RTX 4090. I want to see whether ONNX Runtime improves the inference speed, and my goal is to eventually deploy the model on a server. I have tried several ways to use the ONNX version but have not gotten it to work.
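For reference, my current (non-ONNX) baseline follows the standard usage from the Qwen2-VL model card, roughly like this (a minimal sketch; the image path and prompt are just placeholders):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2-VL examples

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "sample_page.png"},  # placeholder image path
        {"type": "text", "text": "Extract all the text from this image."},
    ],
}]

# Build the chat prompt and the vision inputs, then run generation on the GPU
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
trimmed = generated[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

This works fine on the 4090; it is the ONNX Runtime equivalent of this pipeline that I cannot get running.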
Apart from this, I also converted the model to ONNX myself, which produced the file structure below:
output/
    merges.txt
    added_tokens.json
    config.json
    chat_template.json
    special_tokens_map.json
    tokenizer.json
    vocab.json
    tokenizer_config.json
    preprocessor_config.json
    generation_config.json
    onnx/
        embed_tokens.onnx
        decoder_model_merged.onnx
        vision_encoder.onnx_data
        decoder_model_merged.onnx_data
        vision_encoder.onnx
Can I use this generated ONNX model for inference as well?
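To make the question concrete, the kind of starting point I have in mind is simply creating ONNX Runtime sessions for the exported graphs and checking their expected inputs (a minimal sketch; the output/ paths and the availability of the CUDA execution provider are assumptions on my side):

```python
import onnxruntime as ort
from transformers import AutoProcessor

# The tokenizer/processor files were saved to output/ during the export
processor = AutoProcessor.from_pretrained("output")

# Prefer CUDA, fall back to CPU if the GPU provider is not available
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

vision_encoder = ort.InferenceSession("output/onnx/vision_encoder.onnx", providers=providers)
embed_tokens = ort.InferenceSession("output/onnx/embed_tokens.onnx", providers=providers)
decoder = ort.InferenceSession("output/onnx/decoder_model_merged.onnx", providers=providers)

# Print the input names and shapes each graph expects, to work out how to feed them
for name, session in [("vision_encoder", vision_encoder),
                      ("embed_tokens", embed_tokens),
                      ("decoder_model_merged", decoder)]:
    print(name, [(i.name, i.shape) for i in session.get_inputs()])
```

The part I'm missing is how to combine these sessions into a full generation loop (feeding the vision encoder output into the decoder together with the token embeddings and handling the past key/value inputs of the merged decoder), so any pointers or a working example would be much appreciated.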