Could you publish results compared to Sonnet 3.5?
In my personal tests, Sonnet 3.5 is the SoTA model for describing figures, graphs, etc.
We do non-profit work for children with varying degrees of learning difficulties (e.g. converting diagrams and blackboard photos to text), and Anthropic does not yet have a consistent way to support such non-profits with API credits outside of some competitions, which our limited developer time keeps us from entering. So it would be extremely useful to know whether your model performs better!
Thanks for your advice, I will add Sonnet 3.5 to the comparison table.
I have added Sonnet 3.5 to the comparison table for the 76B model; see here:
https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B#image-benchmarks
Thank you! Great work on the model btw.
A bit off topic: what's your stance on models like VLM2Vec compared to VLMs? In an AI system it might make sense to route more complex queries to chunked visual embeddings, but use a VLM as a judge with guided_choice. A rough sketch of what I mean is below. What's your take on the matter?
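For concreteness, here is a minimal sketch of that routing idea. `embed_image_chunks` and `embed_text` are hypothetical stand-ins for a VLM2Vec-style embedder (not a real API), and the judge call assumes a vLLM OpenAI-compatible endpoint where `guided_choice` can be passed via `extra_body`; in practice the selected chunk would also be attached as image content in the message.

```python
# Sketch: retrieve the most relevant image chunk via embeddings,
# then have a VLM give a constrained verdict over a fixed label set.
import numpy as np
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM) serving the VLM.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def embed_image_chunks(image_path: str) -> np.ndarray:
    """Hypothetical: tile the image and return one embedding per chunk, shape (n_chunks, dim)."""
    raise NotImplementedError


def embed_text(query: str) -> np.ndarray:
    """Hypothetical: embed the text query into the same space, shape (dim,)."""
    raise NotImplementedError


def route_and_judge(image_path: str, query: str, choices: list[str]) -> str:
    # 1) Retrieval step: score chunks against the query with cosine similarity.
    chunk_embs = embed_image_chunks(image_path)
    q = embed_text(query)
    sims = chunk_embs @ q / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(q) + 1e-8
    )
    best_chunk = int(sims.argmax())

    # 2) Judge step: the VLM answers, restricted to `choices` via guided decoding.
    resp = client.chat.completions.create(
        model="OpenGVLab/InternVL2-Llama3-76B",
        messages=[{
            "role": "user",
            "content": f"For chunk {best_chunk} of the image, answer: {query}",
        }],
        extra_body={"guided_choice": choices},  # structured-output option in vLLM
    )
    return resp.choices[0].message.content
```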