Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
Abstract
The surge of Multimodal Large Language Models (MLLMs), given their prominent emergent capabilities in instruction following and reasoning, has greatly advanced the field of visual reasoning. However, constrained by their lossy image tokenization, most MLLMs fall short of comprehensively capturing details of text and objects, especially in high-resolution images. To address this, we propose P2G, a novel framework for plug-and-play grounding of reasoning in MLLMs. Specifically, P2G exploits the tool-usage potential of MLLMs by employing expert agents to achieve on-the-fly grounding of critical visual and textual elements in the image, thus enabling deliberate reasoning via multimodal prompting. We further create P2GB, a benchmark aimed at assessing MLLMs' ability to understand inter-object relationships and text in challenging high-resolution images. Comprehensive experiments on visual reasoning tasks demonstrate the superiority of P2G. Notably, P2G achieves performance comparable to GPT-4V on P2GB with only a 7B backbone. Our work highlights the potential of plug-and-play grounding of reasoning and opens up a promising alternative beyond model scaling.
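To make the agent-based grounding described above concrete, the following is a minimal illustrative sketch of such a plug-and-play loop, not the authors' implementation: the MLLM first decides whether it needs textual or visual grounding, expert agents are invoked on demand, and the grounded evidence is fed back as an enriched prompt. All function names (`call_mllm`, `run_ocr_agent`, `run_grounding_agent`) are hypothetical placeholders.

```python
# Illustrative sketch of a plug-and-play grounding loop for an MLLM.
# All functions below are hypothetical placeholders, not the P2G codebase.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Evidence:
    kind: str                          # "text" (OCR) or "object" (visual grounding)
    content: str                       # recognized text or object label
    bbox: Tuple[int, int, int, int]    # (x1, y1, x2, y2) region in the image


def call_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for querying the backbone MLLM (e.g., a 7B model)."""
    raise NotImplementedError


def run_ocr_agent(image_path: str) -> List[Evidence]:
    """Placeholder for an expert OCR agent that extracts text regions."""
    raise NotImplementedError


def run_grounding_agent(image_path: str, query: str) -> List[Evidence]:
    """Placeholder for an expert detector that localizes queried objects."""
    raise NotImplementedError


def answer_with_grounding(image_path: str, question: str) -> str:
    # 1. Ask the MLLM whether external grounding is needed to answer.
    decision = call_mllm(
        image_path,
        f"Question: {question}\n"
        "Reply GROUND_TEXT, GROUND_OBJECT, or ANSWER_DIRECTLY.",
    )

    # 2. Invoke expert agents on the fly, only when requested.
    evidence: List[Evidence] = []
    if "GROUND_TEXT" in decision:
        evidence += run_ocr_agent(image_path)
    if "GROUND_OBJECT" in decision:
        evidence += run_grounding_agent(image_path, question)

    # 3. Re-prompt the MLLM with the grounded evidence as additional context.
    evidence_block = "\n".join(
        f"[{e.kind}] {e.content} at {e.bbox}" for e in evidence
    )
    final_prompt = (
        f"Question: {question}\n"
        f"Grounded evidence:\n{evidence_block}\n"
        "Answer using the evidence above."
    )
    return call_mllm(image_path, final_prompt)
```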