arxiv:2403.19322

Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models

Published on Mar 28, 2024

Abstract

The surge of Multimodal Large Language Models (MLLMs), given their prominent emergent capabilities in instruction following and reasoning, has greatly advanced the field of visual reasoning. However, constrained by their non-lossless image tokenization, most MLLMs fall short of comprehensively capturing details of text and objects, especially in high-resolution images. To address this, we propose P2G, a novel framework for plug-and-play grounding of reasoning in MLLMs. Specifically, P2G exploits the tool-usage potential of MLLMs to employ expert agents for on-the-fly grounding of critical visual and textual objects in the image, thus achieving deliberate reasoning via multimodal prompting. We further create P2GB, a benchmark designed to assess MLLMs' ability to understand inter-object relationships and text in challenging high-resolution images. Comprehensive experiments on visual reasoning tasks demonstrate the superiority of P2G. Notably, P2G achieves performance comparable to GPT-4V on P2GB with only a 7B backbone. Our work highlights the potential of plug-and-play grounding of reasoning and opens up a promising alternative beyond model scaling.
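
The abstract describes a tool-use loop: the MLLM decides when it needs finer-grained evidence, expert agents ground the relevant text or objects on the fly, and the retrieved evidence is fed back as a multimodal prompt for deliberate reasoning. The sketch below illustrates that loop only at a conceptual level; the function names (ask_mllm, ocr_agent, grounding_agent), the probe-prompt wording, and the Region structure are illustrative assumptions, not the paper's actual interfaces.

```python
# Minimal, hypothetical sketch of a plug-and-play grounding loop in the spirit
# of the abstract. All interfaces here are assumptions for illustration.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    """A grounded region returned by an expert agent: a box plus its content."""
    bbox: tuple    # (x0, y0, x1, y1) in image coordinates
    content: str   # OCR'd text or an object label for the region


def crop(image, bbox):
    """Placeholder crop; with a PIL.Image this would be image.crop(bbox)."""
    return image


def plug_and_play_grounding(
    image,
    question: str,
    ask_mllm: Callable[[str, list], str],                     # (prompt, images) -> answer text
    ocr_agent: Callable[[object], List[Region]],              # expert agent for textual content
    grounding_agent: Callable[[object, str], List[Region]],   # expert agent for relevant objects
) -> str:
    # 1) First pass: let the MLLM answer directly or request grounding.
    probe = ask_mllm(
        f"Question: {question}\n"
        "If you can answer from the image alone, answer directly. "
        "Otherwise reply with NEED_TEXT or NEED_OBJECTS.",
        [image],
    )
    if "NEED_TEXT" not in probe and "NEED_OBJECTS" not in probe:
        return probe

    # 2) On-the-fly grounding: call only the expert agents that were requested.
    regions: List[Region] = []
    if "NEED_TEXT" in probe:
        regions += ocr_agent(image)                  # dense text in high-resolution images
    if "NEED_OBJECTS" in probe:
        regions += grounding_agent(image, question)  # question-relevant objects

    # 3) Deliberate reasoning via multimodal prompting: feed the grounded
    #    evidence (crops plus transcribed content) back to the same MLLM.
    evidence = "\n".join(f"- region {r.bbox}: {r.content}" for r in regions)
    crops = [crop(image, r.bbox) for r in regions]
    return ask_mllm(
        f"Question: {question}\nGrounded evidence:\n{evidence}\nAnswer step by step.",
        [image, *crops],
    )
```

Because the expert agents are passed in as plain callables, the grounding modules stay "plug-and-play": they can be swapped or upgraded without retraining the backbone MLLM, which is the alternative to model scaling the abstract argues for.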
