--- license: apache-2.0 pipeline_tag: image-text-to-text library_name: transformers --- # Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Jiwan Chung^* Junhyeok Kim^* Siyeol Kim Jaeyoung Lee Minsoo Kim Youngjae Yu

[![arXiv](https://img.shields.io/badge/arXiv-2505.18842-b31b1b.svg)](https://arxiv.org/abs/2505.18842) [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-kjunh/v1--7B-FFD21E)](https://huggingface.co/kjunh/v1-7B)

## Installation ```bash conda create -n v1 python=3.10 -y conda activate v1 pip install -r requirements.txt pip install flash-attn --no-build-isolation ``` ## Demo ### Gradio Web UI Highly Recommended as the copy tokens are displayed on image.

```bash python run_gradio.py ``` ### Inference ```bash python inference.py ``` The script uses a default image URL and text prompt. To use your own inputs, you can modify the `image` variable within the `messages` list and the `text` field for the user prompt. ## Coming Soon - [x] Inference code - [ ] Training data - [ ] Evaluation code - [ ] Training code ## Citation If you find our work valuable, please cite: ```bibtex @misc{chung2025dontlookoncemultimodal, title={Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation}, author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu}, year={2025}, eprint={2505.18842}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2505.18842}, } ```