---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
Jiwan Chung*
Junhyeok Kim*
Siyeol Kim
Jaeyoung Lee
Minsoo Kim
Youngjae Yu
[](https://arxiv.org/abs/2505.18842) [](https://huggingface.co/kjunh/v1-7B)
## Installation
```bash
conda create -n v1 python=3.10 -y
conda activate v1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
## Demo
### Gradio Web UI
Highly Recommended as the copy tokens are displayed on image.
```bash
python run_gradio.py
```
### Inference
```bash
python inference.py
```
The script uses a default image URL and text prompt. To use your own inputs, you can modify the `image` variable within the `messages` list and the `text` field for the user prompt.
## Coming Soon
- [x] Inference code
- [ ] Training data
- [ ] Evaluation code
- [ ] Training code
## Citation
If you find our work valuable, please cite:
```bibtex
@misc{chung2025dontlookoncemultimodal,
title={Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation},
author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu},
year={2025},
eprint={2505.18842},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.18842},
}
```