---
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---
# ViGoRL: Visually Grounded Reinforcement Learning for Visual Reasoning
This model card describes the ViGoRL (**Vi**sually **G**r**o**unded **R**einforcement **L**earning) model, introduced in our paper ["Grounded Reinforcement Learning for Visual Reasoning"](https://arxiv.org/abs/2505.23678).
**Authors:** Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki
---
## Model Overview
ViGoRL is a vision-language model fine-tuned using reinforcement learning (RL) to explicitly anchor textual reasoning steps to visual coordinates. Inspired by human visual cognition, ViGoRL employs multi-turn visual grounding, dynamically zooming into image regions to perform fine-grained visual reasoning and grounding.
This model was trained using supervised fine-tuning (SFT) on visually-grounded reasoning traces generated via Monte Carlo Tree Search (MCTS), followed by reinforcement learning with Group Relative Policy Optimization (GRPO).
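GRPO removes the need for a learned value function by computing advantages relative to a group of sampled responses: for each prompt, several reasoning traces are sampled, scored by a reward, and each trace's reward is normalized by the group's mean and standard deviation. The sketch below is a generic PyTorch illustration of that advantage computation only (not the project's training code); the clipped policy-gradient objective and KL regularization that complete the GRPO update are omitted.
```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each sampled response's reward is normalized
    against the mean and standard deviation of its own group, i.e., the set of
    responses sampled for the same prompt.

    rewards: tensor of shape [num_prompts, group_size]
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled reasoning traces each, with binary task rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```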
---
## Model Details
* **Base Architecture:** Qwen2.5-VL (3B or 7B parameters)
* **Training Paradigm:**
* Supervised Fine-Tuning on MCTS-generated reasoning traces
* Group Relative Policy Optimization (GRPO)
* Multi-turn visual grounding with dynamic zoom-in feedback (for models with "Multiturn" in the name)
---
## Use Cases
This model excels at visual reasoning tasks that require precise visual grounding and region-level reasoning. The model name indicates the specific domain it was trained for:
* **Spatial Reasoning:** SAT-2, BLINK, RoboSpatial
* **Visual Search:** V\*Bench
* **Web Interaction and Grounding:** ScreenSpot (Pro and V2), VisualWebArena
---
## Usage
You can load this model easily using Hugging Face's Transformers library:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
# Default: load the model on the available device(s)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
# "gsarch/ViGoRL-Multiturn-3b-Visual-Search", torch_dtype="auto", device_map="auto"
# ) # replace with any of the ViGoRL models
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"gsarch/ViGoRL-Multiturn-3b-Visual-Search",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
)
# Default processor
processor = AutoProcessor.from_pretrained("gsarch/ViGoRL-Multiturn-3b-Visual-Search")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("gsarch/ViGoRL-Multiturn-3b-Visual-Search", min_pixels=min_pixels, max_pixels=max_pixels)
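# Example query (it matches the sample output printed at the end of this snippet):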
# messages = [
# {
# "role": "user",
# "content": [
# {
# "type": "image",
# "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
# },
# {"type": "text", "text": "What color is the leash."},
# ],
# }
# ]
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "path/to/image.png",
},
{"type": "text", "text": "QUERY HERE"},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text) # For Multiturn variants, this prints a single tool-call turn of the model.
# Example output of gsarch/ViGoRL-Multiturn-3b-Visual-Search: ['<think> The leash appears to be red, as seen near the dog\'s paw and the person\'s hand. (1028, 1093). </think>\n<tool_call>\n{"name": "search_coordinate", "arguments": {"coordinate": [1028, 1093]}}\n</tool_call>']
```
**Important**: This model requires a system prompt for proper usage. Please see the model's chat template for details.
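For Multiturn variants, generation ends in a `<tool_call>` block like the example output above. The exact tool schema, coordinate convention (e.g., whether coordinates refer to the original or processor-resized image), and zoom-in behavior are defined by the model's chat template and the accompanying code repository. Continuing from the snippet above, the following is only a minimal sketch of how one might parse the emitted `search_coordinate` call, crop a window around the predicted coordinate, and append it as the next user turn before generating again; the regex, crop size, and continuation message format are illustrative assumptions, not the official ViGoRL inference loop.

```python
import json
import re

from PIL import Image

def parse_tool_call(model_output: str):
    """Extract the JSON payload of a <tool_call>...</tool_call> block, if present."""
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", model_output, re.DOTALL)
    return json.loads(match.group(1)) if match else None

tool_call = parse_tool_call(output_text[0])
if tool_call is not None and tool_call.get("name") == "search_coordinate":
    x, y = tool_call["arguments"]["coordinate"]

    # Hypothetical zoom-in: crop a fixed-size window around the predicted coordinate.
    # The window size (and any resizing) used during ViGoRL training may differ.
    image = Image.open("path/to/image.png")
    half = 200  # assumed half-width of the zoom window, in pixels
    region = image.crop((
        max(x - half, 0),
        max(y - half, 0),
        min(x + half, image.width),
        min(y + half, image.height),
    ))
    region.save("zoomed_region.png")

    # Append the model's turn and the zoomed-in observation, then rerun the
    # prepare/generate steps above on the extended `messages` list.
    messages.append({"role": "assistant", "content": [{"type": "text", "text": output_text[0]}]})
    messages.append({"role": "user", "content": [{"type": "image", "image": "zoomed_region.png"}]})
```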
---
## Datasets and Training Data
Training datasets and generated reasoning chains are publicly available:
* [Code](https://github.com/Gabesarch/grounded-rl)
* [ViGoRL Datasets on Hugging Face](https://huggingface.co/datasets/gsarch/vigorl_datasets)
---
## Citation
If you use ViGoRL in your research or applications, please cite our paper:
```bibtex
@article{sarch2025vigorl,
  title={Grounded Reinforcement Learning for Visual Reasoning},
  author={Sarch, Gabriel and Saha, Snigdha and Khandelwal, Naitik and Jain, Ayush and Tarr, Michael J and Kumar, Aviral and Fragkiadaki, Katerina},
  journal={arXiv preprint arXiv:2505.23678},
  year={2025}
}
```
---
## Contact
For questions, feedback, or collaborations, please reach out to Gabriel Sarch or open an issue in our [GitHub repository](https://github.com/Gabesarch/grounded-rl).
---