--- library_name: transformers pipeline_tag: image-text-to-text base_model: - Qwen/Qwen2.5-VL-3B-Instruct --- # ViGoRL: Visually Grounded Reinforcement Learning for Visual Reasoning This model card describes the ViGoRL (**Vi**sually **G**r**o**unded **R**einforcement **L**earning) model, introduced in our paper ["Grounded Reinforcement Learning for Visual Reasoning"](https://arxiv.org/abs/2505.23678). **Authors:** Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki --- ## Model Overview ViGoRL is a vision-language model fine-tuned using reinforcement learning (RL) to explicitly anchor textual reasoning steps to visual coordinates. Inspired by human visual cognition, ViGoRL employs multi-turn visual grounding, dynamically zooming into image regions to perform fine-grained visual reasoning and grounding. This model was trained using supervised fine-tuning (SFT) on visually-grounded reasoning traces generated via Monte Carlo Tree Search (MCTS), followed by reinforcement learning with Group Relative Policy Optimization (GRPO). --- ## Model Details * **Base Architecture:** Qwen2.5-Vision-Language (3B or 7B parameters) * **Training Paradigm:** * Supervised Fine-Tuning on MCTS-generated reasoning traces * Group Relative Policy Optimization (GRPO) * Multi-turn visual grounding with dynamic zoom-in feedback (if "Multiturn" appears in name) --- ## Use Cases This model excels in visual reasoning tasks that require precise visual grounding and region-level reasoning. Please see model name for specific domain. * **Spatial Reasoning:** SAT-2, BLINK, RoboSpatial * **Visual Search:** V\*Bench * **Web Interaction and Grounding:** ScreenSpot (Pro and V2), VisualWebArena --- ## Usage You can load this model easily using Hugging Face's Transformers library: ```python from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor from qwen_vl_utils import process_vision_info import torch # # default: Load the model on the available device(s) # model = Qwen2_5_VLForConditionalGeneration.from_pretrained( # "gsarch/ViGoRL-Multiturn-3b-Visual-Search", torch_dtype="auto", device_map="auto" # ) # replace with any of the ViGoRL models # We recommend enabling flash_attention_2 for better acceleration and memory saving. model = Qwen2_5_VLForConditionalGeneration.from_pretrained( "gsarch/ViGoRL-Multiturn-3b-Visual-Search", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", device_map="auto", ) # default processer processor = AutoProcessor.from_pretrained("gsarch/ViGoRL-Multiturn-3b-Visual-Search") # The default range for the number of visual tokens per image in the model is 4-16384. # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost. # min_pixels = 256*28*28 # max_pixels = 1280*28*28 # processor = AutoProcessor.from_pretrained("gsarch/ViGoRL-Multiturn-3b-Visual-Search", min_pixels=min_pixels, max_pixels=max_pixels) # messages = [ # { # "role": "user", # "content": [ # { # "type": "image", # "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", # }, # {"type": "text", "text": "What color is the leash."}, # ], # } # ] messages = [ { "role": "user", "content": [ { "type": "image", "image": "path/to/image.png", }, {"type": "text", "text": "QUERY HERE"}, ], } ] # Preparation for inference text = processor.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) image_inputs, video_inputs = process_vision_info(messages) inputs = processor( text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt", ) inputs = inputs.to("cuda") # Inference: Generation of the output generated_ids = model.generate(**inputs, max_new_tokens=512) generated_ids_trimmed = [ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) ] output_text = processor.batch_decode( generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False ) print(output_text) # this will output a single tool call turn of the model if version is multiturn. # Example output of gsarch/ViGoRL-Multiturn-3b-Visual-Search: [' The leash appears to be red, as seen near the dog\'s paw and the person\'s hand. (1028, 1093). \n\n{"name": "search_coordinate", "arguments": {"coordinate": [1028, 1093]}}\n'] ``` **Important**: This model requires a system prompt for proper usage. Please see the model's chat template for details. --- ## Datasets and Training Data Training datasets and generated reasoning chains are publicly available: * [Code](https://github.com/Gabesarch/grounded-rl) * [ViGoRL Datasets on Hugging Face](https://huggingface.co/datasets/gsarch/vigorl_datasets) --- ## Citation If you use ViGoRL in your research or applications, please cite our paper: ```bibtex @article{sarch2025vigorl, title={Grounded Reinforcement Learning for Visual Reasoning}, author={Sarch, Gabriel and Saha, Snigdha and Khandelwal, Naitik and Jain, Ayush and Tarr, Michael J and Kumar, Aviral and Fragkiadaki, Katerina}, year={2025} } ``` --- ## Contact For questions, feedback, or collaborations, please reach out to Gabriel Sarch or open an issue in our [GitHub repository](https://github.com/Gabesarch/grounded-rl). ---