---
license: mit
datasets:
- AI4Math/MathVista
- MathLLMs/MathVision
- AI4Math/MathVerse
- Racktic/dynamath
- lscpku/LogicVista
- nyu-visionx/CV-Bench
- nyu-visionx/VSI-Bench
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
---
# M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
Technical Report ｜ 🤗 Hugging Face ｜ 🤖 ModelScope
## Introduction
We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data sources, and task-specific rewards that deliver tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.
## Updates
- [2025.07.07] 🔥 We release M2-Reasoning on 🤗 Hugging Face and 🤖 ModelScope.
## Key Features
- A High-quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates large volumes of high-quality reasoning data (294.2K samples in total).
- A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation to provide tailored incentive signals (a toy sketch of such verifiable rewards follows this list).
- Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM uniquely engineered for both abstract and spatial reasoning. Extensive evaluations on 8 distinct benchmarks demonstrate that, by leveraging our custom data and training pipelines, M2-Reasoning establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
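To make the task-specific reward idea concrete, here is a minimal, hypothetical sketch of verifiable rewards in the RLVR spirit. It is our own illustration, not the exact formulation used to train M2-Reasoning; it assumes answers end with the `\boxed{}` convention used by the model's prompt format, with exact matching for general (math/logic) answers and graded credit for numeric spatial estimates.

```python
# Illustrative sketch only -- not the authors' reward implementation.
import re

def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a model response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def general_reward(response: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the boxed answer matches the label exactly."""
    pred = extract_boxed(response)
    return 1.0 if pred is not None and pred == gold.strip() else 0.0

def spatial_reward(response: str, gold: float, tol: float = 0.10) -> float:
    """Soft reward for numeric spatial estimates: full credit within `tol` relative
    error, decaying linearly to zero at 2 * tol."""
    pred = extract_boxed(response)
    try:
        value = float(pred)
    except (TypeError, ValueError):
        return 0.0
    rel_err = abs(value - gold) / max(abs(gold), 1e-6)
    if rel_err <= tol:
        return 1.0
    return max(0.0, 1.0 - (rel_err - tol) / tol)

# Example: general_reward("... \\boxed{C. 22.6}", "C. 22.6") -> 1.0
#          spatial_reward("... \\boxed{22.4}", 22.6)         -> 1.0 (within 10%)
```

The point of splitting rewards by task is simply that exact matching suits discrete answers while estimation tasks benefit from graded credit; the actual reward design in the technical report may differ.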
## Evaluation
We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary capability they measure:
- General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.
Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
---|---|---|---|---|---|---|---|
Base-Scale General Models | |||||||
InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
Base-Scale Reasoning Models | |||||||
WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | 48.0 | 51.2 | 44.3 (+4.2) |
Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
VLAA-Thinker-7B | 68.0 | 26.4 | 48.2 | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
URSA-8B-PS-GRPO | 67.8 | 31.8 | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
Our Models | |||||||
Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
M2-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
M2-Reasoning-7B | 75.0 | 31.5 | 44.7 | 26.8 | 41.8 | 50.0 | 45.0 (+9.5) |
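As a reading aid for the table above, our interpretation (inferred from the numbers, not an official definition) is that the parenthesized Δ is the average-score gain of a reasoning-tuned model over its corresponding base model: the Base Model row for our models, and Qwen2.5-VL-7B for reasoning models built on it. A quick consistency check:

```python
# Consistency check of our reading of the "Avg. (Δ)" column:
# Δ = model average minus the average of its corresponding base model.
# All values are copied from the table above.
checks = {
    # model name: (reported avg, corresponding base avg, reported Δ)
    "M2-Reasoning-7B":    (45.0, 35.5, 9.5),   # base: the "Base Model" row
    "M2-Reasoning-CI-7B": (42.9, 35.5, 7.4),
    "WeThink-VL-7B":      (44.3, 40.1, 4.2),   # base: Qwen2.5-VL-7B
}
for name, (avg, base_avg, delta) in checks.items():
    assert round(avg - base_avg, 1) == delta, f"{name}: {avg - base_avg:.1f} != {delta}"
print("Δ matches 'model average minus base-model average' for these rows.")
```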
- Spatial Reasoning: To evaluate this capability, we use two benchmarks: CV-Bench and VSI-Bench.
- CV-Bench:

Models | Count | Relation | Depth | Distance | Avg. |
---|---|---|---|---|---|
Large-Scale Models | | | | | |
GPT-4O | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
Base-Scale Models | | | | | |
InternVL3-8B | 74.0 | 90.6 | 84.3 | 81.0 | 82.0 |
Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
LLava-NEXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
Our Models | | | | | |
M2-Reasoning-7B | 66.6 | 92.8 | 89.3 | 84.3 | 82.3 |

- VSI-Bench:

Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
---|---|---|---|---|---|---|---|---|---|
Large-Scale Models | | | | | | | | | |
Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
GPT-4O | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
Base-Scale Models | | | | | | | | | |
InternVL3-8B | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 | 42.1 |
Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
Qwen2.5-VL-7B-Instruct | 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
LLava-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 | 35.6 |
Our Models | | | | | | | | | |
M2-Reasoning-7B | 41.0 | 34.0 | 60.9 | 55.4 | 40.7 | 47.3 | 29.9 | 28.8 | 42.3 |

Column abbreviations (VSI-Bench tasks): OC = object count, AD = absolute distance, OS = object size, RS = room size, RDs = relative distance, RDr = relative direction, RP = route planning, AO = appearance order.
## Installation
Please download our model following Model Downloads (for example, via `huggingface_hub`, as sketched below), then refer to the following code to run the M2-Reasoning model.
The basic environment is `python=3.10`, `torch=2.6.0+cu124`, and `transformers=4.49.0`.
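As one way to fetch the checkpoint, the sketch below pulls it from the Hugging Face Hub with `huggingface_hub`. This is a convenience snippet of our own, not the official download route; it reuses the `inclusionAI/M2-Reasoning` repo id that the example script below defaults to.

```python
# Hypothetical download sketch: fetch the checkpoint from the Hugging Face Hub.
# Assumes the `huggingface_hub` package is installed.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="inclusionAI/M2-Reasoning",  # model repository on Hugging Face
    local_dir="./M2-Reasoning",          # where to store the weights locally
)
print(f"Model files downloaded to: {local_dir}")
```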
## Example Usage
We provide a small example of how to use this repo below.
```python
import os
import torch
from transformers import (
    AutoProcessor,
    AutoTokenizer,
)
import warnings
import argparse

from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")


class BailingMMInfer:
    def __init__(self,
                 model_name_or_path,
                 device="cuda",
                 max_pixels=None,
                 min_pixels=None,
                 video_max_pixels=768 * 28 * 28,
                 video_min_pixels=128 * 28 * 28,
                 generation_config=None
                 ):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.device = device
        self.device_map = device
        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()

        # Optional overrides for the processor's image-resolution limits.
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels

        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9
            }
        self.generation_config = generation_config

    def load_model_processor(self):
        # Load the model in bfloat16 with FlashAttention-2 and switch to eval mode.
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2"
        ).eval()

        tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, add_bos_token=True, trust_remote_code=True)
        processor = Bailing_qwen2_5Processor.from_pretrained(self.model_name_or_path, trust_remote_code=True)
        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        # Build the chat prompt and collect image/video inputs from the messages.
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )
        image_inputs, video_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # print(inputs)
        print(self.tokenizer.decode(inputs['input_ids'][0]))

        inputs = inputs.to(self.device)
        # Cast vision tensors to bfloat16 to match the model weights.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )
        # Drop the prompt tokens and decode only the newly generated continuation.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]
        return output_text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)
    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)

    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels
    )

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]

    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)
```
```
[Output]:
<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the area of a rhombus, which is given by:
\[
\text{Area} = \frac{1}{2} \times d_1 \times d_2
\]
where \( d_1 \) and \( d_2 \) are the lengths of the diagonals. In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals, \( RT \), is 12.2 meters.
We need to find the length of the other diagonal, \( QS \).
Let's denote:
- \( d_1 = RT = 12.2 \) meters
- \( d_2 = QS \)
Substitute the known values into the area formula:
\[
137.9 = \frac{1}{2} \times 12.2 \times QS
\]
To solve for \( QS \), first multiply both sides by 2 to eliminate the fraction:
\[
275.8 = 12.2 \times QS
\]
Next, divide both sides by 12.2:
\[
QS = \frac{275.8}{12.2}
\]
Now, perform the division:
\[
QS \approx 22.6
\]
So, the length of \( QS \) is approximately 22.6 meters.
Looking at the options provided:
A. 11.3
B. 22.4
C. 22.6
D. 25.6
The correct answer is C. 22.6.
</think>
<answer>
\boxed{C. 22.6}
</answer><|im_end|>
```
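Because responses follow the `<think>...</think>` / `<answer>...</answer>` format with the key result in `\boxed{}`, downstream code often only needs the final answer. The helper below is our own small parsing sketch (hypothetical, not part of the repo):

```python
import re

def parse_response(output_text: str) -> dict:
    """Split an M2-Reasoning response into reasoning, answer, and boxed result.

    Assumes the <think>/<answer>/\\boxed{} format requested by the system prompt
    above; any missing part comes back as None.
    """
    def first(pattern: str) -> str | None:
        m = re.search(pattern, output_text, re.DOTALL)
        return m.group(1).strip() if m else None

    return {
        "think": first(r"<think>(.*?)</think>"),
        "answer": first(r"<answer>(.*?)</answer>"),
        "boxed": first(r"\\boxed\{([^{}]*)\}"),
    }

# With the output shown above: parse_response(output_text)["boxed"] == "C. 22.6"
```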
## License and Legal Disclaimer
This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.
## Citation
If you find our work helpful, please feel free to cite it:
```bibtex
@misc{M2reasoning2025,
  title         = {M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning},
  author        = {Inclusion AI},
  year          = {2025},
  archivePrefix = {arXiv},
}
```