# M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
[Technical Report](./assets/M2-Reasoning.pdf) | [arXiv](https://arxiv.org/abs/2507.08306) | 🤗 [Hugging Face](https://huggingface.co/inclusionAI/M2-Reasoning) | 🤖 [ModelScope](https://www.modelscope.cn/models/inclusionAI/M2-Reasoning)
## Introduction
We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data sources, and task-specific rewards that deliver tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.

## Updates
- [2025.07.14] Our Technical Report is available on [arXiv](https://arxiv.org/abs/2507.08306).
- [2025.07.11] We release M2-Reasoning on 🤗 [Hugging Face](https://huggingface.co/inclusionAI/M2-Reasoning) and 🤖 [ModelScope](https://www.modelscope.cn/models/inclusionAI/M2-Reasoning).
## Key Features
- A High-quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates large volumes of high-quality reasoning data with logically coherent trajectories.
- A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation to provide tailored incentive signals (a hedged sketch of such a reward is shown after this list).
- Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM uniquely engineered for both abstract and spatial reasoning. Extensive evaluations on 8 distinct benchmarks demonstrate that, by leveraging our custom data and training pipelines, M2-Reasoning establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
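To make the idea of task-specific rewards concrete, here is a minimal sketch. It is not the exact formulation used in our RLVR training; the function names, the `\boxed{}` extraction, and the tolerance value are illustrative assumptions. The idea is simply that general-reasoning answers can be scored by exact match, while numeric spatial estimates can be scored by relative error.

```python
# Hypothetical sketch of task-specific rewards for RLVR-style training.
# The actual reward formulation used for M2-Reasoning may differ.
import re


def extract_boxed(text: str) -> str:
    """Pull the content of the last \\boxed{...} from a model response."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else ""


def general_reward(response: str, ground_truth: str) -> float:
    """Exact-match reward for general (mathematical/logical) questions."""
    return 1.0 if extract_boxed(response).lower() == ground_truth.strip().lower() else 0.0


def spatial_reward(response: str, ground_truth: float, tol: float = 0.10) -> float:
    """Relative-error reward for numeric spatial estimates (e.g. distances)."""
    try:
        pred = float(extract_boxed(response))
    except ValueError:
        return 0.0
    rel_err = abs(pred - ground_truth) / max(abs(ground_truth), 1e-6)
    # Reward decays linearly from 1.0 at zero error to 0.0 at `tol` relative error.
    return max(0.0, 1.0 - rel_err / tol)


def task_reward(task: str, response: str, ground_truth) -> float:
    """Dispatch to a tailored incentive signal depending on the task type."""
    if task == "spatial_numeric":
        return spatial_reward(response, float(ground_truth))
    return general_reward(response, str(ground_truth))
```

The point of the dispatch is that each task family receives an incentive signal matched to how its answers can actually be verified.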
## Evaluation
We conduct a comprehensive evaluation of our models across two key domains: general and spatial
reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary
capability they measure:
- General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.
| Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ***Base-Scale General Models*** | | | | | | | |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
| InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
| Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
| MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
| SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
| ***Base-Scale Reasoning Models*** | | | | | | | |
| WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | **48.0** | **51.2** | 44.3 (+4.2) |
| Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
| VLAA-Thinker-7B | 68.0 | 26.4 | **48.2** | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
| URSA-8B-PS-GRPO | 67.8 | **31.8** | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
| Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
| ***Our Models*** | | | | | | | |
| Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
| M2-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
| M2-Reasoning-7B | **75.0** | 31.5 | 44.7 | **26.8** | 41.8 | 50.0 | **45.0 (+9.5)** |
| M2-Reasoning-7B-HF* | 74.7 | 30.5 | 46.1 | 26.8 | 42.7 | 49.2 | 45.0 (+9.5) |
\* After converting the checkpoints to the Hugging Face format, the accuracies differ slightly.
- Spatial Reasoning: We assess this capability using 2 benchmarks: CV-Bench and VSI-Bench.
- CV-Bench:
| Models | Count | Relation | Depth | Distance | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: |
| ***Large-Scale Models*** | | | | | |
| GPT-4o | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
| Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
| ***Base-Scale Models*** | | | | | |
| InternVL3-8B| **74.0** | 90.6 | 84.3 | 81.0 | 82.0 |
| Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
| LLaVA-NeXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
| ***Our Models*** | | | | | |
| M2-Reasoning-7B | 66.6 | **92.8** | **89.3** | **84.3** | **82.3** |
- VSI-Bench (OC: object count, AD: absolute distance, OS: object size, RS: room size, RDs: relative distance, RDr: relative direction, RP: route plan, AO: appearance order):
| Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ***Large-Scale Models*** | | | | | | | | | |
| Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| GPT-4o | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
| ***Base-Scale Models*** | | | | | | | | | |
| InternVL3-8B | **68.1** | **39.0** | 48.4 | 33.6 | **48.3** | 36.4 | 27.3 | **35.4** | 42.1 |
| Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
| Qwen2.5-VL-7B-Instruct| 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
| LLaVA-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | **34.0** | 30.6 | 35.6 |
| ***Our Models*** | | | | | | | | | |
| M2-Reasoning-7B | 41.0 | 34.0 | **60.9** | **55.4** | 40.7 | **47.3** | 29.9 | 28.8 | **42.3** |
## Model Downloads
You can download the model from both 🤗 [Hugging Face](https://huggingface.co/inclusionAI/M2-Reasoning) and 🤖 [ModelScope](https://www.modelscope.cn/models/inclusionAI/M2-Reasoning).
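As an optional convenience, the checkpoint can also be fetched programmatically. The snippet below is a minimal sketch using `huggingface_hub`; the local directory path is just an example.

```python
from huggingface_hub import snapshot_download

# Download the M2-Reasoning checkpoint (weights plus the custom modeling/processing
# files referenced in the usage example below) into a local directory.
local_dir = snapshot_download(
    repo_id="inclusionAI/M2-Reasoning",
    local_dir="./M2-Reasoning",  # example path; choose any location you like
)
print(f"Checkpoint saved to {local_dir}")
```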
## Installation
Please download the model as described in [Model Downloads](#model-downloads), then refer to the following code to run the M2-Reasoning model.
The basic environment is `python=3.10`, `torch=2.6.0+cu124`, and `transformers=4.49.0`.
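As a quick sanity check of the environment (a minimal sketch; only the versions listed above have been verified with this repo), you can print the installed versions:

```python
import torch
import transformers

# Expected for this repo: Python 3.10, torch 2.6.0+cu124, transformers 4.49.0
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```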
## Example Usage
We provide a small example of how to use this repo below.
```python
import torch
from transformers import (
    AutoProcessor,
    AutoTokenizer,
)
import warnings
import argparse

# Custom model/processor classes shipped with the checkpoint.
from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")


class BailingMMInfer:
    def __init__(self,
                 model_name_or_path,
                 device="cuda",
                 max_pixels=None,
                 min_pixels=None,
                 video_max_pixels=768 * 28 * 28,
                 video_min_pixels=128 * 28 * 28,
                 generation_config=None
                 ):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.device = device
        self.device_map = device
        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels

        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9
            }
        self.generation_config = generation_config

    def load_model_processor(self):
        # Load the model in bfloat16 with FlashAttention-2 for faster inference.
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2"
        ).eval()
        tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, add_bos_token=True, trust_remote_code=True)
        processor = Bailing_qwen2_5Processor.from_pretrained(self.model_name_or_path, trust_remote_code=True)
        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        # Render the chat template and collect image/video inputs from the messages.
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )
        image_inputs, video_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # Print the rendered prompt for inspection.
        print(self.tokenizer.decode(inputs['input_ids'][0]))

        inputs = inputs.to(self.device)
        # Vision tensors must match the model's bfloat16 dtype.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )
        # Strip the prompt tokens and decode only the newly generated ones.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]
        return output_text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)
    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"

    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels
    )

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]

    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)
'''
[Output]:
<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the area of a rhombus, which is given by:
\[
\text{Area} = \frac{1}{2} \times d_1 \times d_2
\]
where \( d_1 \) and \( d_2 \) are the lengths of the diagonals. In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals, \( RT \), is 12.2 meters.
We need to find the length of the other diagonal, \( QS \).
Let's denote:
- \( d_1 = RT = 12.2 \) meters
- \( d_2 = QS \)
Substitute the known values into the area formula:
\[
137.9 = \frac{1}{2} \times 12.2 \times QS
\]
To solve for \( QS \), first multiply both sides by 2 to eliminate the fraction:
\[
275.8 = 12.2 \times QS
\]
Next, divide both sides by 12.2:
\[
QS = \frac{275.8}{12.2}
\]
Now, perform the division:
\[
QS \approx 22.6
\]
So, the length of \( QS \) is approximately 22.6 meters.
Looking at the options provided:
A. 11.3
B. 22.4
C. 22.6
D. 25.6
The correct answer is C. 22.6.
</think>
<answer>
\boxed{C. 22.6}
</answer><|im_end|>
'''
```
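Assuming the snippet above is saved as `example.py` (a hypothetical file name) in the same directory as the checkpoint's `modeling_bailing_qwen2_5.py` and `processing_bailing_qwen2_5.py` files, it can be run with, for example, `python example.py --model_name_or_path inclusionAI/M2-Reasoning --max_new_tokens 4096`.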
## License and Legal Disclaimer
This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.
## Citation
If you find our work helpful, please consider citing it:
```bibtex
@misc{M2reasoning2025,
    title         = {M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning},
    author        = {Inclusion AI},
    year          = {2025},
    eprint        = {2507.08306},
    archivePrefix = {arXiv},
    url           = {https://arxiv.org/abs/2507.08306}
}
```