khhuang committed on
Commit 2299e1d · verified · 1 Parent(s): 56198c1

Upload README.md with huggingface_hub

  1. README.md +39 -3
README.md CHANGED
@@ -1,3 +1,39 @@
- ---
- license: cc-by-nc-4.0
- ---
+ ---
+ license: cc-by-nc-4.0
+ base_model: OpenGVLab/InternVL2_5-1B-MPO
+ datasets:
+ - Salesforce/CogAlign
+ language:
+ - multilingual
+ model-index:
+ - name: cogalign-internvl2.5-mpo-1b
+   results: []
+ ---
+
+
+
+ # Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
+
+ Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic: seemingly simple capabilities such as object counting or length comparison that are essential for more complex tasks such as chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing tasks focusing on basic visual arithmetic. Our analysis reveals that while pre-trained vision encoders typically capture sufficient information, the text decoder often fails to decode it correctly for arithmetic reasoning. To address this, we propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development. CogAlign trains VLMs to recognize invariant properties under visual transformations. We demonstrate that this approach significantly improves the performance of three diverse VLMs on our proposed probing tasks. Furthermore, CogAlign enhances performance by an average of 4.6% on CHOCOLATE and 2.9% on MATH-VISION, outperforming or matching supervised fine-tuning methods while requiring 60% less training data. These results highlight the effectiveness and generalizability of CogAlign in improving fundamental visual arithmetic capabilities and their transfer to downstream tasks.
+
+ ### License information
+ This release is for research purposes only in support of an academic paper. This repository is licensed under the noncommercial license [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
+
+
+ ### Citation
+ If you find CogAlign useful in your research, please consider citing:
+ ```
+ @misc{huang-etal-2025-cogalign,
+     title = "Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding",
+     author = "Huang, Kung-Hsiang and
+       Qin, Can and
+       Qiu, Haoyi and
+       Laban, Philippe and
+       Joty, Shafiq and
+       Xiong, Caiming and
+       Wu, Chien-Sheng",
+     year = "2025",
+     archivePrefix = "arXiv",
+     primaryClass = "cs.AI"
+ }
+ ```