---
license: mit
---
## INF-MLLM2: High-Resolution Image and Document Understanding

In INF-MLLM2, we introduce significant updates, particularly in high-resolution image processing, document understanding, and OCR.
The key improvements include the following:
- Dynamic Image Resolution Support: The model now supports dynamic image resolution up to 1344x1344 pixels (an illustrative preprocessing sketch follows this list).
- Enhanced OCR Capabilities: The model has significantly improved OCR capabilities, enabling robust document parsing, table and formula recognition, document layout analysis, and key information extraction.
- Advanced Training Strategies: We employed a progressive multi-stage training strategy along with an enhanced data mixup strategy tailored for image and document multitask scenarios.
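
To make the resolution budget above concrete, here is a minimal, illustrative preprocessing sketch: it fits an arbitrary image into the 1344x1344 budget while preserving aspect ratio and snapping to a patch grid. The resize logic and the patch size of 14 are assumptions for illustration only, not the model's documented pipeline (see the technical report and demo.py for the actual preprocessing).

```python
# Illustrative only: fit an image into the 1344x1344 budget while keeping
# its aspect ratio. PATCH = 14 is an assumed ViT patch size, not a documented
# value; the released model's actual preprocessing lives in demo.py.
from PIL import Image

MAX_SIDE = 1344
PATCH = 14

def fit_to_budget(image: Image.Image) -> Image.Image:
    w, h = image.size
    scale = min(1.0, MAX_SIDE / max(w, h))               # never upscale
    new_w = max(PATCH, int(w * scale) // PATCH * PATCH)  # snap to patch grid
    new_h = max(PATCH, int(h * scale) // PATCH * PATCH)
    return image.resize((new_w, new_h), Image.Resampling.BICUBIC)

if __name__ == "__main__":
    img = Image.open("docs/demo1.png").convert("RGB")
    print(fit_to_budget(img).size)                        # both sides <= 1344
```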

<p align="center">
<img src="docs/model.png" alt="" width="100%"/>
</p>

[Technical Report](docs/tech_report.pdf)

### Install

```bash
conda create -n infmllm2 python=3.9
conda activate infmllm2
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.1.2

pip install transformers==4.40.2 timm==0.5.4 pillow==10.4.0 sentencepiece==0.1.99
pip install bigmodelvis peft einops spacy
```
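
After installation, an optional sanity check of the pinned versions (not part of the official setup) can catch environment mismatches early:

```python
# Optional sanity check for the pinned dependencies; run inside the
# `infmllm2` conda environment created above.
import torch, torchvision, transformers, timm, PIL

print("torch:       ", torch.__version__)         # expect 2.2.1
print("torchvision: ", torchvision.__version__)   # expect 0.17.1
print("transformers:", transformers.__version__)  # expect 4.40.2
print("timm:        ", timm.__version__)          # expect 0.5.4
print("Pillow:      ", PIL.__version__)           # expect 10.4.0
print("CUDA available:", torch.cuda.is_available())
```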

### Model Zoo
We have released the INF-MLLM2-7B model on Hugging Face.
- [INF-MLLM2-7B](https://huggingface.co/QianYEee/InfMLLM2_7B_chat)
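
To fetch the checkpoint locally (for example, to use as the `--model_path` argument in the Usage section below), `huggingface_hub` can download the repository; the target directory here is just an example:

```python
# Download the released checkpoint to a local directory (example path).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="QianYEee/InfMLLM2_7B_chat",
    local_dir="./InfMLLM2_7B_chat",
)
```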

### Evaluation
Comparison with general-purpose multimodal LLMs across multiple benchmarks and OCR-related tasks.
<p align="center">
<img src="docs/results_1.jpg" alt="" width="90%"/>
</p>

Comparison with OCR-free multimodal LLMs on content parsing of documents, tables, and formulas.
<p align="center">
<img src="docs/results_2.jpg" alt="" width="90%"/>
</p>

Comparison with OCR-free multimodal LLMs on key information extraction.
<p align="center">
<img src="docs/results_3.jpg" alt="" width="90%"/>
</p>

### Visualization

<p align="center">
<img src="docs/demo1.png" alt="" width="90%"/>
</p>

<p align="center">
<img src="docs/demo2.png" alt="" width="90%"/>
</p>

<p align="center">
<img src="docs/demo3.png" alt="" width="90%"/>
</p>

<p align="center">
<img src="docs/table_equation.png" alt="" width="90%"/>
</p>

### Usage

Inference with INF-MLLM2 is straightforward. We provide a simple [demo.py](demo.py) script as a reference; a minimal loading sketch follows the command below.

```bash
CUDA_VISIBLE_DEVICES=0 python demo.py --model_path /path/to/InfMLLM2_7B_chat
```
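
For orientation, the following is a minimal loading sketch assuming the checkpoint exposes a standard `transformers` interface via `trust_remote_code`; the image preprocessing and chat/generation calls are defined by the model's remote code, so follow demo.py for the actual API.

```python
# Minimal loading sketch (illustrative; demo.py shows the actual inference API).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/InfMLLM2_7B_chat"  # local checkpoint or HF repo id

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # assumed dtype; check demo.py
    trust_remote_code=True,
).cuda().eval()

# Image preprocessing, prompt formatting, and generation are implemented in
# the model's remote code; see demo.py for the end-to-end chat loop.
```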

## Acknowledgement

We thank the authors of [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT.git) and [InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer.git) for their great work.