daviddongdong committed
Commit 51a18a8 (verified)
Parent: aec823f

Update README.md

Files changed (1)
  1. README.md +96 -40
README.md CHANGED
@@ -7,50 +7,95 @@ sdk: static
  pinned: false
  ---

- # MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66337cb5bd8ef15a47e72ce0/hhAX190AZageb5Bqqr_lp.png)

- ## 1. Abstract
- Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents.
- Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval.
- To address this gap, this work introduces a new benchmark, named **MMDocIR**, encompassing two distinct tasks: **page-level** and **layout-level** retrieval.
- **The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a more fine-grained granularity than whole-page analysis.**
- A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts.
- The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation.
- Through rigorous experiments, we reveal that
- (i) visual retrievers significantly outperform their text counterparts;
- (ii) the MMDocIR train set can effectively benefit the training process of multi-modal document retrieval;
- (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text.
- These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.
-
-
- ## 2. Task Setting
- ### Page-level Retrieval
- The page-level retrieval task is designed to identify the most relevant pages within a document in response to a user query.
-
- ### Layout-level Retrieval
- The layout-level retrieval task aims to retrieve the most relevant layouts.
- Layouts are defined as fine-grained elements such as paragraphs, equations, figures, tables, and charts.
- This task allows for more nuanced content retrieval, honing in on specific information that directly answers user queries.
-
-
- ## 3. Evaluation Set
- ### 3.1 Document Analysis
-
- The **MMDocIR** evaluation set includes 313 long documents averaging 65.1 pages, categorized into ten main domains: research reports, administration & industry, tutorials & workshops, academic papers, brochures, financial reports, guidebooks, government documents, laws, and news articles.
- Different domains feature distinct distributions of multi-modal information. For instance, research reports, tutorials, workshops, and brochures predominantly contain images, whereas financial and industry documents are table-rich. In contrast, government and legal documents primarily comprise text. Overall, the modality distribution is: Text (60.4%), Image (18.8%), Table (16.7%), and other modalities (4.1%).
-
- ### 3.2 Question and Annotation Analysis
- The **MMDocIR** evaluation set encompasses 1,658 questions, 2,107 page labels, and 2,638 layout labels. The modalities required to answer these questions are distributed across four categories: Text (44.7%), Image (21.7%), Table (37.4%), and Layout/Meta (11.5%). The "Layout/Meta" category encompasses questions related to layout information and meta-data statistics.
- Notably, the dataset poses several challenges: 254 questions necessitate cross-modal understanding, 313 questions demand evidence across multiple pages, and 637 questions require reasoning based on multiple layouts. These complexities highlight the need for advanced multi-modal reasoning and contextual understanding.
-
-
- ## 4. Train Set
-

+ # MMDocRAG Overview
+
+ MMDocRAG is built for (i) multimodal document retrieval and (ii) retrieval-augmented multimodal generation:
+
+ - **MMDocIR** (📖<a href="https://arxiv.org/abs/2501.08828">Paper</a> 🏠<a href="https://mmdocrag.github.io/MMDocIR/">Homepage</a> 👉<a href="https://github.com/mmdocrag/MMDocIR">Github</a>): Benchmarking Multi-Modal Retrieval for Long Documents
+   - encompasses two distinct tasks: **page-level** and **layout-level** retrieval.
+   - [MMDocIR_Evaluation_Dataset](https://huggingface.co/datasets/MMDocIR/MMDocIR_Evaluation_Dataset): 1,685 expert-annotated questions for evaluation.
+   - [MMDocIR_Train_Dataset](https://huggingface.co/datasets/MMDocIR/MMDocIR_Train_Dataset): 173,843 bootstrapped questions for training.
+   - [Retriever Checkpoints](https://huggingface.co/MMDocIR/MMDocIR_Retrievers): 6 text and 4 vision retrievers.
+ - **MMDocRAG** (📖<a href="https://arxiv.org/abs/2505.16470">Paper</a> 🏠<a href="https://mmdocrag.github.io/MMDocRAG/">Homepage</a> 👉<a href="https://github.com/mmdocrag/MMDocRAG">Github</a>): Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
+   - designed for **multimodal quote selection** and **multimodal integration**.
+   - [MMDocRAG](https://huggingface.co/datasets/MMDocIR/MMDocRAG): 4,055 expert-annotated questions with multimodal answers.
+   - [MMDocRAG Training](https://huggingface.co/datasets/MMDocIR/MMDocRAG/blob/main/train.jsonl): 4,110 training samples derived from the dev set.
+   - [Retriever Checkpoints](https://huggingface.co/MMDocIR/MMDocIR_Retrievers): 6 text and 4 vision retrievers.
+
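The resources linked above are hosted as ordinary Hugging Face dataset repositories, so they can be pulled with the `datasets` library. Below is a minimal loading sketch; the split and file names (e.g. `train.jsonl`) are taken from the links above, and the rest is an assumption to be checked against each dataset card.

```python
# Minimal sketch of loading the MMDocIR/MMDocRAG data with the `datasets` library.
# File/split names are assumptions based on the links above; consult each dataset
# card for the authoritative layout.
from datasets import load_dataset

# MMDocIR evaluation questions (page-level and layout-level labels).
mmdocir_eval = load_dataset("MMDocIR/MMDocIR_Evaluation_Dataset")

# MMDocRAG QA pairs; train.jsonl is the training file referenced above.
mmdocrag_train = load_dataset("MMDocIR/MMDocRAG", data_files={"train": "train.jsonl"})

print(mmdocrag_train["train"][0])
```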
+ <p align="center">
+   <h2 align="center">MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents</h2>
+   <p align="center">
+     <strong>Kuicai Dong*</strong>
+     <strong>Yujing Chang*</strong>
+     <strong>Derrick Xin Deik Goh*</strong>
+     <strong>Dexun Li</strong>
+     <a href="https://scholar.google.com/citations?user=fUtHww0AAAAJ&hl=en"><strong>Ruiming Tang</strong></a>
+     <a href="https://stephenliu0423.github.io/"><strong>Yong Liu</strong></a>
+   </p>
+   <p align="center">
+     📖<a href="https://arxiv.org/abs/2501.08828">Paper</a> |
+     🏠<a href="https://mmdocrag.github.io/MMDocIR/">Homepage</a> |
+     🤗<a href="https://huggingface.co/MMDocIR">Huggingface</a> |
+     👉<a href="https://github.com/mmdocrag/MMDocIR">Github</a>
+   </p>
+   <p align="left">
+     Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the latter assesses the ability to detect specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource in multimodal document retrieval for both training and evaluation. Through rigorous experiments, we demonstrate that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set effectively enhances the performance of multimodal document retrieval, and (iii) text retrievers leveraging VLM-text significantly outperform retrievers relying on OCR-text.
+   </p>
+   <div style="text-align: center;">
+     <img src="https://cdn-uploads.huggingface.co/production/uploads/66337cb5bd8ef15a47e72ce0/hhAX190AZageb5Bqqr_lp.png" alt="Logo" width="80%">
+   </div>
+   <br>
+ </p>
+
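To make the page-level retrieval task concrete, the sketch below ranks a document's pages against a query with a generic off-the-shelf text embedding model (`sentence-transformers/all-MiniLM-L6-v2`). This is only an illustrative baseline, not one of the MMDocIR retrievers, and the query and page texts are made up.

```python
# Illustrative page-level retrieval baseline: embed the query and each page's
# text, then rank pages by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "What is the revenue trend shown in the quarterly chart?"   # hypothetical query
pages = [
    "Page 1: company overview and letter to shareholders ...",       # hypothetical page texts
    "Page 2: quarterly revenue table and trend chart caption ...",
    "Page 3: legal disclaimers ...",
]

q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode(pages, convert_to_tensor=True)
scores = util.cos_sim(q_emb, p_emb)[0]                # one score per page

ranked = sorted(range(len(pages)), key=lambda i: float(scores[i]), reverse=True)
print("Top-ranked page:", ranked[0] + 1)
```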
+ <p align="center">
+   <h2 align="center">MMDocRAG: Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering</h2>
+   <p align="center">
+     <strong>Kuicai Dong</strong>
+     <strong>Yujing Chang</strong>
+     <strong>Shijie Huang</strong>
+     <a href="https://scholar.google.com/citations?user=x-UYeJ4AAAAJ&hl=en"><strong>Yasheng Wang</strong></a>
+     <a href="https://scholar.google.com/citations?user=fUtHww0AAAAJ&hl=en"><strong>Ruiming Tang</strong></a>
+     <a href="https://stephenliu0423.github.io/"><strong>Yong Liu</strong></a>
+   </p>
+   <p align="center">
+     📖<a href="https://arxiv.org/abs/2505.16470">Paper</a> |
+     🏠<a href="https://mmdocrag.github.io/MMDocRAG/">Homepage</a> |
+     🤗<a href="https://huggingface.co/datasets/MMDocIR/MMDocRAG">Huggingface</a> |
+     👉<a href="https://github.com/mmdocrag/MMDocRAG">Github</a>
+   </p>
+   <p align="left">
+     Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence integration and selection. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that combine text with relevant visual elements. Through large-scale experiments with 60 language/vision models and 14 retrieval systems, we identify persistent challenges in multimodal evidence handling. Key findings reveal that proprietary vision-language models show moderate advantages over text-only models, while open-source alternatives trail significantly. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems.
+   </p>
+   <img src="https://cdn-uploads.huggingface.co/production/uploads/66337cb5bd8ef15a47e72ce0/o-9uRuQyFJNU-bHsE3LSR.png" alt="Logo" width="100%">
+   <br>
+ </p>
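Multimodal quote selection can be scored with ordinary set-overlap metrics. The sketch below computes an F1 score between the quote IDs a model selects and the gold evidence set for one question; the ID scheme and the helper function are hypothetical and not necessarily the exact metric used in the paper.

```python
# Hypothetical quote-selection scoring: compare the set of quote IDs a model
# selected against the gold evidence set for one question.
def quote_selection_f1(predicted_ids, gold_ids):
    pred, gold = set(predicted_ids), set(gold_ids)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)                      # correctly selected quotes
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with made-up IDs ("img_3" for an image quote, "txt_7" for a text quote).
print(quote_selection_f1(["txt_7", "img_3", "txt_9"], ["img_3", "txt_7"]))  # 0.8
```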

@@ -66,4 +111,15 @@ If you use any datasets from this organization in your research, please cite the
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2501.08828},
  }
+
+ @misc{dong2025benchmarkingretrievalaugmentedmultimodalgeneration,
+       title={Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering},
+       author={Kuicai Dong and Yujing Chang and Shijie Huang and Yasheng Wang and Ruiming Tang and Yong Liu},
+       year={2025},
+       eprint={2505.16470},
+       archivePrefix={arXiv},
+       primaryClass={cs.IR},
+       url={https://arxiv.org/abs/2505.16470},
+ }
+
  ```