Update README.md
---
sdk: static
pinned: false
---

# MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
## Abstract
Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information, from extensive documents.
Despite its significance, there is a notable lack of a robust benchmark for effectively evaluating the performance of systems in multi-modal document retrieval.
To address this gap, this work introduces a new benchmark, named **MMDocIR**, encompassing two distinct tasks: **page-level** and **layout-level** retrieval.
**The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a finer granularity than whole-page analysis.**
A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts.
The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval in both training and evaluation.
Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts; (ii) the MMDocIR train set can effectively benefit the training of multi-modal document retrievers; and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text.
These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.
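To make the page-level task concrete, below is a minimal sketch of how a retriever could be scored against page labels; the function and variable names are illustrative assumptions, not part of any released MMDocIR tooling. Layout-level retrieval is evaluated analogously, with layout elements in place of pages.

```python
# Minimal sketch: recall@k for page-level retrieval, assuming a retriever
# that assigns one relevance score per page of a document for a question.
# All names here (recall_at_k, gold_pages, etc.) are hypothetical.

def recall_at_k(
    scores: list[float],   # one relevance score per page, from any retriever
    gold_pages: set[int],  # annotated evidence page indices for the question
    k: int = 5,
) -> float:
    """Fraction of annotated evidence pages ranked in the top k."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if i in gold_pages)
    return hits / len(gold_pages)

# Example: a 10-page document where pages 2 and 7 hold the evidence.
example_scores = [0.1, 0.3, 0.9, 0.2, 0.1, 0.4, 0.2, 0.8, 0.1, 0.05]
print(recall_at_k(example_scores, gold_pages={2, 7}, k=3))  # -> 1.0
```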
## Evaluation Set
### Document Analysis
The **MMDocIR** evaluation set includes 313 long documents averaging 65.1 pages, categorized into ten main domains: research reports, administration & industry, tutorials & workshops, academic papers, brochures, financial reports, guidebooks, government documents, laws, and news articles.

Different domains feature distinct distributions of multi-modal information. For instance, research reports, tutorials, workshops, and brochures predominantly contain images, whereas financial and industry documents are table-rich. In contrast, government and legal documents primarily comprise text. Overall, the modality distribution is: Text (60.4%), Image (18.8%), Table (16.7%), and other modalities (4.1%).
### Question and Annotation Analysis
The **MMDocIR** evaluation set encompasses 1,658 questions, 2,107 page labels, and 2,638 layout labels. The modalities required to answer these questions are distributed across four categories: Text (44.7%), Image (21.7%), Table (37.4%), and Layout/Meta (11.5%); the percentages sum to more than 100% because some questions require evidence from multiple modalities. The "Layout/Meta" category covers questions related to layout information and metadata statistics.

Notably, the dataset poses several challenges: 254 questions necessitate cross-modal understanding, 313 questions demand evidence across multiple pages, and 637 questions require reasoning over multiple layouts. These complexities highlight the need for advanced multi-modal reasoning and contextual understanding.
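For illustration, a single evaluation example could be organized as below. This is a hypothetical sketch of the record structure implied by the statistics above, not the released schema; every field name is an assumption.

```python
# Hypothetical shape of one MMDocIR evaluation example (illustrative only):
# a question may carry several page labels and several layout labels, which
# is how 1,658 questions yield 2,107 page labels and 2,638 layout labels.
example = {
    "doc_name": "example_financial_report",   # one of the 313 documents
    "question": "How did quarterly revenue change according to the chart?",
    "evidence_pages": [12, 13],               # page-level labels
    "evidence_layouts": [45, 47, 52],         # layout-level labels (finer)
    "answer_modalities": ["Image", "Table"],  # a cross-modal question
}
print(len(example["evidence_pages"]), len(example["evidence_layouts"]))
```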
## Train Set
The MMDocIR train set provides bootstrapped labels for 173,843 questions. As our experiments show, training on this set can effectively benefit multi-modal document retrievers.
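As a sketch of how such question/page labels could be used, below is a generic contrastive training step for a dual-encoder retriever with in-batch negatives. The encoders, batch format, and temperature are assumptions for illustration, not a prescribed MMDocIR training recipe.

```python
import torch
import torch.nn.functional as F

def info_nce_step(q_emb: torch.Tensor, p_emb: torch.Tensor, tau: float = 0.05):
    """One contrastive step: row i of q_emb is a question embedding paired
    with its labeled page embedding in row i of p_emb; the other pages in
    the batch serve as in-batch negatives."""
    q = F.normalize(q_emb, dim=-1)           # (B, d) question embeddings
    p = F.normalize(p_emb, dim=-1)           # (B, d) positive page embeddings
    logits = q @ p.T / tau                   # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)   # diagonal entries are positives

# Example with random embeddings standing in for encoder outputs.
loss = info_nce_step(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```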
## Citation
If you use any datasets from this organization in your research, please cite the original dataset as follows:

```
@misc{,
}
```