daviddongdong committed on
Commit fd10d42 · verified · 1 Parent(s): e5ab49f

Update README.md

Files changed (1): README.md (+16 -5)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66337cb5bd8ef15a47e72ce0/hhAX190AZageb5Bqqr_lp.png)

## 1. Abstract
Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information, from extensive documents.
Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval.
To address this gap, this work introduces a new benchmark, named **MMDocIR**, encompassing two distinct tasks: **page-level** and **layout-level** retrieval.
 
Through rigorous experiments, we reveal that …
These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.

## 2. Task Setting

### Page-level Retrieval
The page-level retrieval task is designed to identify the most relevant pages within a document in response to a user query.

### Layout-level Retrieval
Layout-level retrieval aims to retrieve the most relevant layouts, defined as fine-grained elements such as paragraphs, equations, figures, tables, and charts.
This task allows for more nuanced content retrieval, homing in on specific information that directly answers user queries.
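
To make the task setting concrete, here is a minimal sketch of page-level retrieval: score every page of a document against the query and return the top-ranked page indices. The TF-IDF scorer, the `rank_pages` helper, and the sample pages below are illustrative stand-ins, not the retrievers evaluated by the benchmark.

```python
# Minimal page-level retrieval sketch: rank a document's pages against a
# query and return the top-k page indices. TF-IDF is used here only as a
# simple, runnable stand-in scorer; it is NOT the benchmark's baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_pages(query: str, pages: list[str], k: int = 5) -> list[int]:
    """Return indices of the k pages most similar to the query."""
    vectorizer = TfidfVectorizer()
    page_vecs = vectorizer.fit_transform(pages)   # one row per page
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, page_vecs)[0]
    return scores.argsort()[::-1][:k].tolist()

# Hypothetical usage: `pages` holds the extracted text of each page.
pages = [
    "Revenue grew 12% year over year, driven by cloud services ...",
    "Figure 3 illustrates the proposed model architecture ...",
    "Appendix A: legal disclaimers and licensing terms ...",
]
print(rank_pages("How much did revenue grow?", pages, k=2))
```

Layout-level retrieval follows the same pattern, except the candidate pool consists of fine-grained layout elements (paragraphs, equations, figures, tables, charts) instead of whole pages.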

## 3. Evaluation Set

### 3.1 Document Analysis
The **MMDocIR** evaluation set includes 313 long documents averaging 65.1 pages, categorized into ten main domains: research reports, administration & industry, tutorials & workshops, academic papers, brochures, financial reports, guidebooks, government documents, laws, and news articles.
Different domains feature distinct distributions of multi-modal information. For instance, research reports, tutorials, workshops, and brochures predominantly contain images, whereas financial and industry documents are table-rich. In contrast, government and legal documents primarily comprise text. Overall, the modality distribution is: Text (60.4%), Image (18.8%), Table (16.7%), and other modalities (4.1%).

### 3.2 Question and Annotation Analysis
The **MMDocIR** evaluation set encompasses 1,658 questions, 2,107 page labels, and 2,638 layout labels. The modalities required to answer these questions distribute across four categories: Text (44.7%), Image (21.7%), Table (37.4%), and Layout/Meta (11.5%); percentages sum to more than 100% because some questions require multiple modalities. The "Layout/Meta" category encompasses questions related to layout information and metadata statistics.
Notably, the dataset poses several challenges: 254 questions necessitate cross-modal understanding, 313 questions demand evidence across multiple pages, and 637 questions require reasoning based on multiple layouts. These complexities highlight the need for advanced multi-modal reasoning and contextual understanding.
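
Given these page and layout labels, retrieval quality on MMDocIR can be scored with standard ranking metrics. The sketch below computes recall@k for page-level retrieval; the metric choice and the `recall_at_k` helper are illustrative assumptions, not the benchmark's prescribed protocol (see the MMDocIR paper for the official evaluation setup).

```python
# Illustrative recall@k: the fraction of a question's ground-truth pages
# that appear among the top-k retrieved pages. An assumed metric for
# demonstration, not necessarily MMDocIR's official one.

def recall_at_k(ranked_pages: list[int], gold_pages: set[int], k: int) -> float:
    """Fraction of ground-truth pages found in the top-k retrieved pages."""
    if not gold_pages:
        raise ValueError("gold_pages must be non-empty")
    return len(set(ranked_pages[:k]) & gold_pages) / len(gold_pages)

# Hypothetical example: the retriever ranked pages [4, 1, 9, 2]; the
# question's evidence spans pages {1, 9} (a multi-page question).
print(recall_at_k([4, 1, 9, 2], {1, 9}, k=3))  # -> 1.0
```

The same computation applies at the layout level by substituting layout identifiers for page indices.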

## 4. Train Set