Update README.md
Commit 52584f1 by FuriouslyAsleep
Parent(s): 07fd7ab
README.md CHANGED
@@ -1,9 +1,19 @@
-# MarkupLM
-**Multimodal (text + markup language) pre-training for [Document AI](https://www.microsoft.com/en-us/research/project/document-ai/)**
+# MarkupLM Large fine-tuned on WebSRC to allow Question Answering

+**Fine-tuned Multimodal (text + markup language) pre-training for [Document AI](https://www.microsoft.com/en-us/research/project/document-ai/)**

## Introduction

MarkupLM is a simple but effective multi-modal pre-training method for text and markup language, aimed at visually-rich document understanding and information extraction tasks such as webpage QA and webpage information extraction. MarkupLM achieves state-of-the-art (SOTA) results on multiple datasets. For more details, please refer to our paper:

[MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) Junlong Li, Yiheng Xu, Lei Cui, Furu Wei
+
+Fine-tuning args:
+--per_gpu_train_batch_size 4 --warmup_ratio 0.1 --num_train_epochs 4
+
+Training was performed on only a small subset of WebSRC:
+The number of total websites is 60
+The train websites list is ['ga09']
+The test websites list is []
+The dev websites list is ['ga12', 'ph04', 'au08', 'ga10', 'au01', 'bo17', 'mo02', 'jo11', 'sp09', 'sp10', 'ph03', 'ph01', 'un09', 'sp14', 'jo03', 'sp07', 'un07', 'bo07', 'mo04', 'bo09', 'jo10', 'un12', 're02', 'bo01', 'ca01', 'sp15', 'au12', 'un03', 're03', 'jo13', 'ph02', 'un10', 'au09', 'au10', 'un02', 'mo07', 'sp13', 'bo08', 'sp03', 're05', 'sp06', 'ca02', 'sp02', 'sp01', 'au03', 'sp11', 'mo06', 'bo10', 'un11', 'un06', 'ga01', 'un04', 'ph05', 'au11', 'sp12', 'jo05', 'sp04', 'jo12', 'sp08']
+The number of processed websites is 60
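
The split listing above can be sanity-checked with a short script. This is only a sketch: the website IDs are copied from the README lines in this diff, while the helper name `summarize_split` and the grouping by two-letter ID prefix are assumptions made for illustration, not part of the fine-tuning code.

```python
from collections import Counter

# Website IDs copied from the split listing in the README diff.
train_sites = ["ga09"]
test_sites = []
dev_sites = [
    "ga12", "ph04", "au08", "ga10", "au01", "bo17", "mo02", "jo11", "sp09", "sp10",
    "ph03", "ph01", "un09", "sp14", "jo03", "sp07", "un07", "bo07", "mo04", "bo09",
    "jo10", "un12", "re02", "bo01", "ca01", "sp15", "au12", "un03", "re03", "jo13",
    "ph02", "un10", "au09", "au10", "un02", "mo07", "sp13", "bo08", "sp03", "re05",
    "sp06", "ca02", "sp02", "sp01", "au03", "sp11", "mo06", "bo10", "un11", "un06",
    "ga01", "un04", "ph05", "au11", "sp12", "jo05", "sp04", "jo12", "sp08",
]


def summarize_split(train, dev, test):
    """Hypothetical helper: return split sizes, the total, and per-prefix counts."""
    all_sites = train + dev + test
    # Each website should appear in exactly one split.
    if len(all_sites) != len(set(all_sites)):
        raise ValueError("a website appears in more than one split")
    sizes = {"train": len(train), "dev": len(dev), "test": len(test)}
    # Assumption: the two-letter prefix of an ID groups related websites.
    by_prefix = Counter(site[:2] for site in all_sites)
    return sizes, len(all_sites), by_prefix


sizes, total, by_prefix = summarize_split(train_sites, dev_sites, test_sites)
print(sizes, total)
print(by_prefix.most_common())
```

The tally makes the imbalance explicit: a single website is used for training, none for testing, and the remaining websites are all held out for development.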
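
To give a feel for what the fine-tuning arguments in this diff imply, the sketch below estimates how many optimizer steps a run takes and how many of them fall into the warmup phase. It assumes the common convention that warmup steps equal `warmup_ratio` times the total step count; the diff does not show how the fine-tuning script actually computes this. The function name `estimate_warmup`, the `num_gpus` and `grad_accum_steps` parameters, and the example count of 1000 are all hypothetical.

```python
import math


def estimate_warmup(num_examples, per_gpu_train_batch_size=4, num_gpus=1,
                    grad_accum_steps=1, num_train_epochs=4, warmup_ratio=0.1):
    """Hypothetical helper: estimate (total_steps, warmup_steps) for a run.

    Defaults mirror the three arguments listed in the README; everything
    else (GPU count, gradient accumulation) is an assumption.
    """
    effective_batch = per_gpu_train_batch_size * num_gpus * grad_accum_steps
    # One optimizer step per effective batch; the last batch may be partial.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_train_epochs
    # Assumed convention: warmup covers the first warmup_ratio of all steps.
    warmup_steps = int(total_steps * warmup_ratio)
    return total_steps, warmup_steps


# 1000 training examples is a made-up figure used only for illustration.
total_steps, warmup_steps = estimate_warmup(num_examples=1000)
print(total_steps, warmup_steps)
```

With a per-GPU batch size of 4 on one GPU, the step count scales directly with the number of training examples, so a run on a single training website stays short.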