Update README.md
README.md
CHANGED
@@ -1,3 +1,50 @@
---
license: apache-2.0
---
# Implementation of the ACL 2024 Findings paper "Improving Grammatical Error Correction via Contextual Data Augmentation"

[GitHub repository](https://github.com/wyxstriker/CDA4GEC)

# Model Weights
We release the model weights from each training stage.
The models are trained with the Fairseq framework; details of the weights and their download locations are listed below.

|Name|Data Info|Download Link|
|:--:|--|--|
|Stage1|Pre-training on the [C4 synthetic data](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) (200M scale)|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC), path `/tree/main/stage1_checkpoint_best.pt`|
|Stage2+|Fine-tuning on the augmented Lang-8, NUCLE, FCE, and W&I+L datasets|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC), path `/tree/main/stage2_checkpoint_best.pt`|
|Stage3+|Continued fine-tuning on the augmented W&I+L dataset|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC), path `/tree/main/stage3_checkpoint_best.pt`|

# Synthetic Data
> We release only the synthetic pseudo-data. To obtain the original annotated data, please follow each dataset's official application process.

|Data Info|Amount|Source|Path|
|:--:|:--:|:--:|:--:|
|Stage2+|2M|Lang-8, NUCLE, FCE, and W&I+L|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC), path `/tree/main/pseudo/stage2`|
|Stage3+|200K|W&I+L|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC), path `/tree/main/pseudo/stage3`|

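The checkpoint and pseudo-data locations listed in the tables above can be assembled into full URLs programmatically. The following is a minimal sketch, not part of the released code: the helper names `checkpoint_url` and `pseudo_data_url` are hypothetical, and the `/resolve/<revision>/` direct-download pattern is an assumption about how the Hugging Face Hub serves files, so it is worth checking against the repository page before relying on it.

```python
from urllib.parse import quote

REPO_ID = "DecoderImmortal/CDA4GEC"  # repository named in the tables above
BASE = "https://huggingface.co"

# File and folder names copied from the tables above.
CHECKPOINTS = {
    "stage1": "stage1_checkpoint_best.pt",
    "stage2": "stage2_checkpoint_best.pt",
    "stage3": "stage3_checkpoint_best.pt",
}
PSEUDO_DIRS = {
    "stage2": "pseudo/stage2",
    "stage3": "pseudo/stage3",
}


def checkpoint_url(stage: str, revision: str = "main") -> str:
    """Build a direct-download URL for a checkpoint.

    Assumption: the Hub serves raw files under /resolve/<revision>/.
    """
    return f"{BASE}/{REPO_ID}/resolve/{revision}/{quote(CHECKPOINTS[stage])}"


def pseudo_data_url(stage: str, revision: str = "main") -> str:
    """Build the browser URL of a pseudo-data folder (the /tree/ paths above)."""
    return f"{BASE}/{REPO_ID}/tree/{revision}/{quote(PSEUDO_DIRS[stage])}"


if __name__ == "__main__":
    # Print every location so it can be passed to a downloader of your choice.
    for stage in CHECKPOINTS:
        print(stage, checkpoint_url(stage))
    for stage in PSEUDO_DIRS:
        print(stage, pseudo_data_url(stage))
```

The printed checkpoint URLs can be handed to any HTTP downloader. Once downloaded, the `.pt` files are intended to be loaded with Fairseq, as described in the Model Weights section.
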
# Citation
If you find this work useful for your research, please cite our paper:

```
@inproceedings{wang-etal-2024-improving-grammatical,
    title = "Improving Grammatical Error Correction via Contextual Data Augmentation",
    author = "Wang, Yixuan and
      Wang, Baoxin and
      Liu, Yijun and
      Zhu, Qingfu and
      Wu, Dayong and
      Che, Wanxiang",
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.647",
    pages = "10898--10910",
}
```