Implementation of the ACL 2024 Findings paper "Improving Grammatical Error Correction via Contextual Data Augmentation".
## Model Weights
We release the model weights for each training stage. Our models are trained with the Fairseq framework; details of the weights and their download links are listed below.
| Name | Data Info | Download Link |
|---|---|---|
| Stage1 | Pre-training on 200M-scale synthetic data built from C4 | CDA4GEC/tree/main/stage1_checkpoint_best.pt |
| Stage2+ | Fine-tuning on the augmented Lang-8, NUCLE, FCE, and W&I+L datasets | CDA4GEC/tree/main/stage2_checkpoint_best.pt |
| Stage3+ | Continued fine-tuning on the augmented W&I+L dataset | CDA4GEC/tree/main/stage3_checkpoint_best.pt |
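Since the checkpoints are in standard Fairseq format, they can be loaded through Fairseq's hub interface. The sketch below is a minimal, hedged example: the checkpoint directory, the `data-bin` dictionary path, and the subword-nmt BPE settings are assumptions, not part of this release, so adjust them to match your local setup.

```python
# Minimal sketch: loading a released checkpoint with Fairseq's hub interface.
# The directory layout, data-bin path, and BPE settings below are assumptions;
# replace them with whatever your local setup actually contains.
from fairseq.models.transformer import TransformerModel

gec = TransformerModel.from_pretrained(
    "checkpoints/stage3",                  # hypothetical folder holding the .pt file
    checkpoint_file="stage3_checkpoint_best.pt",
    data_name_or_path="data-bin",          # binarized dictionary directory (assumed)
    bpe="subword_nmt",                     # tokenization scheme is an assumption
    bpe_codes="data-bin/bpecodes",
)
gec.eval()

# Correct a single sentence; translate() runs encode -> generate -> decode.
print(gec.translate("She go to school every days ."))
```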
## Synthetic Data
We release only the synthetic pseudo-data; please follow the official application process to obtain the original annotated datasets.
| Data Info | Amount | Source | Path |
|---|---|---|---|
| Stage2+ | 2M | Lang-8 & NUCLE & FCE & W&I+L | CDA4GEC/tree/main/pseudo/stage2 |
| Stage3+ | 200K | W&I+L | CDA4GEC/tree/main/pseudo/stage3 |
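If you want to inspect or reuse the pseudo-data directly, the sketch below reads it as parallel (source, target) sentence pairs. The file names `train.src` / `train.tgt` are an assumption for illustration; check the actual files under `pseudo/stage2` and `pseudo/stage3` and adjust accordingly.

```python
# Minimal sketch: reading the augmented pseudo-data as (source, target) pairs.
# The file names train.src / train.tgt are hypothetical; use the actual file
# names shipped under pseudo/stage2 and pseudo/stage3.
from pathlib import Path

def load_parallel(src_path: str, tgt_path: str):
    """Yield (erroneous source, corrected target) sentence pairs."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.rstrip("\n"), t.rstrip("\n")

pairs = list(load_parallel("pseudo/stage2/train.src", "pseudo/stage2/train.tgt"))
print(len(pairs), pairs[0])
```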
## Citation
If you find this work useful for your research, please cite our paper:
```bibtex
@inproceedings{wang-etal-2024-improving-grammatical,
    title = "Improving Grammatical Error Correction via Contextual Data Augmentation",
    author = "Wang, Yixuan and
      Wang, Baoxin and
      Liu, Yijun and
      Zhu, Qingfu and
      Wu, Dayong and
      Che, Wanxiang",
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.647",
    pages = "10898--10910",
}
```