# LM-Combiner

All the code and models are released at [this link](https://github.com/wyxstriker/LM-Combiner). Thank you for your patience!

# Model Weights

- cbart_large.zip
  - Weights of the BART baseline model.
- lm_combiner.zip
  - Weights of LM-Combiner for the BART baseline on the FCGEC dataset.

# Requirements

The model is implemented with the Hugging Face framework; the required environment is as follows:

- Python
- torch
- transformers
- datasets
- tqdm

For evaluation, we follow the environment configuration of [ChERRANT](https://github.com/HillZhang1999/MuCGEC/tree/main/scorers/ChERRANT).

# Training Stage

## Preprocessing

### Baseline Model

- First, we train a baseline model (Chinese-BART-large) for LM-Combiner on the FCGEC dataset in Seq2Seq format (a sketch of this step follows the command below).

```bash
sh ./script/run_bart_baseline.sh
```
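
For reference, a minimal sketch of the Seq2Seq fine-tuning this script wraps. The `fnlp/bart-large-chinese` checkpoint, the TSV path, the `source`/`target` column names, and all hyperparameters here are illustrative assumptions; the actual script sets its own configuration.

```python
# Hedged sketch of baseline Seq2Seq fine-tuning; checkpoint name, file paths,
# column names, and hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from transformers import (BertTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = BertTokenizer.from_pretrained("fnlp/bart-large-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-large-chinese")

def encode(batch):
    # Tokenize erroneous sources; gold corrections become the decoder labels.
    enc = tok(batch["source"], truncation=True, max_length=128)
    enc["labels"] = tok(batch["target"], truncation=True, max_length=128)["input_ids"]
    return enc

train = load_dataset("csv", data_files="data/fcgec_train.tsv",
                     delimiter="\t")["train"]
train = train.map(encode, batched=True, remove_columns=train.column_names)

args = Seq2SeqTrainingArguments(output_dir="./cbart_large",
                                per_device_train_batch_size=32,
                                num_train_epochs=10,
                                learning_rate=3e-5)
Seq2SeqTrainer(model=model, args=args, train_dataset=train,
               data_collator=DataCollatorForSeq2Seq(tok, model=model)).train()
```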

### Candidate Datasets

1. Candidate Sentence Generation

   - We use the baseline model to generate candidate sentences for both the training and test sets.
   - On tasks where the model fits the training data closely (e.g., spelling correction), we recommend the K-fold cross-inference from the paper to generate candidate sentences for each fold separately (see the sketch after the command below).

```bash
python ./src/predict_bl_tsv.py
```
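
The point of K-fold cross-inference, sketched below, is that each training sentence receives its candidate from a model that never saw that sentence during training, keeping candidates realistically noisy. `train_baseline` and `generate` are hypothetical stand-ins for the repo's training and decoding code.

```python
# Sketch of K-fold cross-inference; train_baseline/generate are hypothetical
# stand-ins for the actual training and decoding code in this repo.
def train_baseline(pairs):
    # Stand-in: in practice this fine-tunes Chinese BART on the given pairs.
    return lambda source: source  # identity "model" keeps the sketch runnable

def generate(model, source):
    # Stand-in for decoding with the fold-specific baseline model.
    return model(source)

def kfold_candidates(pairs, k=5):
    """pairs: list of (source, target); returns one candidate per source."""
    fold_of = [i % k for i in range(len(pairs))]
    candidates = [None] * len(pairs)
    for fold in range(k):
        train = [p for i, p in enumerate(pairs) if fold_of[i] != fold]
        model = train_baseline(train)          # fit on the other k-1 folds
        for i, (source, _) in enumerate(pairs):
            if fold_of[i] == fold:             # infer only on the held-out fold
                candidates[i] = generate(model, source)
    return candidates
```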

2. Golden Label Merging

   - We use the ChERRANT tool to merge the golden labels, which fully decouples the error-correction task from the rewriting task (see the sketch after the command below).

```bash
python ./scorer_wapper/golden_label_merging.py
```
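
The idea is that the merged target keeps only those candidate edits that also appear in the gold reference, so LM-Combiner's training target never contains a wrong correction. Below is a self-contained sketch using `difflib` as a rough stand-in for ChERRANT's character-level edit alignment.

```python
# Sketch of golden-label merging; difflib is a rough stand-in for ChERRANT's
# character-level edit extraction.
import difflib

def edits(src: str, hyp: str):
    # Represent each edit as (start, end, replacement) over the source string.
    sm = difflib.SequenceMatcher(None, src, hyp)
    return {(i1, i2, hyp[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"}

def merge_golden(src: str, candidate: str, gold: str) -> str:
    # Keep only candidate edits that also occur in the gold reference ...
    kept = sorted(edits(src, candidate) & edits(src, gold))
    # ... and apply them to the source from left to right.
    out, last = [], 0
    for i1, i2, rep in kept:
        out.append(src[last:i1]); out.append(rep); last = i2
    out.append(src[last:])
    return "".join(out)
```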

## LM-Combiner (GPT2)

- Subsequently, we train LM-Combiner on the constructed candidate dataset.
- In particular, we supplement the GPT2 vocabulary (mainly with **double quotes**) to better fit the FCGEC dataset; see `./pt_model/gpt2-base/vocab.txt` for details. A sketch of the vocabulary extension follows the command below.

```bash
sh ./script/run_lm_combiner.sh
```
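
A sketch of the vocabulary supplementation, assuming the public `uer/gpt2-chinese-cluecorpussmall` checkpoint (which uses `BertTokenizer`); the repo instead ships its already-extended vocabulary in `./pt_model/gpt2-base/vocab.txt`.

```python
# Sketch of supplementing a GPT2 vocabulary; the checkpoint name is an
# assumption, and the repo ships a pre-extended vocab in ./pt_model/gpt2-base.
from transformers import BertTokenizer, GPT2LMHeadModel

name = "uer/gpt2-chinese-cluecorpussmall"
tok = BertTokenizer.from_pretrained(name)
model = GPT2LMHeadModel.from_pretrained(name)

added = tok.add_tokens(["“", "”"])        # Chinese-style double quotes
model.resize_token_embeddings(len(tok))   # grow embeddings to match new vocab
print(f"added {added} tokens, vocab size now {len(tok)}")
```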

# Evaluation

- We use the official ChERRANT script to evaluate the model on FCGEC-dev.

```bash
sh ./script/compute_score.sh
```

| Method | Prec | Rec | F0.5 |
|-|-|-|-|
| bart_baseline | 28.88 | **38.95** | 30.46 |
| +lm_combiner | **52.15** | 37.41 | **48.34** |
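
For reference, F0.5 weights precision twice as heavily as recall; the snippet below reproduces the `+lm_combiner` F0.5 from its Prec/Rec entries.

```python
# F_beta with beta=0.5 emphasizes precision, matching GEC evaluation practice.
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(f_beta(52.15, 37.41), 2))  # 48.34, the +lm_combiner row
```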

# Citation

If you find this work useful for your research, please cite our paper:

```
@inproceedings{wang-etal-2024-lm-combiner,
    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
    author = "Wang, Yixuan and
      Wang, Baoxin and
      Liu, Yijun and
      Wu, Dayong and
      Che, Wanxiang",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.934",
    pages = "10675--10685",
}
```