# LM-Combiner
All the code and model weights have been released at this [link](https://github.com/wyxstriker/LM-Combiner). Thank you for your patience!

# Model Weights
- cbart_large.zip
  - Weights of the BART baseline model.
- lm_combiner.zip
  - Weights of LM-Combiner for the BART baseline on the FCGEC dataset.

# Requirements

The model is implemented with the HuggingFace framework, and the required environment is as follows (a quick environment check is sketched at the end of this section):
- Python
- torch
- transformers
- datasets
- tqdm

For evaluation, we follow the environment configuration of [ChERRANT](https://github.com/HillZhang1999/MuCGEC/tree/main/scorers/ChERRANT).
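As a quick sanity check of the environment, the following sketch prints the installed version of each package listed above (the package names are the assumed PyPI names):

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed version of each required package, or flag it as missing.
for pkg in ["torch", "transformers", "datasets", "tqdm"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```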

# Training Stage
## Preprocessing
### Baseline Model 
- First, we train a baseline model (Chinese BART-large) for LM-Combiner on the FCGEC dataset in the Seq2Seq format; a sketch of the data format follows the command.
```bash
sh ./script/run_bart_baseline.sh
```
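For reference, the sketch below shows what the Seq2Seq format looks like with a HuggingFace tokenizer; the checkpoint name and the example sentence pair are illustrative assumptions, and the actual training loop lives in the script above.

```python
from transformers import BertTokenizer

# Chinese BART checkpoints (e.g. fnlp/bart-large-chinese) pair the BART model
# with a BERT-style tokenizer.
tokenizer = BertTokenizer.from_pretrained("fnlp/bart-large-chinese")

# Seq2Seq format: the erroneous sentence is the encoder input and the
# corrected sentence is the decoder target.
batch = tokenizer(
    ["他明天去了北京。"],              # source (ungrammatical)
    text_target=["他昨天去了北京。"],  # target (corrected)
    return_tensors="pt",
    padding=True,
)
print(batch["input_ids"].shape, batch["labels"].shape)
```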
### Candidate Datasets
1. Candidate Sentence Generation
- We use the baseline model to generate candidate sentences for the training and test sets (a minimal generation sketch follows the command below).
- On tasks where the model fits the training data better (spelling correction, etc.), we recommend using the K-fold cross-inference from the paper to generate candidate sentences separately (a toy illustration of the fold split is also included below).
```bash
python ./src/predict_bl_tsv.py
```
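A minimal generation sketch, assuming the trained baseline is unpacked to `./cbart_large` (the repo's `predict_bl_tsv.py` handles batching and TSV output):

```python
import torch
from transformers import BartForConditionalGeneration, BertTokenizer

# Load the fine-tuned baseline (path assumed; see cbart_large.zip above).
tokenizer = BertTokenizer.from_pretrained("./cbart_large")
model = BartForConditionalGeneration.from_pretrained("./cbart_large").eval()

sentences = ["他明天去了北京。"]  # stand-in for the training/test sentences
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=5,
        max_length=128,
    )
# BERT-style decoding inserts spaces between characters; strip them.
candidates = [c.replace(" ", "") for c in
              tokenizer.batch_decode(outputs, skip_special_tokens=True)]
print(candidates)
```

And a toy illustration of the K-fold cross-inference idea (assumed, not the repo's script): each fold of the training set receives its candidates from a baseline trained only on the remaining folds, so no training sentence is predicted by a model that has already seen it.

```python
# Partition the training set into K folds; predict each fold with a baseline
# fine-tuned only on the other folds.
K = 5
train_set = [f"sentence_{i}" for i in range(20)]  # stand-in data
folds = [train_set[i::K] for i in range(K)]
for k, held_out in enumerate(folds):
    train_part = [s for j, fold in enumerate(folds) if j != k for s in fold]
    # 1) fine-tune a baseline on train_part
    # 2) run the generation loop above on held_out
    print(f"fold {k}: train on {len(train_part)}, predict {len(held_out)}")
```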
2. Golden Labels Merging
- We use the ChERRANT tool to fully decouple the error correction task from the rewriting task by merging the golden labels (a toy illustration follows the command).
```bash
python ./scorer_wapper/golden_label_merging.py
```
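Conceptually, one reading of this step is that only the candidate edits that agree with the gold edits are kept and applied to the source, so the rewriting target never has to introduce corrections the candidate missed. Below is a toy character-level illustration with `difflib`; the repo's implementation relies on ChERRANT edit extraction, and the exact merging rule may differ.

```python
import difflib

def edits(src: str, tgt: str):
    """Character-level edits (start, end, replacement) turning src into tgt."""
    ops = difflib.SequenceMatcher(None, src, tgt).get_opcodes()
    return [(i1, i2, tgt[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def merge_golden_labels(src: str, candidate: str, gold: str) -> str:
    """Apply to src only those candidate edits that also appear in the gold edits."""
    gold_edits = set(edits(src, gold))
    kept = sorted(e for e in edits(src, candidate) if e in gold_edits)
    out, last = [], 0
    for start, end, replacement in kept:
        out.append(src[last:start])
        out.append(replacement)
        last = end
    out.append(src[last:])
    return "".join(out)

# The candidate fixes the tense error but also over-inserts a "了";
# the merged target keeps only the edit that matches the gold correction.
print(merge_golden_labels("他明天去了北京。", "他昨天去了北京了。", "他昨天去了北京。"))
# -> 他昨天去了北京。
```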
## LM-Combiner (GPT-2)
- Subsequently, we train LM-Combiner on the constructed candidate dataset.
- In particular, we supplement the GPT-2 vocabulary (mainly **double quotes**) to better fit the FCGEC dataset; see ```./pt_model/gpt2-base/vocab.txt``` for details and the sketch after the command below.
```bash
sh ./script/run_lm_combiner.sh
```
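For context, the sketch below shows a programmatic equivalent of supplementing the vocabulary with the `transformers` API; it assumes the Chinese GPT-2 checkpoint uses a BERT-style `vocab.txt` tokenizer (as e.g. `uer/gpt2-chinese-cluecorpussmall` does), whereas the repo simply ships a pre-edited vocab file.

```python
from transformers import BertTokenizer, GPT2LMHeadModel

# Load the GPT-2 checkpoint shipped under ./pt_model/gpt2-base.
tokenizer = BertTokenizer.from_pretrained("./pt_model/gpt2-base")
model = GPT2LMHeadModel.from_pretrained("./pt_model/gpt2-base")

# Add the full-width double quotes used in FCGEC and resize the embedding
# matrix so the new token ids get trainable rows.
num_added = tokenizer.add_tokens(["“", "”"])
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```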

# Evaluation
- We use the official ChERRANT script to evaluate the model on the FCGEC dev set (the F0.5 formula is shown after the table below).
```shell
sh ./script/compute_score.sh
```
|Method|Prec|Rec|F0.5|
|-|-|-|-|
|bart_baseline|28.88|**38.95**|40.46|
|+lm_combiner|**52.15**|37.41|**48.34**|
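
For reference, ChERRANT's F0.5 weights precision more heavily than recall: F0.5 = 1.25 · P · R / (0.25 · P + R). A quick check against the +lm_combiner row:

```python
# F-beta with beta = 0.5, i.e. precision is weighted more heavily than recall.
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(round(f_beta(52.15, 37.41), 2))  # 48.34, matching the table
```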
# Citation

If you find this work useful for your research, please cite our paper:

```
@inproceedings{wang-etal-2024-lm-combiner,
    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
    author = "Wang, Yixuan  and
      Wang, Baoxin  and
      Liu, Yijun  and
      Wu, Dayong  and
      Che, Wanxiang",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.934",
    pages = "10675--10685",
}
```