# LM-Combiner

All the code and models are released at [this link](https://github.com/wyxstriker/LM-Combiner). Thank you for your patience!

# Model Weights

- cbart_large.zip
  - Weights of the BART baseline model.
- lm_combiner.zip
  - Weights of LM-Combiner for the BART baseline on the FCGEC dataset.

# Requirements

The model is implemented with the Hugging Face framework; the required environment is as follows:

- Python
- torch
- transformers
- datasets
- tqdm

For evaluation, we follow the environment configuration of [ChERRANT](https://github.com/HillZhang1999/MuCGEC/tree/main/scorers/ChERRANT).

# Training Stage

## Preprocessing

### Baseline Model

- First, we train a baseline model (Chinese-BART-large) for LM-Combiner on the FCGEC dataset in Seq2Seq format (a sketch of this step follows the command below).

```bash
sh ./script/run_bart_baseline.sh
```
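
For reference, a minimal sketch of the Seq2Seq fine-tuning this script wraps. The `fnlp/bart-large-chinese` checkpoint, the TSV path, the `source`/`target` column names, and all hyperparameters here are illustrative assumptions; the actual script sets its own configuration.

```python
# Hedged sketch of baseline Seq2Seq fine-tuning; checkpoint name, file paths,
# column names, and hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from transformers import (BertTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = BertTokenizer.from_pretrained("fnlp/bart-large-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-large-chinese")

def encode(batch):
    # Tokenize erroneous sources; gold corrections become the decoder labels.
    enc = tok(batch["source"], truncation=True, max_length=128)
    enc["labels"] = tok(batch["target"], truncation=True, max_length=128)["input_ids"]
    return enc

train = load_dataset("csv", data_files="data/fcgec_train.tsv",
                     delimiter="\t")["train"]
train = train.map(encode, batched=True, remove_columns=train.column_names)

args = Seq2SeqTrainingArguments(output_dir="./cbart_large",
                                per_device_train_batch_size=32,
                                num_train_epochs=10,
                                learning_rate=3e-5)
Seq2SeqTrainer(model=model, args=args, train_dataset=train,
               data_collator=DataCollatorForSeq2Seq(tok, model=model)).train()
```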

### Candidate Datasets

1. Candidate Sentence Generation

   - We use the baseline model to generate candidate sentences for both the training and test sets.
   - On tasks where the model fits the training data closely (e.g., spelling correction), we recommend the K-fold cross-inference from the paper to generate candidate sentences for each fold separately (see the sketch after the command below).

```bash
python ./src/predict_bl_tsv.py
```
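
The point of K-fold cross-inference, sketched below, is that each training sentence receives its candidate from a model that never saw that sentence during training, keeping candidates realistically noisy. `train_baseline` and `generate` are hypothetical stand-ins for the repo's training and decoding code.

```python
# Sketch of K-fold cross-inference; train_baseline/generate are hypothetical
# stand-ins for the actual training and decoding code in this repo.
def train_baseline(pairs):
    # Stand-in: in practice this fine-tunes Chinese BART on the given pairs.
    return lambda source: source  # identity "model" keeps the sketch runnable

def generate(model, source):
    # Stand-in for decoding with the fold-specific baseline model.
    return model(source)

def kfold_candidates(pairs, k=5):
    """pairs: list of (source, target); returns one candidate per source."""
    fold_of = [i % k for i in range(len(pairs))]
    candidates = [None] * len(pairs)
    for fold in range(k):
        train = [p for i, p in enumerate(pairs) if fold_of[i] != fold]
        model = train_baseline(train)          # fit on the other k-1 folds
        for i, (source, _) in enumerate(pairs):
            if fold_of[i] == fold:             # infer only on the held-out fold
                candidates[i] = generate(model, source)
    return candidates
```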

2. Golden Label Merging

   - We use the ChERRANT tool to merge the golden labels, which fully decouples the error-correction task from the rewriting task (see the sketch after the command below).

```bash
python ./scorer_wapper/golden_label_merging.py
```
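
The idea is that the merged target keeps only those candidate edits that also appear in the gold reference, so LM-Combiner's training target never contains a wrong correction. Below is a self-contained sketch using `difflib` as a rough stand-in for ChERRANT's character-level edit alignment.

```python
# Sketch of golden-label merging; difflib is a rough stand-in for ChERRANT's
# character-level edit extraction.
import difflib

def edits(src: str, hyp: str):
    # Represent each edit as (start, end, replacement) over the source string.
    sm = difflib.SequenceMatcher(None, src, hyp)
    return {(i1, i2, hyp[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"}

def merge_golden(src: str, candidate: str, gold: str) -> str:
    # Keep only candidate edits that also occur in the gold reference ...
    kept = sorted(edits(src, candidate) & edits(src, gold))
    # ... and apply them to the source from left to right.
    out, last = [], 0
    for i1, i2, rep in kept:
        out.append(src[last:i1]); out.append(rep); last = i2
    out.append(src[last:])
    return "".join(out)
```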

## LM-Combiner (GPT2)

- Subsequently, we train LM-Combiner on the constructed candidate dataset.
- In particular, we supplement the GPT2 vocabulary (mainly with **double quotes**) to better fit the FCGEC dataset; see `./pt_model/gpt2-base/vocab.txt` for details. A sketch of the vocabulary extension follows the command below.

```bash
sh ./script/run_lm_combiner.sh
```
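
A sketch of the vocabulary supplementation, assuming the public `uer/gpt2-chinese-cluecorpussmall` checkpoint (which uses `BertTokenizer`); the repo instead ships its already-extended vocabulary in `./pt_model/gpt2-base/vocab.txt`.

```python
# Sketch of supplementing a GPT2 vocabulary; the checkpoint name is an
# assumption, and the repo ships a pre-extended vocab in ./pt_model/gpt2-base.
from transformers import BertTokenizer, GPT2LMHeadModel

name = "uer/gpt2-chinese-cluecorpussmall"
tok = BertTokenizer.from_pretrained(name)
model = GPT2LMHeadModel.from_pretrained(name)

added = tok.add_tokens(["“", "”"])        # Chinese-style double quotes
model.resize_token_embeddings(len(tok))   # grow embeddings to match new vocab
print(f"added {added} tokens, vocab size now {len(tok)}")
```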

# Evaluation

- We use the official ChERRANT script to evaluate the model on FCGEC-dev.

```bash
sh ./script/compute_score.sh
```

| Method | Prec | Rec | F0.5 |
|-|-|-|-|
| bart_baseline | 28.88 | **38.95** | 30.46 |
| +lm_combiner | **52.15** | 37.41 | **48.34** |
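
For reference, F0.5 weights precision twice as heavily as recall; the snippet below reproduces the `+lm_combiner` F0.5 from its Prec/Rec entries.

```python
# F_beta with beta=0.5 emphasizes precision, matching GEC evaluation practice.
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(f_beta(52.15, 37.41), 2))  # 48.34, the +lm_combiner row
```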

# Citation

If you find this work useful for your research, please cite our paper:

```
@inproceedings{wang-etal-2024-lm-combiner,
    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
    author = "Wang, Yixuan and
      Wang, Baoxin and
      Liu, Yijun and
      Wu, Dayong and
      Che, Wanxiang",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.934",
    pages = "10675--10685",
}
```