twnlp/ChineseErrorCorrector3-4B

Project repository: https://github.com/TW-NLP/ChineseErrorCorrector

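Model list

Model name	Correction type	Description
twnlp/ChineseErrorCorrector3-4B	Grammar + spelling	Full training on 2 million correction samples; suitable for grammatical and spelling error correction. Best results; recommended.
twnlp/ChineseErrorCorrector2-7B	Grammar + spelling	Multi-round iterative training on 2 million correction samples; suitable for grammatical and spelling error correction. Good results.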

Model evaluation (NaCGEC Data)

Model Name	Model Link	Base Model	Avg	SIGHAN-2015	EC-LAW	MCSC	GPU	QPS
Kenlm-CSC	shibing624/chinese-kenlm-klm	kenlm	0.3409	0.3147	0.3763	0.3317	CPU	9
Mengzi-T5-CSC	shibing624/mengzi-t5-base-chinese-correction	mengzi-t5-base	0.3984	0.7758	0.3156	0.1039	GPU	214
ERNIE-CSC	PaddleNLP/ernie-csc	PaddlePaddle/ernie-1.0-base-zh	0.4353	0.8383	0.3357	0.1318	GPU	114
MacBERT-CSC	shibing624/macbert4csc-base-chinese	hfl/chinese-macbert-base	0.3993	0.8314	0.1610	0.2055	GPU	224
ChatGLM3-6B-CSC	shibing624/chatglm3-6b-csc-chinese-lora	THUDM/chatglm3-6b	0.4538	0.6572	0.4369	0.2672	GPU	3
Qwen2.5-1.5B-CTC	shibing624/chinese-text-correction-1.5b	Qwen/Qwen2.5-1.5B-Instruct	0.6802	0.3032	0.7846	0.9529	GPU	6
Qwen2.5-7B-CTC	shibing624/chinese-text-correction-7b	Qwen/Qwen2.5-7B-Instruct	0.8225	0.4917	0.9798	0.9959	GPU	3
Qwen3-4B-CTC (Ours)	twnlp/ChineseErrorCorrector3-4B	Qwen/Qwen3-4B	0.8521	0.6340	0.9360	0.9864	GPU	5
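
The Avg column appears to be the unweighted mean of the three per-dataset scores (SIGHAN-2015, EC-LAW, MCSC). The source does not state this, so treat it as an assumption. A minimal Python check against three rows copied from the table (the helper name mean_score is hypothetical):

```python
# Scores copied from the table above: (SIGHAN-2015, EC-LAW, MCSC) and the reported Avg.
rows = {
    "Kenlm-CSC": ((0.3147, 0.3763, 0.3317), 0.3409),
    "Qwen2.5-7B-CTC": ((0.4917, 0.9798, 0.9959), 0.8225),
    "Qwen3-4B-CTC": ((0.6340, 0.9360, 0.9864), 0.8521),
}


def mean_score(scores):
    """Unweighted arithmetic mean of the per-dataset scores."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    computed = mean_score(scores)
    # Assumption: Avg is a plain mean, reported to four decimals.
    print(f"{name}: computed={computed:.4f} reported={reported:.4f}")
```

Under that assumption, a model that is strong on one dataset and weak on the others (for example MacBERT-CSC on SIGHAN-2015) can still end up with a low Avg, so the per-dataset columns are worth reading alongside it.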

The model can also be used directly with transformers, without installing ChineseErrorCorrector:

The input is formatted with the tokenizer's chat template and passed to the model, which generates the corrected sentence.

Install the required packages (torch is the backend; accelerate is needed for device_map="auto"):

pip install transformers torch accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "twnlp/ChineseErrorCorrector3-4B"

# Load the model and tokenizer; device_map="auto" places the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Instruction prompt, kept in Chinese as in the original. In English: "You are a text
# correction expert. Correct the grammatical errors in the input sentence and output
# the corrected sentence. The input sentence is:"
prompt = "你是一个文本纠错专家，纠正输入句子中的语法错误，并输出正确的句子，输入句子为："
# Example input containing an error: "一丝不够" should be "一丝不苟".
text_input = "对待每一项工作都要一丝不够。"
messages = [
    {"role": "user", "content": prompt + text_input}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# generate() returns the prompt tokens followed by the new tokens; keep only the new ones.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Output:

对待每一项工作都要一丝不苟。
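
To see which characters the model changed, the input and the corrected output can be compared with Python's standard difflib module. This is a minimal sketch and not part of the ChineseErrorCorrector project; the helper name changed_spans is hypothetical.

```python
from difflib import SequenceMatcher


def changed_spans(source, corrected):
    """Return (tag, source_fragment, corrected_fragment) for each span that differs."""
    matcher = SequenceMatcher(None, source, corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replace / insert / delete operations
    ]


# Input and output sentences from the example above.
source = "对待每一项工作都要一丝不够。"
corrected = "对待每一项工作都要一丝不苟。"
print(changed_spans(source, corrected))
```

For the example sentence this reports a single replacement of 够 by 苟. An empty list means the model returned the sentence unchanged.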
