Ancient Chinese Translator + Phonology Model (SimaQian)
Name Origin:
The model name comes from the famous ancient Chinese historian Sima Qian (司馬遷), known for the Records of the Grand Historian, a general history of China covering more than two thousand years.
This model combines two key functionalities for Ancient Chinese texts:
1. Translation: Converts Ancient Chinese passages into modern Chinese.
2. Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).
Model Description
• Architecture: Fine-tuned on top of Google’s Gemma 2 model using LoRA.
• Input Format: Special tokens <start_of_turn> / <end_of_turn> mark user vs. model turns (see the sketch after this list).
• Output: Era identification (optional), phonetic renderings, and modern Chinese translations.
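If the tokenizer ships a chat template for this format (an assumption; the Usage section below builds the prompt by hand), the same turn markers can be produced with the standard transformers helper:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")

messages = [
    {"role": "user", "content": "Given the ancient text: 「子曰:學而時習之,不亦說乎?」 translate it into modern Chinese."}
]
# If a Gemma-style chat template is bundled, this yields a string wrapped in
# <start_of_turn>user ... <end_of_turn> plus the opening <start_of_turn>model tag.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)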
Training Data
• Translation: Erya dataset from RUCAIBox/Erya-dataset.
• Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
• Fine-Tuning: LoRA-based parameter-efficient fine-tuning of Gemma 2 Instruct (a minimal configuration sketch follows this list).
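A minimal sketch of what the LoRA setup might look like with the peft library; the rank, alpha, target modules, and base checkpoint below are illustrative assumptions, not the released training configuration:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base checkpoint; substitute whichever Gemma 2 Instruct size was actually used.
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

lora_config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,                         # illustrative scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable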
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")
model = AutoModelForCausalLM.from_pretrained("lordChipotle/SimaQian")

# Build the prompt with the Gemma-style turn markers described above.
prompt = """<start_of_turn>user
Given the ancient text: 「子曰:學而時習之,不亦說乎?」
- Identify the era
- Provide the phonetic reading
- Translate into modern Chinese<end_of_turn>
<start_of_turn>model
"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
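The call above echoes the full prompt back along with the generated turn. A small follow-up sketch, assuming the standard transformers layout where prompt tokens come first in the output sequence, that prints only the model's reply:

# Slice off the prompt tokens so only the newly generated turn is decoded.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))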
Limitations and Biases
• Era Estimation: The model may not always identify the historical era correctly.
• Pronunciations: Reconstructions are approximate and can differ between scholarly reconstruction systems.
• Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.