Ancient Chinese Translator + Phonology Model (SimaQian)
Name Origin:
The model name comes from the famous ancient Chinese historian Sima Qian (司馬遷), known for the Records of the Grand Historian, a general history of China covering more than two thousand years.
This model combines two key functionalities for Ancient Chinese texts:
1. Translation: Converts Ancient Chinese passages into modern Chinese.
2. Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).
Model Description
• Architecture: Fine-tuned on top of Google’s Gemma 2 model using LoRA.
• Input Format: Special tokens <start_of_turn> / <end_of_turn> mark user vs. model turns (see the sketch after this list).
• Output: Era identification (optional), phonetic renderings, and modern Chinese translations.
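If the tokenizer ships a chat template for this format (an assumption; the Usage section below builds the prompt by hand), the same turn markers can be produced with the standard transformers helper:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")

messages = [
    {"role": "user", "content": "Given the ancient text: 「子曰:學而時習之,不亦說乎?」 translate it into modern Chinese."}
]
# If a Gemma-style chat template is bundled, this yields a string wrapped in
# <start_of_turn>user ... <end_of_turn> plus the opening <start_of_turn>model tag.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)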
Training Data
• Translation: Erya dataset from RUCAIBox/Erya-dataset.
• Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
• Fine-Tuning: LoRA-based parameter-efficient fine-tuning of Gemma 2 Instruct (a minimal configuration sketch follows this list).
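A minimal sketch of what the LoRA setup might look like with the peft library; the rank, alpha, target modules, and base checkpoint below are illustrative assumptions, not the released training configuration:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base checkpoint; substitute whichever Gemma 2 Instruct size was actually used.
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

lora_config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,                         # illustrative scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable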
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")
model = AutoModelForCausalLM.from_pretrained("lordChipotle/SimaQian")

# Build the prompt with the Gemma-style turn markers described above.
prompt = """<start_of_turn>user
Given the ancient text: 「子曰:學而時習之,不亦說乎?」
- Identify the era
- Provide the phonetic reading
- Translate into modern Chinese<end_of_turn>
<start_of_turn>model
"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
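The call above echoes the full prompt back along with the generated turn. A small follow-up sketch, assuming the standard transformers layout where prompt tokens come first in the output sequence, that prints only the model's reply:

# Slice off the prompt tokens so only the newly generated turn is decoded.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))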
Limitations and Biases
• Era Estimation: The model may not always identify the historical era correctly.
• Pronunciations: Reconstructions are approximate and can differ between scholarly reconstruction systems.
• Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.