---
datasets:
- RUCAIBox/Erya-dataset
language:
- en
base_model:
- google/gemma-2-2b-it
tags:
- ancient-chinese
- chinese
- literature
- unsloth
- trl
- sft
---
Ancient Chinese Translator + Phonology Model (SimaQian)
Name Origin:
The model is named after the ancient Chinese historian Sima Qian (司馬遷), known for his Records of the Grand Historian, a general history of China covering more than two thousand years.
This model combines two key functionalities for Ancient Chinese texts:
1. Translation: Converts Ancient Chinese passages into modern Chinese.
2. Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).
Model Description
• Architecture: Google’s Gemma 2 2B Instruct (google/gemma-2-2b-it), fine-tuned with LoRA.
• Input Format: The special tokens <start_of_turn> and <end_of_turn> delimit each turn; the role name (user or model) follows <start_of_turn>.
• Output: Era identification (optional), phonetic renderings, and modern Chinese translations.
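The turn format described above can be assembled programmatically. The following is a minimal sketch, not the author's preprocessing code: the helper name build_prompt and its parameters are hypothetical, and the layout (with <end_of_turn> on its own line) simply mirrors the example in the Usage section.

```python
def build_prompt(ancient_text: str, tasks: list[str]) -> str:
    """Wrap a request in Gemma-style turn markers, ending with an open model turn.

    Hypothetical helper: the layout mirrors this card's usage example.
    """
    # Number the requested tasks as 1), 2), 3), ...
    numbered = "\n".join(f"{i}) {task}" for i, task in enumerate(tasks, start=1))
    return (
        "<start_of_turn>user\n"
        f"Given the ancient text: 「{ancient_text}」\n"
        f"{numbered}\n"
        "<end_of_turn>\n"
        # Leaving the model turn open asks the model to write the reply.
        "<start_of_turn>model\n"
    )


prompt = build_prompt(
    "學而時習之",
    ["Identify the era", "Provide the phonetic reading", "Translate into modern Chinese"],
)
print(prompt)
```

Dropping an entry from the task list (for example, era identification) yields a shorter request in the same format.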
Training Data
• Translation: Erya dataset from RUCAIBox/Erya-dataset.
• Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
• Fine-Tuning: LoRA-based parameter-efficient approach on Gemma 2 Instruct.
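To give a sense of why LoRA counts as parameter-efficient, the sketch below compares the size of one full weight matrix with the low-rank update LoRA trains in its place. The matrix dimensions and the rank are illustrative assumptions, not values taken from this model's training configuration.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A, with B of shape (d_out, rank)
    and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Illustrative values only (assumed, not from this model's training config).
d_out, d_in, rank = 2304, 2304, 16

full = d_out * d_in
lora = lora_param_count(d_out, d_in, rank)
print(f"full matrix: {full} parameters")
print(f"LoRA update: {lora} parameters ({lora / full:.2%} of the full matrix)")
```

The base weights stay frozen during fine-tuning; only the two small factors are updated.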
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")
model = AutoModelForCausalLM.from_pretrained("lordChipotle/SimaQian")

# One user turn with the request, followed by an open model turn for the reply.
prompt = """
<start_of_turn>user
Given the ancient text: 「子曰:學而時習之,不亦說乎?」
1) Identify the era
2) Provide the phonetic reading
3) Translate into modern Chinese
<end_of_turn>
<start_of_turn>model
"""

inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens limits only the generated text; max_length would also count the prompt.
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
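For a decoder-only model, the decoded output contains the prompt as well as the generated continuation. A small post-processing step can keep only the model's reply. This is a sketch under that assumption: extract_model_reply is a hypothetical helper, and the sample string is a stand-in, not real model output.

```python
END_OF_TURN = "<end_of_turn>"
MODEL_TURN = "<start_of_turn>model"


def extract_model_reply(decoded: str) -> str:
    """Return the text of the last model turn in a decoded generation."""
    # Keep only what follows the last model-turn marker (whole text if absent).
    _, _, reply = decoded.rpartition(MODEL_TURN)
    # Drop everything from the end-of-turn marker onward, if one was emitted.
    reply, _, _ = reply.partition(END_OF_TURN)
    return reply.strip()


# Stand-in for tokenizer.decode(outputs[0]); not real model output.
sample = (
    "<bos><start_of_turn>user\nTranslate into modern Chinese\n<end_of_turn>\n"
    "<start_of_turn>model\n(model output here)<end_of_turn>"
)
print(extract_model_reply(sample))
```

In practice the function would be applied to the string returned by tokenizer.decode(outputs[0]).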
Limitations and Biases
• Era Estimation: The model may misidentify the historical era of a passage.
• Pronunciations: Reconstructions are approximate, and readings can differ between scholarly reconstructions.
• Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.