---
datasets:
- RUCAIBox/Erya-dataset
language:
- en
base_model:
- google/gemma-2-2b-it
tags:
- ancient-chinese
- chinese
- literature
- unsloth
- trl
- sft
---
Ancient Chinese Translator + Phonology Model (SimaQian)
Name Origin:
The model is named after Sima Qian (司馬遷), the famous ancient Chinese historian known for his Records of the Grand Historian, a general history of China covering more than two thousand years.
This model combines two key functionalities for Ancient Chinese texts:
1. Translation: Converts Ancient Chinese passages into modern Chinese.
2. Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).
Model Description
• Architecture: Google's Gemma 2 2B Instruct (google/gemma-2-2b-it), fine-tuned with LoRA.
• Input Format: Gemma chat format, in which the special tokens <start_of_turn> and <end_of_turn> delimit the user and model turns.
• Output: Era identification (optional), phonetic renderings, and modern Chinese translations.
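The turn format described above can be assembled programmatically. The following is a minimal sketch, assuming the prompt layout used in this card's usage example; `build_prompt` and its parameters are hypothetical names, not part of the model's API.

```python
def build_prompt(ancient_text: str, tasks: list[str]) -> str:
    """Wrap a request in Gemma-style turn markers (hypothetical helper)."""
    # Number the requested tasks as "1) ...", "2) ...", one per line.
    numbered = "\n".join(f"{i}) {task}" for i, task in enumerate(tasks, start=1))
    return (
        "<start_of_turn>user\n"
        f"Given the ancient text: 「{ancient_text}」\n"
        f"{numbered}\n"
        "<end_of_turn>\n"
        # Leave the model turn open so generation continues from here.
        "<start_of_turn>model\n"
    )


prompt = build_prompt(
    "子曰:學而時習之,不亦說乎?",
    ["Identify the era", "Provide the phonetic reading", "Translate into modern Chinese"],
)
print(prompt)
```

Dropping an entry from `tasks` requests fewer outputs, for example translation only; how well the model follows a reduced task list is not documented here and would need to be checked.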
Training Data
• Translation: Erya dataset from RUCAIBox/Erya-dataset.
• Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
• Fine-Tuning: LoRA-based parameter-efficient approach on Gemma 2 Instruct.
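To illustrate why LoRA counts as parameter-efficient: it freezes a weight matrix W and trains two small matrices B and A whose product is added to W. The sketch below counts the trainable parameters for a single matrix. The dimensions and rank are illustrative assumptions; this card does not state the values used in training.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    W stays frozen; LoRA learns B (d_out x rank) and A (rank x d_in),
    and the layer uses W + B @ A (up to a scaling factor).
    """
    return rank * (d_out + d_in)


# Illustrative sizes only; not taken from this model's configuration.
d_out, d_in, rank = 2048, 2048, 16
full = d_out * d_in
added = lora_param_count(d_out, d_in, rank)
print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters ({added / full:.2%} of the full matrix)")
```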
Usage
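from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")
model = AutoModelForCausalLM.from_pretrained("lordChipotle/SimaQian")
# Gemma-style chat prompt: one user turn, then an open model turn to complete.
prompt = """
<start_of_turn>user
Given the ancient text: 「子曰:學而時習之,不亦說乎?」
1) Identify the era
2) Provide the phonetic reading
3) Translate into modern Chinese
<end_of_turn>
<start_of_turn>model
"""
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens limits only the generated text; max_length would also count the prompt tokens.
outputs = model.generate(**inputs, max_new_tokens=256)
# The decoded string contains the prompt followed by the model's reply.
print(tokenizer.decode(outputs[0]))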
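Because the decoded string includes the prompt as well as the reply, it can be useful to keep only the final model turn. A minimal sketch, assuming the decoded text retains the turn markers; `extract_reply` is a hypothetical helper and the sample string is a placeholder, not real model output.

```python
MODEL_TURN = "<start_of_turn>model"
END_TURN = "<end_of_turn>"


def extract_reply(decoded: str) -> str:
    """Return the text of the last model turn in a decoded generation."""
    # Keep only what follows the final model-turn marker
    # (if the marker is absent, the whole string is kept).
    _, _, reply = decoded.rpartition(MODEL_TURN)
    # Drop the end-of-turn marker and anything after it, if present.
    reply = reply.split(END_TURN, 1)[0]
    return reply.strip()


# Placeholder string for illustration only.
sample = (
    "<bos>\n<start_of_turn>user\n(request)\n<end_of_turn>\n"
    "<start_of_turn>model\n(model reply)<end_of_turn>"
)
print(extract_reply(sample))
```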
Limitations and Biases
• Era Estimation: The model may misidentify the historical era of a passage.
• Pronunciations: Reconstructed pronunciations are approximate, and competing scholarly reconstructions may differ.
• Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.