---
datasets:
- RUCAIBox/Erya-dataset
language:
- en
base_model:
- google/gemma-2-2b-it
tags:
- ancient-chinese
- chinese
- literature
- unsloth
- trl
- sft
---
Ancient Chinese Translator + Phonology Model (SimaQian)

Name Origin:

The model is named after the famous ancient Chinese historian Sima Qian (司馬遷), known for his Records of the Grand Historian, a general history of China covering more than two thousand years.

This model combines two key functionalities for Ancient Chinese texts:

	1.	Translation: Converts Ancient Chinese passages into modern Chinese.
    
	2.	Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).


Model Description

	•	Architecture: Fine-tuned from Google's Gemma 2 instruction-tuned model (google/gemma-2-2b-it) using LoRA.
    
	•	Input Format: The special tokens <start_of_turn> and <end_of_turn> mark where each user or model turn begins and ends.
    
	•	Output: Era identification (optional), phonetic renderings, and modern Chinese translations.
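
As a minimal sketch of the input format described above, the helper below wraps a passage in the turn markers. The function name build_prompt and the exact wording of the request are illustrative assumptions modeled on the prompt in the Usage section, not part of the model's API.

```python
def build_prompt(ancient_text: str) -> str:
    """Wrap a request in Gemma-style turn markers (hypothetical helper)."""
    # The three-part request mirrors the example prompt in the Usage section.
    user_message = (
        f"Given the ancient text: 「{ancient_text}」\n"
        "1) Identify the era\n"
        "2) Provide the phonetic reading\n"
        "3) Translate into modern Chinese"
    )
    # A closed user turn followed by an open model turn, so that
    # generation continues as the model's reply.
    return (
        "<start_of_turn>user\n"
        f"{user_message}\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


if __name__ == "__main__":
    print(build_prompt("子曰:學而時習之,不亦說乎?"))
```

The returned string can be passed to the tokenizer in place of a hand-written prompt.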

Training Data
	•	Translation: Erya dataset from RUCAIBox/Erya-dataset.
	•	Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
	•	Fine-Tuning: LoRA-based parameter-efficient approach on Gemma 2 Instruct.
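
To illustrate why LoRA counts as parameter-efficient, the sketch below computes how many trainable parameters a LoRA adapter adds to a single weight matrix. The layer dimensions and rank are hypothetical values chosen for illustration; the actual settings used to train this model are not stated here.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA keeps the original weight W frozen and learns a low-rank update
    B @ A, where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 2048, 2048, 16
full = d_in * d_out
added = lora_param_count(d_in, d_out, rank)
print(f"full matrix: {full:,} params; LoRA adapter: {added:,} params ({added / full:.2%})")
```

Because the adapter size grows linearly with the rank rather than with the product of the layer dimensions, only a small fraction of the weights needs gradients and optimizer state.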


Usage


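  from transformers import AutoTokenizer, AutoModelForCausalLM
  
  # Load the fine-tuned tokenizer and model from the Hugging Face Hub
  tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")
  model = AutoModelForCausalLM.from_pretrained("lordChipotle/SimaQian")
  
  # A closed user turn followed by an open model turn for the reply
  prompt = """
  <start_of_turn>user
  Given the ancient text: 「子曰:學而時習之,不亦說乎?」
  1) Identify the era
  2) Provide the phonetic reading
  3) Translate into modern Chinese
  <end_of_turn>
  <start_of_turn>model
  """
  
  inputs = tokenizer(prompt, return_tensors="pt")
  
  # max_new_tokens limits only the generated reply; max_length would also count the prompt
  outputs = model.generate(**inputs, max_new_tokens=256)
  
  # The decoded string contains the prompt followed by the model's reply
  print(tokenizer.decode(outputs[0]))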
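
The decoded output includes the prompt and the special tokens as well as the model's answer. A minimal sketch for isolating the answer follows. The helper name extract_model_reply is hypothetical, and it assumes the decoded text uses the <start_of_turn>model and <end_of_turn> markers shown above and may end with Gemma's <eos> token.

```python
def extract_model_reply(decoded: str) -> str:
    """Return the text of the last model turn in a decoded generation."""
    marker = "<start_of_turn>model"
    # Keep everything after the final model-turn marker.
    reply = decoded.rsplit(marker, 1)[-1]
    # Cut at the end-of-turn marker if the model emitted one.
    reply = reply.split("<end_of_turn>", 1)[0]
    # Assumption: a trailing <eos> token may remain when the marker is absent.
    return reply.replace("<eos>", "").strip()


if __name__ == "__main__":
    sample = (
        "<bos><start_of_turn>user\nquestion\n<end_of_turn>\n"
        "<start_of_turn>model\nanswer<end_of_turn><eos>"
    )
    print(extract_model_reply(sample))
```

In the Usage snippet, this would be applied to the string returned by tokenizer.decode(outputs[0]).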


Limitations and Biases

	•	Era Estimation: The model may misidentify the historical era of a passage.
    
	•	Pronunciations: Reconstructions are approximate, and scholars do not always agree on a single reading.
    
	•	Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.