
	---
	datasets:
	- RUCAIBox/Erya-dataset
	language:
	- en
	base_model:
	- google/gemma-2-2b-it
	tags:
	- ancient-chinese
	- chinese
	- literature
	- unsloth
	- trl
	- sft
	---
	Ancient Chinese Translator + Phonology Model (SimaQian)

	Name Origin:

	The model is named after the ancient Chinese historian Sima Qian (司馬遷), known for his Records of the Grand Historian, a general history of China covering more than two thousand years.

	This model combines two key functionalities for Ancient Chinese texts:

	1. Translation: Converts Ancient Chinese passages into modern Chinese.

	2. Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).


	Model Description

	• Architecture: Fine-tuned from Google's Gemma 2 instruction-tuned model (google/gemma-2-2b-it) using LoRA.

	• Input Format: The special tokens <start_of_turn> and <end_of_turn> delimit the user and model turns.

	• Output: Era identification (optional), phonetic renderings, and modern Chinese translations.
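
The input format above can be sketched as a small prompt builder. This is a minimal illustration that mirrors the prompt layout in the Usage section; `build_prompt` and its parameter name are hypothetical and are not shipped with the model.

```python
def build_prompt(ancient_text: str) -> str:
    """Wrap a three-part request in Gemma-style turn markers (hypothetical helper)."""
    tasks = [
        "1) Identify the era",
        "2) Provide the phonetic reading",
        "3) Translate into modern Chinese",
    ]
    user_turn = "\n".join([f"Given the ancient text: 「{ancient_text}」", *tasks])
    # The trailing "<start_of_turn>model" line cues the model to begin its reply.
    return (
        f"<start_of_turn>user\n{user_turn}\n<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_prompt("子曰：學而時習之，不亦說乎？")
```

Building the prompt in one place keeps the turn markers consistent when the same request is sent for many passages.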

	Training Data
	• Translation: Erya dataset from RUCAIBox/Erya-dataset.
	• Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
	• Fine-Tuning: LoRA-based parameter-efficient approach on Gemma 2 Instruct.
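
To illustrate why LoRA counts as parameter-efficient: instead of updating a full weight matrix, it trains two low-rank factors. The sketch below compares the trainable parameter counts for a single matrix. The dimensions and rank are illustrative assumptions, not the values used to train this model, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in           # updating the d_out x d_in matrix W directly
    lora = rank * (d_out + d_in)  # updating B (d_out x rank) and A (rank x d_in)
    return full, lora


# Illustrative sizes only (assumed, not taken from the training setup).
full, lora = lora_param_counts(2048, 2048, 16)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
# prints "full: 4194304, LoRA: 65536, ratio: 1.5625%"
```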


	Usage


	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
	tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")
	model = AutoModelForCausalLM.from_pretrained("lordChipotle/SimaQian")

	# The user turn holds the request; the open model turn cues the reply.
	prompt = """<start_of_turn>user
	Given the ancient text: 「子曰：學而時習之，不亦說乎？」
	1) Identify the era
	2) Provide the phonetic reading
	3) Translate into modern Chinese
	<end_of_turn>
	<start_of_turn>model
	"""

	inputs = tokenizer(prompt, return_tensors="pt")

	# max_new_tokens bounds only the reply; max_length would also count the prompt tokens.
	outputs = model.generate(**inputs, max_new_tokens=256)

	print(tokenizer.decode(outputs[0]))
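
The decoded output contains the prompt as well as the reply. A minimal sketch for isolating the reply follows, assuming the decoded string still includes the turn markers; `extract_model_reply` is a hypothetical helper, not part of the model or of transformers.

```python
def extract_model_reply(decoded: str) -> str:
    """Return the text of the last model turn (hypothetical helper)."""
    marker = "<start_of_turn>model"
    # Keep everything after the last model-turn marker.
    reply = decoded.rsplit(marker, 1)[-1]
    # Drop the end-of-turn marker and anything after it, if present.
    reply = reply.split("<end_of_turn>", 1)[0]
    return reply.strip()
```

It could be applied as `extract_model_reply(tokenizer.decode(outputs[0]))`. If the model-turn marker is missing, the helper falls back to the text before the first `<end_of_turn>`.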


	Limitations and Biases

	• Era Estimation: The model may not always identify the historical era correctly.

	• Pronunciations: Reconstructions are approximate, and scholars do not always agree on them.

	• Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.