|
---
license: mit
datasets:
- p208p2002/wudao
language:
- zh
---
|
# Chinese TinyLlama |
|
|
|
A demo project that pretrains a TinyLlama model on Chinese corpora with minimal modification to the Hugging Face Transformers code. It serves as a use case demonstrating how to use the Hugging Face version of [TinyLlama](https://github.com/whyNLP/tinyllama) to pretrain a model on a large corpus.
|
|
|
See the [GitHub repo](https://github.com/whyNLP/tinyllama-zh) for more details.
|
|
|
## Usage |
|
|
|
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code is required because the model ships the THUDM/chatglm3-6b tokenizer
tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
```
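
Once the model and tokenizer are loaded, generation works like any other causal LM in Transformers. The sketch below is illustrative only: the prompt and decoding settings (`max_new_tokens`, `top_p`, `temperature`) are arbitrary choices for demonstration, not settings recommended by the project.

```python
# Minimal generation sketch -- the prompt and decoding settings are
# illustrative, not recommendations from the original project.
prompt = "北京是"  # "Beijing is ..."
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```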
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is trained on [WuDaoCorpora Text](https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab). The dataset contains about 45B tokens and the model is trained for 2 epochs. The training takes about 6 days on 8 A100 GPUs. |
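
As a rough sanity check on the corpus size, one can stream the `p208p2002/wudao` dataset listed in the metadata and tokenize a small sample. This is only a sketch under assumptions: that the dataset streams through the standard `datasets` API with a `train` split and stores its content in a `text` field (adjust to the actual schema).

```python
# Rough spot check of tokens per document on the WuDao corpus.
# Assumes a "train" split and a "text" field -- adjust to the actual schema.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
stream = load_dataset("p208p2002/wudao", split="train", streaming=True)

sample_docs = 1000
total_tokens = 0
for i, record in enumerate(stream):
    if i >= sample_docs:
        break
    total_tokens += len(tokenizer(record["text"])["input_ids"])

print("average tokens per document:", total_tokens / sample_docs)
```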
|
|
|
The model uses the `THUDM/chatglm3-6b` tokenizer from Hugging Face.
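
A quick way to verify this is to compare the tokenizer bundled with this model against `THUDM/chatglm3-6b` directly. The check below is a sketch assuming both tokenizers can be downloaded; both require `trust_remote_code=True`.

```python
# Sanity check: the bundled tokenizer should produce the same ids as THUDM/chatglm3-6b.
from transformers import AutoTokenizer

tok_model = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
tok_glm = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

sentence = "今天天气真不错。"  # "The weather is really nice today."
print("vocab size:", tok_model.vocab_size)
print("same ids:", tok_model(sentence)["input_ids"] == tok_glm(sentence)["input_ids"])
```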
|
|
|
- **Model type:** Llama |
|
- **Language(s) (NLP):** Chinese |
|
- **License:** MIT |
|
- **Finetuned from model:** TinyLlama-2.5T checkpoint
|
|
|
## Uses |
|
|
|
The model does not perform very well: its CMMLU score is only slightly above 25, which is close to random chance on a four-choice benchmark. For better performance, one may use a better corpus, e.g. [WanJuan](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0). Again, this project only serves as a demonstration of how to pretrain a TinyLlama on a large corpus.
|
|