---
license: mit
datasets:
- p208p2002/wudao
language:
- zh
---
# Chinese TinyLlama
A demo project that pretrains TinyLlama on Chinese corpora with minimal modification to the Hugging Face Transformers code. It serves as a use case demonstrating how to use the Hugging Face version of [TinyLlama](https://github.com/whyNLP/tinyllama) to pretrain a model on a large corpus.

See the [GitHub repo](https://github.com/whyNLP/tinyllama-zh) for more details.
## Usage
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
```
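Once loaded, the model can be used for standard causal-language-model generation. A minimal sketch (the prompt text and generation parameters here are illustrative, not part of the original project):

```python
# Load model directly, then generate a continuation for a Chinese prompt.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")

# Example prompt ("Artificial intelligence is"); any Chinese text works.
inputs = tokenizer("人工智能是", return_tensors="pt")

# Sample a short continuation; max_new_tokens and do_sample are illustrative choices.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that `trust_remote_code=True` is only needed for the tokenizer, which is borrowed from `THUDM/chatglm3-6b` (see Model Details below).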
## Model Details
### Model Description
This model was trained on [WuDaoCorpora Text](https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab). The dataset contains about 45B tokens, and the model was trained for 2 epochs. Training took about 6 days on 8 A100 GPUs.

The model uses the `THUDM/chatglm3-6b` tokenizer from Hugging Face.
- **Model type:** Llama
- **Language(s) (NLP):** Chinese
- **License:** MIT
- **Finetuned from model [optional]:** TinyLlama-2.5T checkpoint
## Uses
The model does not perform very well: its CMMLU score is only slightly above 25, barely better than random chance on a four-choice benchmark. For better performance, one may use a higher-quality corpus (e.g. [WanJuan](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0)). Again, this project only serves as a demonstration of how to pretrain a TinyLlama on a large corpus.