|
---
license: mit
datasets:
- p208p2002/wudao
language:
- zh
---
|
# Chinese TinyLlama |
|
|
|
A demo project that pretrains a TinyLlama model on Chinese corpora with minimal modification to the Hugging Face Transformers code. It serves as a use case demonstrating how to use the Hugging Face version of [TinyLlama](https://github.com/whyNLP/tinyllama) to pretrain a model on a large corpus.
|
|
|
See the [GitHub repo](https://github.com/whyNLP/tinyllama-zh) for more details.
|
|
|
## Usage |
|
|
|
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code is required because the model ships the THUDM/chatglm3-6b tokenizer
tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
```
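
Once the model and tokenizer are loaded, generation works like any other causal LM in Transformers. The sketch below is illustrative only: the prompt and decoding settings (`max_new_tokens`, `top_p`, `temperature`) are arbitrary choices for demonstration, not settings recommended by the project.

```python
# Minimal generation sketch -- the prompt and decoding settings are
# illustrative, not recommendations from the original project.
prompt = "北京是"  # "Beijing is ..."
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```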
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is trained on [WuDaoCorpora Text](https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab). The dataset contains about 45B tokens and the model is trained for 2 epochs. The training takes about 6 days on 8 A100 GPUs. |
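
As a rough sanity check on the corpus size, one can stream the `p208p2002/wudao` dataset listed in the metadata and tokenize a small sample. This is only a sketch under assumptions: that the dataset streams through the standard `datasets` API with a `train` split and stores its content in a `text` field (adjust to the actual schema).

```python
# Rough spot check of tokens per document on the WuDao corpus.
# Assumes a "train" split and a "text" field -- adjust to the actual schema.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
stream = load_dataset("p208p2002/wudao", split="train", streaming=True)

sample_docs = 1000
total_tokens = 0
for i, record in enumerate(stream):
    if i >= sample_docs:
        break
    total_tokens += len(tokenizer(record["text"])["input_ids"])

print("average tokens per document:", total_tokens / sample_docs)
```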
|
|
|
The model uses the `THUDM/chatglm3-6b` tokenizer from Hugging Face.
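
A quick way to verify this is to compare the tokenizer bundled with this model against `THUDM/chatglm3-6b` directly. The check below is a sketch assuming both tokenizers can be downloaded; both require `trust_remote_code=True`.

```python
# Sanity check: the bundled tokenizer should produce the same ids as THUDM/chatglm3-6b.
from transformers import AutoTokenizer

tok_model = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
tok_glm = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

sentence = "今天天气真不错。"  # "The weather is really nice today."
print("vocab size:", tok_model.vocab_size)
print("same ids:", tok_model(sentence)["input_ids"] == tok_glm(sentence)["input_ids"])
```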
|
|
|
- **Model type:** Llama |
|
- **Language(s) (NLP):** Chinese |
|
- **License:** MIT |
|
- **Finetuned from model:** TinyLlama-2.5T checkpoint
|
|
|
## Uses |
|
|
|
The model does not perform very well: its CMMLU score is only slightly above 25, which is close to random chance on a four-choice benchmark. For better performance, one may use a better corpus, e.g. [WanJuan](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0). Again, this project only serves as a demonstration of how to pretrain a TinyLlama on a large corpus.
|
|