---
license: mit
datasets:
- p208p2002/wudao
language:
- zh
---
# Chinese TinyLlama

A demo project that pretrains a TinyLlama model on Chinese corpora with minimal modifications to the Hugging Face transformers code. It serves as a use case demonstrating how to use the Hugging Face version of [TinyLlama](https://github.com/whyNLP/tinyllama) to pretrain a model on a large corpus.

See the [Github Repo](https://github.com/whyNLP/tinyllama-zh) for more details.

## Usage

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code is required because the model reuses the ChatGLM3 tokenizer,
# which ships custom tokenizer code
tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
```
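
Once loaded, the model can be used like any other causal LM in `transformers`. Below is a minimal generation sketch; the prompt and decoding settings are illustrative choices, not taken from the original project:

```python
import torch

# Continue from the snippet above: `tokenizer` and `model` are already loaded
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

prompt = "中国的首都是"  # "The capital of China is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```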

## Model Details

### Model Description

This model was trained on [WuDaoCorpora Text](https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab), which contains about 45B tokens, for 2 epochs. Training took about 6 days on 8 A100 GPUs.

The model uses the `THUDM/chatglm3-6b` tokenizer from Hugging Face.

- **Model type:** Llama
- **Language(s) (NLP):** Chinese
- **License:** MIT
- **Finetuned from model:** TinyLlama-2.5T checkpoint
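
Since the model reuses the ChatGLM3 vocabulary, the size of its input embedding matrix should match that tokenizer. A quick sanity check, as a sketch assuming only the standard `transformers` APIs (the embedding matrix may be padded beyond the raw vocabulary size):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")

# The number of embedding rows should be at least the tokenizer's vocabulary size
print("tokenizer vocab size:", len(tokenizer))
print("embedding rows:", model.get_input_embeddings().weight.shape[0])
```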

## Uses

The model does not perform particularly well: its CMMLU score is only slightly above 25, i.e. barely better than random guessing on a four-choice benchmark. For better performance, one may use a higher-quality corpus such as [WanJuan](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0). Again, this project only serves as a demonstration of how to pretrain a TinyLlama on a large corpus.