---
license: apache-2.0
---
|
|
|
# TinyLlama + Japanese |
|
|
|
A model obtained by continually pretraining TinyLlama 1.1B on Japanese text.
|
|
|
|
|
### Base Model |
|
|
|
[TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) |
|
|
|
### Tokenizer
|
|
|
[elyza/ELYZA-japanese-Llama-2-7b](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b)
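
For reference, a minimal inference sketch with Hugging Face Transformers is shown below. The tokenizer repository id comes from the link above; the model path, dtype, and generation settings are assumptions, and the checkpoint is assumed to have been exported to the Hugging Face format.

```python
# Minimal sketch: load the checkpoint with Hugging Face Transformers and generate.
# Assumptions: the weights are available in HF format, and
# "path/to/tinyllama-japanese" is a placeholder, not an actual repository id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tokenizer taken from the ELYZA repository referenced above.
tokenizer = AutoTokenizer.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b")

# Placeholder path to the continually pretrained TinyLlama checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/tinyllama-japanese",
    torch_dtype=torch.bfloat16,
)

prompt = "日本で一番高い山は"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```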
|
|
|
|
|
### Training Dataset |
|
|
|
Around 9B tokens in total, drawn from the following sources (a loading sketch follows the list).
|
|
|
- izumi-lab/wikipedia-ja-20230720 |
|
- if001/oscar_2023_filtered |
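
A minimal sketch for streaming these sources with the Hugging Face `datasets` library is below; the `train` split and streaming mode are assumptions, and the record schemas of the two datasets may differ.

```python
# Sketch: stream the two training sources with the Hugging Face `datasets` library.
# Assumptions: both repositories expose a "train" split; schemas may differ.
from datasets import load_dataset

wiki_ja = load_dataset("izumi-lab/wikipedia-ja-20230720", split="train", streaming=True)
oscar_ja = load_dataset("if001/oscar_2023_filtered", split="train", streaming=True)

# Peek at a couple of records from each source.
for name, ds in [("wikipedia-ja", wiki_ja), ("oscar", oscar_ja)]:
    for example in ds.take(2):
        print(name, example)
```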
|
|
|
### Validation Dataset |
|
|
|
- izumi-lab/wikinews-ja-20230728 |
|
- izumi-lab/wikinews-en-20230728 |
|
- if001/aozorabunko-clean-sin |
|
|
|
|
|
### Evaluation |
|
|
|
We have not performed a formal evaluation.
|
|
|
|
|
### Acknowledgement |
|
|
|
We are grateful to those who prepared the valuable datasets listed above and to the developers of [lit-gpt](https://github.com/Lightning-AI/lit-gpt).
|
|
|
|
|
|
|
|