leonardlin committed
Commit: bc9268e
Parent(s): 60a80d6
Update README.md
README.md CHANGED

@@ -6,7 +6,7 @@ language:
 ---
 # shisa-base-7b-v1
 
-`shisa-base-7b-v1` takes [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and adds an additional 8B tokens of primarily Japanese pre-training. Japanese tokens were sourced from [MADLAD-400](https://
+`shisa-base-7b-v1` takes [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and adds an additional 8B tokens of primarily Japanese pre-training. Japanese tokens were sourced from [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400), using [DSIR](https://github.com/p-lambda/dsir), along with 10% English tokens sampled from a mix of MADLAD-400 EN and various open datasources added in to prevent catastrophic forgetting.
 
 We have extended the Mistral tokenizer to 120k tokens to improve Japanese efficiency. Our tokenizer achieves ~2.3 characters per token in JA, versus the base Mistral 7B tokenizer which is <1 character per token. Code for our implementation is available in our [Shisa repo](https://github.com/AUGMXNT/shisa).
 
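For context on the tokenizer-efficiency claim in the updated README, below is a minimal sketch (not part of this commit) of how a characters-per-token figure can be estimated with the standard Hugging Face `AutoTokenizer` API. The repo IDs and the Japanese sample sentence are assumptions for illustration; the actual measurement methodology lives in the Shisa repo linked above.

```python
# Minimal sketch: estimate average characters per token for two tokenizers.
# Repo IDs and sample text are illustrative assumptions, not from the commit.
from transformers import AutoTokenizer


def chars_per_token(tokenizer_name: str, text: str) -> float:
    """Return the average number of characters encoded per token."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    ids = tok.encode(text, add_special_tokens=False)
    return len(text) / len(ids)


# Arbitrary Japanese sample text (opening of "I Am a Cat").
ja_sample = "吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"

# Extended Shisa tokenizer vs. the base Mistral 7B tokenizer.
for name in ["augmxnt/shisa-base-7b-v1", "mistralai/Mistral-7B-v0.1"]:
    print(f"{name}: {chars_per_token(name, ja_sample):.2f} chars/token")
```

A larger, more representative Japanese corpus would be needed to reproduce the ~2.3 vs. <1 characters-per-token comparison; a single sentence only illustrates the calculation.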