leonardlin committed fc0e492 (parent: 6deb185): "title"

README.md
---
language:
- en
- ja
---

# shisa-base-7b-v1

`shisa-base-7b-v1` takes [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and continues pre-training on an additional 8B tokens, primarily Japanese. The Japanese tokens were sourced from [MADLAD-400](https://github.com/google-research/google-research/tree/master/madlad_400) using [DSIR](https://github.com/p-lambda/dsir); roughly 10% English tokens, sampled from a mix of MADLAD-400 EN and various open data sources, were mixed in to prevent catastrophic forgetting.
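
DSIR scores each candidate document with an importance weight, the likelihood ratio between a target-distribution model and a raw-pool model fit over hashed n-gram features, and then resamples documents in proportion to those weights. The sketch below is a minimal, self-contained illustration of that idea only; the character n-gram featurization, bucket count, and toy document pools are illustrative assumptions, not the actual shisa pipeline, which uses the DSIR library linked above.

```python
# Minimal sketch of the DSIR idea (importance resampling over hashed n-gram
# features). NOT the shisa data pipeline: featurization, bucket count, and the
# toy pools below are assumptions; see https://github.com/p-lambda/dsir.
import hashlib

import numpy as np

NUM_BUCKETS = 10_000  # hashed n-gram feature dimension (assumption)


def hashed_char_ngrams(text: str, sizes=(2, 3), buckets: int = NUM_BUCKETS) -> np.ndarray:
    """Count character n-grams, hashed into a fixed number of buckets."""
    counts = np.zeros(buckets)
    for n in sizes:
        for i in range(len(text) - n + 1):
            h = int(hashlib.md5(text[i:i + n].encode("utf-8")).hexdigest(), 16)
            counts[h % buckets] += 1.0
    return counts


def fit_bucket_distribution(texts) -> np.ndarray:
    """Smoothed categorical distribution over hashed n-gram buckets."""
    counts = np.ones(NUM_BUCKETS)  # add-one smoothing avoids log(0)
    for t in texts:
        counts += hashed_char_ngrams(t)
    return counts / counts.sum()


# Toy target (Japanese) and raw (mixed) pools.
target_pool = ["日本語のウェブテキストの例です。", "これは対象分布からのサンプルです。"]
raw_pool = [
    "Some English web text that should score low.",
    "日本語のテキストはここにもあります。",
    "More English filler text.",
]

p_target = fit_bucket_distribution(target_pool)
p_raw = fit_bucket_distribution(raw_pool)

# Log importance weight: log p_target(x) - log p_raw(x) under the bag-of-ngrams model.
log_w = np.array([
    hashed_char_ngrams(t) @ (np.log(p_target) - np.log(p_raw)) for t in raw_pool
])

# Gumbel-top-k: sample k documents without replacement, proportional to exp(log_w).
k = 1
gumbel = np.random.gumbel(size=len(raw_pool))
selected = np.argsort(-(log_w + gumbel))[:k]
print([raw_pool[i] for i in selected])  # the Japanese document should usually win
```
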
We have extended the Mistral tokenizer to 120k tokens to improve Japanese efficiency. Our tokenizer achieves ~2.3 characters per token on Japanese text, versus <1 character per token for the base Mistral 7B tokenizer. Code for our implementation is available in our [Shisa repo](https://github.com/AUGMXNT/shisa).
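
The characters-per-token figure can be checked by tokenizing a Japanese sample with both tokenizers and dividing character count by token count. A rough sketch follows, assuming the extended tokenizer is published with this model under the Hugging Face id `augmxnt/shisa-base-7b-v1`; the sample sentence is illustrative, and a larger corpus slice gives a more stable estimate.

```python
from transformers import AutoTokenizer

# Compare JA tokenizer efficiency (characters per token) between the base
# Mistral tokenizer and the extended 120k-vocab tokenizer. The repo id for the
# extended tokenizer is an assumption; substitute the actual id if it differs.
sample_ja = "自然言語処理の研究は近年、急速に進歩しています。"

for repo_id in ("mistralai/Mistral-7B-v0.1", "augmxnt/shisa-base-7b-v1"):
    tok = AutoTokenizer.from_pretrained(repo_id)
    ids = tok.encode(sample_ja, add_special_tokens=False)
    print(f"{repo_id}: {len(ids)} tokens, {len(sample_ja) / len(ids):.2f} chars/token")
```
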