augmxnt/shisa-base-7b-v1

@@ -29,7 +29,7 @@ Here we also compare `shisa-base-7b-v1` to other recently-released similar class
 ![7B llm-jp-eval Performance](https://huggingface.co/augmxnt/mistral-7b-ja-v0.1/resolve/main/llm-jp-eval.ja.png)
 ## Tokenizer
-As mentioned in the introduction, our tokenizer is an extended version of the Mistral 7B tokenizer, with a vocab size of 120073 and aligned to 120128 for better performance. The remaining unused tokens are assigned as average-weighted `<|extra_{idx}|>` tokens.
 We use the "Fast" tokenizer, which should be the default for `AutoTokenizer`, but if you have problems, make sure to check `tokenizer.is_fast` or to initialize with `use_fast=True`.
@@ -83,6 +83,9 @@ Thanks to the [ELYZA](https://huggingface.co/elyza) team for publishing the deta
 And of course, thanks to [Mistral AI](https://huggingface.co/mistralai) for releasing such a strong base model!
 ---
 `shisa-base-7b-v1` builds on [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and adds a further 8 billion tokens of primarily Japanese pre-training. The Japanese tokens were taken from [MADLAD-400](https://github.com/google-research/google-research/tree/master/madlad_400) using [DSIR](https://github.com/p-lambda/dsir). In addition, 10% English tokens from MADLAD-400 EN and various open data sources were mixed in to prevent catastrophic forgetting.
@@ -104,7 +107,7 @@ Mistralのトークン化器を12万トークンまで拡張し、日本語の
 ![7B llm-jp-eval Performance]()
 ## Tokenizer
-As mentioned in the introduction, our tokenizer is an extended version of the Mistral 7B tokenizer, with a vocab size of 120073, aligned to 120128. The remaining unused tokens are assigned as average-weighted `<|extra_{idx}|>` tokens.
 We use the "Fast" tokenizer, which should be the default for `AutoTokenizer`, but if you have problems, check `tokenizer.is_fast` or initialize with `use_fast=True`.

 ![7B llm-jp-eval Performance](https://huggingface.co/augmxnt/mistral-7b-ja-v0.1/resolve/main/llm-jp-eval.ja.png)
 ## Tokenizer
+As mentioned in the introduction, our tokenizer is an extended version of the Mistral 7B tokenizer, with a vocab size of 120073 and aligned to 120128 for better performance. The remaining unused tokens are assigned as zero-weighted `<|extra_{idx}|>` tokens.
 We use the "Fast" tokenizer, which should be the default for `AutoTokenizer`, but if you have problems, make sure to check `tokenizer.is_fast` or to initialize with `use_fast=True`.
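
The vocabulary padding described above can be checked with a little arithmetic. This is a minimal sketch, not the authors' script: the alignment of 64 is an assumption inferred from the fact that 120128 = 1877 × 64 (the text only says "aligned"), and numbering the `<|extra_{idx}|>` tokens from 0 is likewise hypothetical.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple` (ceiling division)."""
    return -(-n // multiple) * multiple


VOCAB_SIZE = 120073  # tokenizer vocabulary size stated in the README
ALIGNMENT = 64       # assumption: inferred from 120128 == 1877 * 64

padded_size = pad_to_multiple(VOCAB_SIZE, ALIGNMENT)
n_extra = padded_size - VOCAB_SIZE

# Placeholder names for the unused slots; starting idx at 0 is an assumption.
extra_tokens = [f"<|extra_{idx}|>" for idx in range(n_extra)]

print(padded_size, n_extra)  # 120128 55
```

Under that assumption, 55 embedding rows are left over for the `<|extra_{idx}|>` placeholders.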
 And of course, thanks to [Mistral AI](https://huggingface.co/mistralai) for releasing such a strong base model!
 ---
+*(Translated by GPT-4)*
+# shisa-base-7b-v1
 `shisa-base-7b-v1` builds on [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and adds a further 8 billion tokens of primarily Japanese pre-training. The Japanese tokens were taken from [MADLAD-400](https://github.com/google-research/google-research/tree/master/madlad_400) using [DSIR](https://github.com/p-lambda/dsir). In addition, 10% English tokens from MADLAD-400 EN and various open data sources were mixed in to prevent catastrophic forgetting.
 ![7B llm-jp-eval Performance]()
 ## Tokenizer
+As mentioned in the introduction, our tokenizer is an extended version of the Mistral 7B tokenizer, with a vocab size of 120073, aligned to 120128. The remaining unused tokens are assigned as zero-weighted `<|extra_{idx}|>` tokens.
 We use the "Fast" tokenizer, which should be the default for `AutoTokenizer`, but if you have problems, check `tokenizer.is_fast` or initialize with `use_fast=True`.
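
The tokenizer check mentioned above could look roughly like the following. This is a sketch under stated assumptions: the repository id is taken from this page's title, `require_fast` and `load_fast_tokenizer` are hypothetical helper names, and loading requires the `transformers` package plus network access to the Hub.

```python
MODEL_ID = "augmxnt/shisa-base-7b-v1"  # repository id as shown on this page


def require_fast(tokenizer):
    """Return the tokenizer unchanged, or raise if it is not a "Fast" tokenizer."""
    if not getattr(tokenizer, "is_fast", False):
        raise RuntimeError(
            "Expected a Fast tokenizer; try AutoTokenizer.from_pretrained(..., use_fast=True)"
        )
    return tokenizer


def load_fast_tokenizer(model_id: str = MODEL_ID):
    """Load the tokenizer, explicitly requesting the Fast implementation."""
    # Imported lazily so require_fast() stays usable without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    return require_fast(tokenizer)
```

Calling `load_fast_tokenizer()` downloads the tokenizer files on first use; passing `use_fast=True` explicitly simply guards against environments where the default differs.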