augmxnt/shisa-base-7b-v1

metadata

license: apache-2.0
language:
  - en
  - ja

This model continues the pre-training of Mistral 7B on an additional 8B tokens of primarily Japanese text. The Japanese tokens were selected from MADLAD-400 using DSIR (Data Selection with Importance Resampling). The remaining 10% of the tokens are English, sampled from a mix of MADLAD-400 EN and various open data sources, and were added to prevent catastrophic forgetting.
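The DSIR step mentioned above can be illustrated with a small sketch. This is not the pipeline used for this model, whose configuration is not described here. It only shows the general idea of DSIR: fit simple bag-of-hashed-n-gram models on a target sample and on the raw pool, score each raw document by its log importance weight, and resample without replacement using the Gumbel top-k trick. The function names, `NUM_BUCKETS`, add-one smoothing, and the use of character n-grams (chosen so the sketch works on unsegmented Japanese) are all assumptions made for illustration.

```python
import math
import random
import zlib
from collections import Counter

NUM_BUCKETS = 4096  # hypothetical size of the hashed feature space


def hashed_ngrams(text, n_max=2):
    """Count character 1..n_max-grams, hashed into NUM_BUCKETS buckets."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            counts[zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS] += 1
    return counts


def fit_log_probs(docs):
    """Fit a bag-of-hashed-n-grams model with add-one smoothing."""
    totals = Counter()
    for doc in docs:
        totals.update(hashed_ngrams(doc))
    denom = sum(totals.values()) + NUM_BUCKETS
    return [math.log((totals[b] + 1) / denom) for b in range(NUM_BUCKETS)]


def log_importance_weight(doc, log_p_target, log_p_raw):
    """log(p_target(doc) / p_raw(doc)) under the two bag-of-n-grams models."""
    feats = hashed_ngrams(doc)
    return sum(c * (log_p_target[b] - log_p_raw[b]) for b, c in feats.items())


def dsir_select(raw_docs, target_docs, k, seed=0):
    """Pick k raw documents, favouring those that look like the target set."""
    log_p_target = fit_log_probs(target_docs)
    log_p_raw = fit_log_probs(raw_docs)
    rng = random.Random(seed)
    scored = []
    for doc in raw_docs:
        log_w = log_importance_weight(doc, log_p_target, log_p_raw)
        # Gumbel noise turns top-k selection into sampling without replacement.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        scored.append((log_w + gumbel, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```

With a Japanese target sample, Japanese documents in the raw pool receive higher importance weights than English ones, so they are more likely to be selected. The noise term keeps the selection stochastic rather than a hard top-k cut.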

We have extended the Mistral tokenizer's vocabulary to ~120k tokens to improve its efficiency on Japanese. Our tokenizer achieves ~2.4 characters per token on Japanese text, whereas the base tokenizer reaches < 1 character per token. The code and approach are available in our Shisa repo.
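The characters-per-token figure is simply the length of a text divided by the number of tokens it is split into. The sketch below shows the calculation. The helper names are hypothetical, and the byte-level tokenizer is a stand-in that models the worst case for a byte-fallback vocabulary, where every Japanese character not in the vocabulary is split into its UTF-8 bytes. It is not the actual Mistral tokenizer, which does contain some Japanese characters.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average number of characters covered by each token."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text) / len(tokens)


def utf8_byte_tokenize(text: str) -> list:
    """Stand-in for full byte fallback: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


sample = "日本語のトークナイザー"  # 11 characters, 3 bytes each in UTF-8
print(round(chars_per_token(sample, utf8_byte_tokenize), 2))  # prints 0.33
```

To measure a real tokenizer, any callable that maps a string to a list of tokens can be passed in. For example, the `tokenize` method of a tokenizer loaded with Hugging Face `AutoTokenizer.from_pretrained` would work, assuming the repository ID shown on this page.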

We created this model as the base for Shisa 7B, our JA/EN fine-tuned model, but we are releasing it to the community under the open Apache 2.0 license because it is, as far as we know, currently the only Mistral base model with additional Japanese pre-training.

(No stop tokens were trained into this base model, so it does not benchmark well. However, we have validated with ablations on fine-tuned models that using this base model outperforms raw Mistral 7B on Japanese-language performance.)

Training took 2,400 A100-40 GPU hours on a single 16 x A100-40 machine with DeepSpeed ZeRO-3.
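As a back-of-the-envelope reading of these figures, the snippet below derives the implied wall-clock time and per-GPU throughput. It assumes that all 16 GPUs were busy for the whole run, that the 2,400 GPU hours cover exactly the 8B-token pre-training run, and that "8B" means 8e9 tokens. None of these is stated explicitly, so treat the results as rough estimates.

```python
GPU_HOURS = 2_400      # total A100-40 GPU hours reported above
NUM_GPUS = 16          # GPUs in the single training machine
EXTRA_TOKENS = 8e9     # assumption: "8B tokens" taken as exactly 8e9

# Wall-clock time if all GPUs ran for the whole job (assumption).
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

# Average training throughput per GPU under the same assumptions.
tokens_per_gpu_second = EXTRA_TOKENS / (GPU_HOURS * 3600)

print(wall_clock_hours)              # prints 150.0
print(wall_clock_days)               # prints 6.25
print(round(tokens_per_gpu_second))  # prints 926
```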

Acknowledgements

Training and data: Jon Durbin

Compute for this model was generously sponsored by AKA Virtual (Tokyo, Japan).

Thanks to the ELYZA team for publishing the details of their tokenizer extension approach, which we used as our starting point.

And of course, thanks to Mistral AI for releasing the base model.

このモデルは、Mistral 7Bをベースに、主に日本語からなる8Bトークンの追加事前学習を行ったものです。日本語トークンはMADLAD-400からDSIRを用いて選別しました。また、破滅的忘却を防ぐため、全体の10%はMADLAD-400 ENやさまざまなオープンデータソースからサンプリングした英語トークンとしています。

私たちは、Mistralのトークナイザーを約12万トークンに拡張し、日本語の効率を向上させました。私たちのトークナイザーは、1トークンあたり約2.4文字を達成していますが、ベースのトークナイザーは1文字未満になります。このコードとアプローチは、私たちのShisaリポジトリにて公開しています。

このモデルは、私たちの日本語/英語ファインチューニングモデルShisa 7Bのために作成されましたが、私たちの知る限り、日本語の追加事前学習を行ったMistralベースモデルは現在のところこれが唯一であるため、コミュニティのために（Apache 2.0ライセンスで）提供しています。

（このベースモデルにはストップトークンが学習されていないため、ベンチマークの結果は良くありませんが、ファインチューニングしたモデルでのアブレーション検証により、このベースモデルを使うと日本語性能において素のMistral 7Bを上回ることを確認しています。）

トレーニングにはDeepSpeed ZeRO-3を使用し、16基のA100-40を搭載した1台のマシンで2,400 A100-40 GPU時間を要しました。

謝辞

トレーニングとデータ: Jon Durbin氏

本モデルの計算リソースは、AKA Virtual (東京、日本) から寛大に提供されました。

トークナイザーの拡張手法の詳細を公開したELYZAチームに感謝します。私たちはこれをスタートポイントとして利用しました。

そしてもちろん、ベースモデルをリリースしたMistral AIに感謝します。