augmxnt/shisa-base-7b-v1

@@ -4,42 +4,150 @@ language:
 - en
 - ja
 ---
-This model takes [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and adds an additional 8B tokens of primarily Japanese pre-training. Japanese tokens were sourced from [MADLAD-400](https://github.com/google-research/google-research/tree/master/madlad_400), using [DSIR](https://github.com/p-lambda/dsir), along with 10% English tokens sampled from a mix of MADLAD-400 EN and various open datasources added in to prevent catastrophic forgetting.
-We have extended the Mistral tokenizer to ~120k tokens to improve Japanese efficiency.  Our tokenizer achieves ~2.4 characters per token, whereas the base tokenizer reaches < 1 character per token.  The code and approach is aviable in our [Shisa repo](https://github.com/AUGMXNT/shisa).
-This model was created for use with [Shisa 7B](https://huggingface.co/augmxnt/shisa-7b-v1), our JA/EN fine-tuned model, but is provided for the community (with an open Apache 2.0 license) since it is, as far as we know, currently the only Mistral base model with additional JA pretraining.
-(There are no stop tokens trained into this base model so it doesn't benchmark well but we have validated with ablations on fine-tuned models that using this base model outperforms raw Mistral 7B for Japanese language performance.)
 Training took 2,400 A100-40 GPU hours on a single 16 x A100-40 machine with [DeepSpeed](https://github.com/microsoft/DeepSpeed) ZeRO-3.
 ## Acknowledgements
-Training and data: Jon Durbin
 Compute for this model was generously sponsored by [AKA Virtual](https://akavirtual.com/) (Tokyo, Japan).
-Thanks to the [ELYZA](https://huggingface.co/elyza) team for publishing the details of their [tokenizer extension approach](https://zenn.dev/elyza/articles/2fd451c944649d) which we used as our starting point.
-And of course, thanks to the [Mistral AI](https://huggingface.co/mistralai) for releasing the base model.
 ---
-このモデルは[Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)を基に、主に日本語の事前学習トークンが追加された8Bトークンも含みます。日本語トークンは[MADLAD-400](https://github.com/google-research/google-research/tree/master/madlad_400)から取得され、[DSIR](https://github.com/p-lambda/dsir)が用いられました。また、全体の10%はMADLAD-400の英語トークンやさまざまなオープンデータソースから抽出された英語トークンを組み合わせて追加されました。これは、過度の忘却を防ぐための措置です。
-私たちは、Mistralのトークナイザーを約12万トークンに拡張し、日本語の効率を向上させました。私たちのトークナイザーは、1トークンあたり約2.4文字を達成していますが、ベースのトークナイザーは1文字未満になります。このコードとアプローチは、私たちの[Shisaリポジトリ](https://github.com/AUGMXNT/shisa)にて公開しています。
-このモデルは、私たちの日本語/英語ファインチューニングモデル[Shisa 7B](https://huggingface.co/augmxnt/shisa-7b-v1)のために作成されましたが、現在のところ、Mistral基本モデルと日本語の事前学習を追加したベースモデルはこれが唯一なので、コミュニティのために(Apache 2.0ライセンスで)提供しています。
-（この基本モデルには「ストップトークン」が訓練されていないため、ベンチマークの結果が期待通りに出ないかもしれませんが、ファインチューニングによる検証から、このモデルは日本語性能においてオリジナルのミストラル7Bを上回っています。）
-トレーニングには[DeepSpeed](https://github.com/microsoft/DeepSpeed) ZeRO-3を使用し、1台のA100-40マシン16台で2,400 A100-40 GPU時間を要しました。
 ## 謝辞
-トレーニングとデータ: Jon Durbin氏
-本モデルの計算リソースは、[AKA Virtual](https://akavirtual.com/) (東京、日本) から寛大に提供されました。
-[トークナイザーの拡張手法](https://zenn.dev/elyza/articles/2fd451c944649d)の詳細を公開した[ELYZA](https://huggingface.co/elyza)チームに感謝します。私たちはこれをスタートポイントとして利用しました。
-そしてもちろん、ベースモデルをリリースした[Mistral AI](https://huggingface.co/mistralai)に感謝します。

 - en
 - ja
 ---
+`mistral-7b-ja-v0.1` takes [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and adds 8B tokens of primarily Japanese pre-training. Japanese tokens were sourced from [MADLAD-400](https://github.com/google-research/google-research/tree/master/madlad_400) using [DSIR](https://github.com/p-lambda/dsir); 10% English tokens, sampled from a mix of MADLAD-400 EN and various open data sources, were added to prevent catastrophic forgetting.
+We have extended the Mistral tokenizer to ~120K tokens to improve Japanese efficiency. Our tokenizer achieves ~2.3 characters per token in JA, versus <1 character per token for the base Mistral 7B tokenizer. Code for our implementation is available in our [Shisa repo](https://github.com/AUGMXNT/shisa).
+This base model was created for use with [Shisa 7B](https://huggingface.co/augmxnt/shisa-7b-v1), our JA/EN fine-tuned model, but we provide it for the community as we believe the combination of strong performance and an efficient bilingual tokenizer could be useful.
 Training took 2,400 A100-40 GPU hours on a single 16 x A100-40 machine with [DeepSpeed](https://github.com/microsoft/DeepSpeed) ZeRO-3.
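As a rough sanity check on these figures, the sketch below converts the quoted totals into wall-clock time and an average throughput. It assumes all 16 GPUs were busy for the whole run; the throughput is derived from the totals and is not a measured value.

```python
# Figures quoted in the text above.
GPU_HOURS = 2_400          # total A100-40 GPU hours
NUM_GPUS = 16              # GPUs in the single training machine
TOKENS = 8_000_000_000     # additional pre-training tokens

# Assumption: every GPU was in use for the full duration of the run.
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

# Average throughput implied by the totals (derived, not measured).
tokens_per_gpu_second = TOKENS / (GPU_HOURS * 3600)

print(f"wall clock: {wall_clock_hours:.0f} h (~{wall_clock_days:.2f} days)")
print(f"throughput: ~{tokens_per_gpu_second:.0f} tokens/s per GPU")
```

Under that assumption the run corresponds to roughly 150 hours of wall-clock time.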
+## Performance
+This base model was able to attain class-leading Japanese performance in standardized benchmarks with significantly less additional pre-training than previously released models. We believe this may be due to the use of a better-curated pre-training dataset, but ablations at even 2.5B additional JA tokens still showed very strong Japanese performance.
+We used a slightly modified [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval): our base model requires a `bos_token` to be prepended to the prompt, so we tested the other models with and without this modification and took the higher result for every model tested. Here we compare against the original Mistral 7B base model and [Japanese Stable LM Instruct Gamma 7B](https://huggingface.co/stabilityai/japanese-stablelm-instruct-gamma-7b), which is a Mistral 7B base with an additional 100B tokens of JA/EN pre-training. As a reference, we also include [Japanese-StableLM-Base-Beta-70B](https://huggingface.co/stabilityai/japanese-stablelm-base-beta-70b), a Llama 2 70B model that also has an additional 100B tokens of JA/EN pre-training:
+![Mistral llm-jp-eval Comparison]()
+Here we also compare `mistral-7b-ja-v0.1` to other recently released Japanese-tuned models of a similar class (7B parameters). [ELYZA 7B fast model](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b-fast) and [Youri 7B](https://huggingface.co/rinna/youri-7b) are Llama 2 7B models with 18B and 40B tokens of additional pre-training respectively, and [CALM2-7B](https://huggingface.co/cyberagent/calm2-7b) and [llm-jp-13b]() are pretrained models with 1.3T and 300B JA/EN tokens of pre-training respectively:
+![7B llm-jp-eval Performance]()
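The evaluation tweak described above (prepending a `bos_token`, then keeping the higher of the two results) can be sketched as follows. This is a minimal illustration, not the actual llm-jp-eval patch: `with_bos`, `best_of_both`, and `toy_score` are hypothetical names, and `"<s>"` is assumed to be the BOS token string.

```python
from typing import Callable


def with_bos(prompt: str, bos_token: str) -> str:
    """Prepend the BOS token unless the prompt already starts with it."""
    return prompt if prompt.startswith(bos_token) else bos_token + prompt


def best_of_both(score_fn: Callable[[str], float], prompt: str, bos_token: str) -> float:
    """Score the prompt with and without BOS and keep the higher result."""
    return max(score_fn(prompt), score_fn(with_bos(prompt, bos_token)))


def toy_score(prompt: str) -> float:
    """Hypothetical scorer standing in for a real benchmark run."""
    return 1.0 if prompt.startswith("<s>") else 0.5


print(with_bos("質問: 日本の首都は?", "<s>"))
print(best_of_both(toy_score, "質問: 日本の首都は?", "<s>"))
```

In a real harness `score_fn` would run the benchmark for one model; taking the maximum per model mirrors the "higher results" rule stated in the text.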
+## Tokenizer
+As mentioned in the introduction, our tokenizer is an extended version of the Mistral 7B tokenizer, with a vocab size of 120073, aligned to 128K. The remaining unused tokens are assigned as average-weighted `<|extra_{idx}|>` tokens.
+We use the "Fast" tokenizer, which should be the default for `AutoTokenizer`, but if you have problems, make sure to check `tokenizer.is_fast` or to initialize with `use_fast=True`.
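A minimal sketch of that check, assuming the Hugging Face `transformers` library is installed. The helper names `require_fast` and `load_fast_tokenizer` are illustrative, and the repo ID in the usage comment is an assumption taken from the page header.

```python
def require_fast(tokenizer) -> None:
    """Raise if the tokenizer is not the Rust-backed "Fast" implementation."""
    if not getattr(tokenizer, "is_fast", False):
        raise RuntimeError("expected a fast tokenizer; try use_fast=True")


def load_fast_tokenizer(repo_id: str):
    """Load a tokenizer with use_fast=True and verify the result."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
    require_fast(tokenizer)
    return tokenizer


# Usage (downloads the tokenizer files; repo ID is an assumption):
# tokenizer = load_fast_tokenizer("augmxnt/shisa-base-7b-v1")
```

Failing loudly here is a design choice: a silent fallback to the slow tokenizer could otherwise go unnoticed.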
+Japanese efficiency, measured on a sample of 50K items (~85M characters) from the JA subset of the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset:
+| LLM                                           | Tokenizer                                           |   Vocab Size |   Avg Char/Token |
+|:----------------------------------------------|:----------------------------------------------------|-------------:|-----------------:|
+| *Shisa 7B (AUGMXNT)*                          | *augmxnt/mistral-7b-ja-v0.1*                        |     *120073* |           *2.31* |
+| OpenCALM (CyberAgent)                         | cyberagent/open-calm-7b                             |        52000 |             2.17 |
+| Japanese LargeLM (LINE)                       | line-corporation/japanese-large-lm-3.6b             |        51200 |             2.14 |
+| CALM2-7B (CyberAgent)                         | cyberagent/calm2-7b                                 |        65000 |             2.00 |
+| Bilingual-GPT-NeoX-4B (Rinna)                 | rinna/bilingual-gpt-neox-4b                         |        65536 |             1.88 |
+| Japanese StableLM Alpha (Stability AI)        | [novelai/nerdstash-tokenizer-v1](https://huggingface.co/NovelAI/nerdstash-tokenizer-v1) | 65535 | 1.85 |
+| Japanese-GPT-NeoX-3.6B (Rinna)                | rinna/japanese-gpt-neox-3.6b                        |        32000 |             1.83 |
+| Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b |        49247 |             1.79 |
+| llm-jp-13b (LLM-jp)                           | [llm-jp/llm-jp-13b-v1.0](https://github.com/llm-jp/llm-jp-tokenizer) | 50570 |    1.65 |
+| Japanese-Llama-2-7b-fast (ELYZA)              | elyza/ELYZA-japanese-Llama-2-7b-fast                |        45043 |             1.53 |
+| Qwen 14B (Qwen)                               | Qwen/Qwen-14B                                       |       151851 |             1.48 |
+| weblab-10b (Matsuo Lab)                       | EleutherAI/gpt-neox-20b                             |        50254 |             1.00 |
+| Japanese StableLM Gamma (Stability AI)        | mistralai/Mistral-7B-v0.1                           |        32000 |             0.95 |
+| Youri 7B (Rinna)                              | meta-llama/Llama-2-7B                               |        32000 |             0.88 |
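The "Avg Char/Token" column can be reproduced in spirit with a small helper like the one below. It is a sketch under one assumption: that the metric is total characters divided by total tokens over the whole sample. The exact procedure used for the table is not stated here, and the toy tokenizers only stand in for real ones.

```python
from typing import Callable, Iterable, Sequence


def avg_chars_per_token(texts: Iterable[str], tokenize: Callable[[str], Sequence]) -> float:
    """Total characters divided by total tokens across all texts."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens


def pair_tokenize(text: str) -> list:
    """Toy tokenizer: split the text into chunks of up to two characters."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


sample = ["こんにちは", "日本語のテキスト"]
char_level = avg_chars_per_token(sample, list)        # one token per character
two_char = avg_chars_per_token(sample, pair_tokenize)
print(f"{char_level:.2f} {two_char:.2f}")
```

With a real tokenizer, `tokenize` could be something like `lambda s: tokenizer.encode(s, add_special_tokens=False)`, so that special tokens do not skew the count.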
+We also test English efficiency using a sampling of 50K items (~177M characters) from the EN subset of the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset as a sanity check (and to see how other tokenizers fare):
+| LLM                                           | Tokenizer                                           |   Vocab Size |   Avg Char/Token |
+|:----------------------------------------------|:----------------------------------------------------|-------------:|-----------------:|
+| Qwen 14B (Qwen)                               | Qwen/Qwen-14B                                       |       151851 |             4.47 |
+| weblab-10b (Matsuo Lab)                       | EleutherAI/gpt-neox-20b                             |        50254 |             4.45 |
+| Japanese StableLM Alpha (Stability AI)        | [novelai/nerdstash-tokenizer-v1](https://huggingface.co/NovelAI/nerdstash-tokenizer-v1) | 65535 | 4.15 |
+| *Shisa 7B (AUGMXNT)*                          | *augmxnt/mistral-7b-ja-v0.1*                        |     *120073* |           *4.12* |
+| CALM2-7B (CyberAgent)                         | cyberagent/calm2-7b                                 |        65000 |             4.12 |
+| Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b |        49247 |             4.01 |
+| Japanese StableLM Gamma (Stability AI)        | mistralai/Mistral-7B-v0.1                           |        32000 |             4.01 |
+| Japanese-Llama-2-7b-fast (ELYZA)              | elyza/ELYZA-japanese-Llama-2-7b-fast                |        45043 |             3.86 |
+| Youri 7B (Rinna)                              | meta-llama/Llama-2-7B                               |        32000 |             3.86 |
+| llm-jp-13b (LLM-jp)                           | [llm-jp/llm-jp-13b-v1.0](https://github.com/llm-jp/llm-jp-tokenizer) | 50570 |   3.79 |
+| OpenCALM (CyberAgent)                         | cyberagent/open-calm-7b                             |        52000 |             2.83 |
+| Japanese LargeLM (LINE)                       | line-corporation/japanese-large-lm-3.6b             |        51200 |             2.49 |
+| Japanese-GPT-NeoX-3.6B (Rinna)                | rinna/japanese-gpt-neox-3.6b                        |        32000 |             2.42 |
+| Bilingual-GPT-NeoX-4B (Rinna)                 | rinna/bilingual-gpt-neox-4b                         |        65536 |             2.42 |
+With our extended tokenizer, we achieve class-leading JA token efficiency without any loss in EN token efficiency versus the base tokenizer. This is borne out in our testing, where we often see >2X JA inference speedups with our tokenizer.
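The table values give a quick way to see where a speedup of this size could come from. The sketch below only divides the characters-per-token figures from the two tables above; treating generation time as roughly proportional to the number of tokens is an assumption, not a measured result.

```python
# Avg Char/Token values copied from the JA and EN tables above.
JA = {"extended": 2.31, "base_mistral": 0.95}
EN = {"extended": 4.12, "base_mistral": 4.01}

# The same text needs this many times fewer tokens with the extended tokenizer.
ja_ratio = JA["extended"] / JA["base_mistral"]
en_ratio = EN["extended"] / EN["base_mistral"]

print(f"JA: {ja_ratio:.2f}x fewer tokens")
print(f"EN: {en_ratio:.2f}x fewer tokens")
```

A ratio of about 2.4 for Japanese text is consistent with the >2X inference speedups mentioned above, while the English ratio stays close to 1.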
 ## Acknowledgements
+Team: [Jon Durbin](https://huggingface.co/jondurbin), [Leonard Lin](https://huggingface.co/leonardlin)
 Compute for this model was generously sponsored by [AKA Virtual](https://akavirtual.com/) (Tokyo, Japan).
+Thanks to the [ELYZA](https://huggingface.co/elyza) team for publishing the details of their [tokenizer extension approach](https://zenn.dev/elyza/articles/2fd451c944649d) which we used as a starting point for our tokenizer.
+And of course, thanks to [Mistral AI](https://huggingface.co/mistralai) for releasing such a strong base model!
 ---
+`mistral-7b-ja-v0.1`は、[Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)を基にして、主に日本語の事前トレーニングのために追加で80億トークンを追加しています。日本語トークンは、[MADLAD-400](https://github.com/google-research/google-research/tree/master/madlad_400)から取得し、[DSIR](https://github.com/p-lambda/dsir)を使用しています。さらに、MADLAD-400 ENと様々なオープンデータソースからの英語トークンの10%を追加し、壊滅的忘却を防ぐために組み込んでいます。
+Mistralのトークン化器を12万トークンまで拡張し、日本語の効率を向上させました。私たちのトークン化器はJAでトークンあたり約2.3文字を実現しており、基本的なMistral 7Bのトークン化器はトークンあたり<1文字です。私たちの実装のコードは、[Shisaリポジトリ](https://github.com/AUGMXNT/shisa)で利用可能です。
+このベースモデルは、[Shisa 7B](https://huggingface.co/augmxnt/shisa-7b-v1)、私たちのJA/ENファインチューニングモデル用に作成されましたが、強力なパフォーマンスと効率的なバイリンガルトークン化器の組み合わせが有用であると考え、コミュニティに提供しています。
+トレーニングには、16 x A100-40マシンで2,400 A100-40 GPU時間を使用し、[DeepSpeed](https://github.com/microsoft/DeepSpeed) ZeRO-3で行いました。
+## パフォーマンス
+このベースモデルは、以前にリリースされたモデルよりもはるかに少ない追加事前トレーニングで、標準ベンチマークにおいて日本語性能の先頭を切ることができました。これは、より良くキュレーションされた事前トレーニングデータセットの使用によるものかもしれませんが、25億追加JAトークンでのアブレーションでも非常に強力な日本語パフォーマンスを示しました。
+私たちは、わずかに変更された[llm-jp-eval](https://github.com/llm-jp/llm-jp-eval)を使用しました（私たちのベースモデルは、プロンプトに`bos_token`を追加する必要があります。他のモデルについても、変更の有無にかかわらずテストし、すべてのモデルでテストされた高い結果を取りました）。ここでは、元のMistral 7Bベースモデルおよび[日本語Stable LM Instruct Gamma 7B](https://huggingface.co/stabilityai/japanese-stablelm-instruct-gamma-7b)（これはMistral 7Bベースであり、追加の1000億JA/ENトークンの事前トレーニングが行われています）と比較します。また、[Japanese-StableLM-Base-Beta-70B](https://huggingface.co/stabilityai/japanese-stablelm-base-beta-70b)（これはLlama 2 70Bで、追加の1000億JA/ENトークンの事前トレーニングが行われています）も参考に含まれています。
+![Mistral llm-jp-eval 比較]()
+ここでは、`mistral-7b-ja-v0.1`を他の最近リリースされた同じクラス（7Bパラメータ）の日本語チューニングモデルとも比較します。[ELYZA 7B fast model](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b-fast)および[Youri 7B](https://huggingface.co/rinna/youri-7b)はLlama 2 7Bモデルで、それぞれ180億と400億の追加事前トレーニングがあります。また、[CALM2-7B](https://huggingface.co/cyberagent/calm2-7b)と[llm-jp-13b]()は、1.3Tおよび3000億JA/ENトークンの事前トレーニングを行ったプリトレーニングモデルです。
+![7B llm-jp-eval パフォーマンス]()
+## トークン化器
+序文で触れたように、私たちのトークン化器はMistral 7Bトークン化器の拡張版で、語彙サイズは120073であり、128Kに合わせられています。残りの未使用トークンは、平均重み付けされた`<|extra_{idx}|>`トークンとして割り当てられています。
+私たちは「Fast」トークン化器を使用しており、これは`AutoTokenizer`のデフォルトであるべきですが、問題がある場合は`tokenizer.is_fast`をチェックするか、`use_fast=True`で初期化することを確認してください。
+[CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)データセットのJAサブセットから50Kアイテム（約8500万文字）をサンプリングした際の日本語効率：
+| LLM                                           | トークン化器                                        |   語彙サイズ |   1トークンあたりの平均文字数 |
+|:----------------------------------------------|:----------------------------------------------------|-------------:|-----------------:|
+| *Shisa 7B (AUGMXNT)*                          | *augmxnt/mistral-7b-ja-v0.1*                        |     *120073* |           *2.31* |
+| OpenCALM (CyberAgent)                         | cyberagent/open-calm-7b                             |        52000 |             2.17 |
+| Japanese LargeLM (LINE)                       | line-corporation/japanese-large-lm-3.6b             |        51200 |             2.14 |
+| CALM2-7B (CyberAgent)                         | cyberagent/calm2-7b                                 |        65000 |             2.00 |
+| Bilingual-GPT-NeoX-4B (Rinna)                 | rinna/bilingual-gpt-neox-4b                         |        65536 |             1.88 |
+| Japanese StableLM Alpha (Stability AI)        | [novelai/nerdstash-tokenizer-v1](https://huggingface.co/NovelAI/nerdstash-tokenizer-v1) | 65535 | 1.85 |
+| Japanese-GPT-NeoX-3.6B (Rinna)                | rinna/japanese-gpt-neox-3.6b                        |        32000 |             1.83 |
+| Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b |        49247 |             1.79 |
+| llm-jp-13b (LLM-jp)                           | [llm-jp/llm-jp-13b-v1.0](https://github.com/llm-jp/llm-jp-tokenizer) | 50570 |    1.65 |
+| Japanese-Llama-2-7b-fast (ELYZA)              | elyza/ELYZA-japanese-Llama-2-7b-fast                |        45043 |             1.53 |
+| Qwen 14B (Qwen)                               | Qwen/Qwen-14B                                       |       151851 |             1.48 |
+| weblab-10b (Matsuo Lab)                       | EleutherAI/gpt-neox-20b                             |        50254 |             1.00 |
+| Japanese StableLM Gamma (Stability AI)        | mistralai/Mistral-7B-v0.1                           |        32000 |             0.95 |
+| Youri 7B (Rinna)                              | meta-llama/Llama-2-7B                               |        32000 |             0.88 |
+また、[CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)データセットのENサブセットから50Kアイテム（約1億7700万文字）をサンプリングして、英語効率をテストしました。これは健全性チェック（および他のトークン化器のパフォーマンスを確認するため）として行われます：
+| LLM                                           | トークン化器                                        |   語彙サイズ |   1トークンあたりの平均文字数 |
+|:----------------------------------------------|:----------------------------------------------------|-------------:|-----------------:|
+| Qwen 14B (Qwen)                               | Qwen/Qwen-14B                                       |       151851 |             4.47 |
+| weblab-10b (Matsuo Lab)                       | EleutherAI/gpt-neox-20b                             |        50254 |             4.45 |
+| Japanese StableLM Alpha (Stability AI)        | [novelai/nerdstash-tokenizer-v1](https://huggingface.co/NovelAI/nerdstash-tokenizer-v1) | 65535 | 4.15 |
+| *Shisa 7B (AUGMXNT)*                          | *augmxnt/mistral-7b-ja-v0.1*                        |     *120073* |           *4.12* |
+| CALM2-7B (CyberAgent)                         | cyberagent/calm2-7b                                 |        65000 |             4.12 |
+| Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b |        49247 |             4.01 |
+| Japanese StableLM Gamma (Stability AI)        | mistralai/Mistral-7B-v0.1                           |        32000 |             4.01 |
+| Japanese-Llama-2-7b-fast (ELYZA)              | elyza/ELYZA-japanese-Llama-2-7b-fast                |        45043 |             3.86 |
+| Youri 7B (Rinna)                              | meta-llama/Llama-2-7B                               |        32000 |             3.86 |
+| llm-jp-13b (LLM-jp)                           | [llm-jp/llm-jp-13b-v1.0](https://github.com/llm-jp/llm-jp-tokenizer) | 50570 |   3.79 |
+| OpenCALM (CyberAgent)                         | cyberagent/open-calm-7b                             |        52000 |             2.83 |
+| Japanese LargeLM (LINE)                       | line-corporation/japanese-large-lm-3.6b             |        51200 |             2.49 |
+| Japanese-GPT-NeoX-3.6B (Rinna)                | rinna/japanese-gpt-neox-3.6b                        |        32000 |             2.42 |
+| Bilingual-GPT-NeoX-4B (Rinna)                 | rinna/bilingual-gpt-neox-4b                         |        65536 |             2.42 |
+私たちの拡張トークン化器を使用することで、基本トークン化器と比較してENパフォーマンスの損失なく、クラス最高のJAトークン効率を実現できました。これは私たちのテストで実証されており、トークン化器を使用することでJA推論速度が2倍以上になることがしばしばあります。
 ## 謝辞
+チーム：[Jon Durbin](https://huggingface.co/jondurbin)、[Leonard Lin](https://huggingface.co/leonardlin)
+このモデルの計算は、[AKA Virtual](https://akavirtual.com/)（日本、東京）によって寛大に提供されました。
+[ELYZA](https://huggingface.co/elyza)チームが公開した[トークン化器拡張アプローチ](https://zenn.dev/elyza/articles/2fd451c944649d)の詳細に感謝します。これは私たちのトークン化器の出発点として使用されました。
+もちろん、[Mistral AI](https://huggingface.co/mistralai)による強力なベースモデルのリリースに感謝します！

llm-jp-eval.ja.png ADDED

llm-jp-eval.mistral.png ADDED