tathi committed on
Commit 71d0e87 · verified · 1 Parent(s): f9652f3

add tokenizer info

Files changed (1)
  1. README.md +5 -4
README.md CHANGED
@@ -96,12 +96,13 @@ print(tokenizer.decode(output))
 ## Tokenizer (To be updated)
 
 The tokenizer of this model is based on [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
-The vocabulary entries were converted from [`llm-jp-tokenizer v2.1 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.1).
-Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure.
+The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
+Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (pure SentencePiece training alone does not reproduce our vocabulary).
+
 - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
-- **Training algorithm:** SentencePiece Unigram byte-fallback
+- **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
 - **Training data:** A subset of the datasets for model pre-training
-- **Vocabulary size:** 50,570 (mixed vocabulary of Japanese, English, and source code)
+- **Vocabulary size:** 48,588 (mixed vocabulary of Japanese, English, and source code)
 
 
 ## Datasets (To be updated)
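
Note on "byte-fallback": the README text in this diff names a Unigram byte-fallback model but does not show what the fallback does. The sketch below is a toy illustration only, not the llm-jp tokenizer or the `tokenizers` API. It assumes a hypothetical character-level vocabulary (`toy_vocab`) and two hypothetical helpers (`byte_fallback_pieces`, `decode_pieces`). A real Unigram model segments text into scored subword pieces; only the fallback step is illustrated here, in which a character missing from the vocabulary is emitted as one `<0xXX>` piece per UTF-8 byte, the naming convention SentencePiece uses for byte pieces.

```python
def byte_fallback_pieces(text, vocab):
    """Toy segmentation: keep in-vocabulary characters, else emit UTF-8 byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Out-of-vocabulary character: one <0xXX> piece per UTF-8 byte.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def decode_pieces(pieces):
    """Invert the toy segmentation by reassembling raw bytes."""
    buf = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            buf.append(int(p[3:5], 16))  # byte piece back to its byte value
        else:
            buf.extend(p.encode("utf-8"))  # ordinary piece
    return buf.decode("utf-8")


# Hypothetical three-entry vocabulary, for illustration only.
toy_vocab = {"a", "b", "日"}
pieces = byte_fallback_pieces("a日語", toy_vocab)
roundtrip = decode_pieces(pieces)
print(pieces)     # ['a', '日', '<0xE8>', '<0xAA>', '<0x9E>']
print(roundtrip)  # a日語
```

Because every byte value has a piece, no input maps to an unknown token, which is the practical reason to enable byte fallback for a mixed Japanese, English, and source-code vocabulary.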