Update README.md
README.md (CHANGED)
@@ -55,7 +55,7 @@ Checkpoints format: Hugging Face Transformers
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
-model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.
+model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.bfloat16)
 text = "自然言語処理とは何か"
 tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
 with torch.no_grad():
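The hunk above ends at `with torch.no_grad():` because only the changed region of the README is shown. For context, a minimal end-to-end sketch of the inference snippet follows, with the commit's `torch_dtype=torch.bfloat16` applied; the `generate()` arguments (`max_new_tokens`, `do_sample`, `top_p`, `temperature`) and the final `decode` call are illustrative assumptions, not values taken from this diff.

```python
# Minimal sketch of the full inference flow that the hunk truncates.
# The decoding settings below are assumptions; the only change this commit
# makes is torch_dtype=torch.bfloat16.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-13b-v2.0",
    device_map="auto",           # spread layers across available devices
    torch_dtype=torch.bfloat16,  # dtype this commit switches to
)

text = "自然言語処理とは何か"  # "What is natural language processing?"
tokenized_input = tokenizer.encode(
    text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    # Hypothetical decoding settings; the README's actual values are outside this hunk.
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
    )

print(tokenizer.decode(output[0]))
```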
@@ -97,7 +97,7 @@ The tokenizer of this model is based on [huggingface/tokenizers](https://github.
 The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (100k: code20K_en40K_ja60K.ver2.2)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
 Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-ja-tokenizer` for details on the vocabulary construction procedure (pure SentencePiece training does not reproduce our vocabulary).

-- **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model
+- **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model
 - **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
 - **Training data:** A subset of the datasets for model pre-training
 - **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
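The bullet list in this hunk describes a Hugging Face Fast Tokenizer with a Unigram byte-fallback model and a 96,867-entry mixed vocabulary. A minimal sketch of checking those properties through the standard `transformers` API is given below; the sample string and printed attributes are illustrative assumptions, not content from the README.

```python
# Sketch: inspect the tokenizer properties listed above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")

print(tokenizer.is_fast)     # Hugging Face Fast Tokenizer -> True
print(tokenizer.vocab_size)  # README states 96,867 entries (Japanese/English/code)

# Unigram byte-fallback: text outside the learned vocabulary still tokenizes,
# falling back to byte-level pieces instead of an unknown token.
sample = 'def greet(): return "こんにちは, world"'  # illustrative string, not from the README
print(tokenizer.tokenize(sample))
```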