Ahma-3B / train_sentencepiece.py
aapot
Add new tokenizer
40e6898
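import sentencepiece as spm

# Train a BPE tokenizer; writes tokenizer.model and tokenizer.vocab to the working directory.
spm.SentencePieceTrainer.train(
    input="/researchdisk/training_dataset_sentences/train.txt",
    model_prefix="tokenizer",
    model_type="bpe", vocab_size=64256,
    split_digits=True, byte_fallback=True,  # one piece per digit; unknown characters become byte pieces
    # Llama-2-style chat markers, each kept as a single piece that is never split
    user_defined_symbols=["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"],
    train_extremely_large_corpus=True,
    input_sentence_size=500_000_000, shuffle_input_sentence=True,  # sample at most 500M sentences at random
    num_threads=96,
)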
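
Three options in the training call above change how text is segmented: `user_defined_symbols`, `split_digits`, and `byte_fallback`. The following is a pure-Python sketch of what each one does to an input string. It is an illustration only, not SentencePiece's implementation: the helper names are hypothetical, and digit handling is assumed to cover ASCII digits.

```python
import re

# Hypothetical helpers for illustration; SentencePiece does this internally.
USER_SYMBOLS = ["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"]


def split_on_user_symbols(text, symbols=USER_SYMBOLS):
    """Cut text so that every user-defined symbol is its own chunk."""
    # Try longer symbols first so a short symbol never matches inside a longer one.
    ordered = sorted(symbols, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in ordered) + ")"
    return [chunk for chunk in re.split(pattern, text) if chunk]


def split_digits(text):
    """Isolate every ASCII digit, mimicking split_digits=True."""
    return re.findall(r"[0-9]|[^0-9]+", text)


def byte_fallback(char):
    """Spell a character as <0xNN> byte pieces, mimicking byte_fallback=True."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


if __name__ == "__main__":
    print(split_on_user_symbols("[INST] Hei [/INST]"))
    print(split_digits("vuosi 2024"))
    print(byte_fallback("€"))
```

Once training has finished, the real behaviour can be checked by loading the output with `spm.SentencePieceProcessor(model_file="tokenizer.model")` and calling `encode(text, out_type=str)` on a few sample strings.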