|
# training llama tokenizer |
|
|
|
How does Meta train their sentencepiece tokenizer? You can print the config as follows: |
|
|
|
```python |
|
import sentencepiece.sentencepiece_model_pb2

# tokenizer.model is a serialized protobuf; parse it and print the training config
mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
mp.ParseFromString(open("tokenizer.model", "rb").read())
print(mp.trainer_spec)
print(mp.normalizer_spec)
|
``` |
|
|
|
this gives: |
|
|
|
``` |
|
trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.9999499917030334
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  num_threads: 80
  num_sub_iterations: 2
  max_sentence_length: 4192
  shuffle_input_sentence: true
  max_sentencepiece_length: 16
  split_by_unicode_script: true
  split_by_whitespace: true
  split_by_number: true
  treat_whitespace_as_suffix: false
  split_digits: true
  allow_whitespace_only_pieces: true
  vocabulary_output_piece_score: true
  hard_vocab_limit: true
  use_all_vocab: false
  byte_fallback: true
  required_chars: ""
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_surface: " \342\201\207 "
  unk_piece: "<unk>"
  bos_piece: "<s>"
  eos_piece: "</s>"
  pad_piece: "<pad>"
  train_extremely_large_corpus: false
  enable_differential_privacy: false
  differential_privacy_noise_level: 0.0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: true
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}
|
``` |
|
|
|
We can use sentencepiece's `spm_train` to train models in the same way, just optionally smaller. Here are their [options docs](https://github.com/google/sentencepiece/blob/master/doc/options.md) we can refer to. It's not much, but it helps.
|
|
|
We'll depart from Meta on one setting: I recommend changing `character_coverage` -> 1.0. We also want to make sure to note the following important settings that come up in the paper and are not necessarily the default sentencepiece settings:
|
|
|
``` |
|
--split_digits = true
--allow_whitespace_only_pieces = true
--byte_fallback = true
--normalization_rule_name = identity
|
``` |
|
|
|
With this in mind, we can train a sentencepiece vocab in what I believe is probably the same way Meta trained theirs:
|
|
|
``` |
|
spm_train --input="$input" \
          --model_prefix="$model_prefix" \
          --model_type=bpe \
          --vocab_size="$vocab_size" \
          --self_test_sample_size=0 \
          --input_format="text" \
          --character_coverage=1.0 \
          --num_threads="$(nproc)" \
          --split_digits=true \
          --allow_whitespace_only_pieces=true \
          --byte_fallback=true \
          --unk_surface=" \342\201\207 " \
          --normalization_rule_name=identity
|
``` |
|
|
|
Here `$input` is the input text file, `$model_prefix` is the output path prefix, `$vocab_size` is the desired vocabulary size, and with `--num_threads="$(nproc)"` we're by default taking over all the CPU cores of the machine.
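
Once training finishes you get `$model_prefix.model` (and a `.vocab` file) on disk. As a quick sanity check, here is a minimal sketch of loading it with the Python bindings and encoding a string (the model file name is a placeholder):

```python
import sentencepiece as spm

# load the trained model (placeholder file name) and inspect a tokenization
sp = spm.SentencePieceProcessor(model_file="spm_model.model")
print(sp.encode("hello world 123", out_type=str))  # pieces
print(sp.encode("hello world 123", out_type=int))  # token ids
```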
|
|
|
Lastly, note that sentencepiece is weird and expects "sentences" delimited by newlines as the input. You can't just put in one massive block of text. And it has a hyperparameter (`max_sentence_length`) that controls the maximum size of a "sentence". Fwiw I really dislike this design choice around a weird concept of a "sentence". It should just be a block of text with no assumptions. But here we are.
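
For example, here is a minimal sketch (file names are made up) of flattening a raw text dump into the newline-delimited format sentencepiece expects, one chunk per line:

```python
# hypothetical pre-processing: sentencepiece treats each input line as one "sentence",
# so write one paragraph per line instead of one giant blob
with open("raw.txt", "r", encoding="utf-8") as f:
    text = f.read()

# split on blank lines into paragraphs and collapse internal newlines;
# each output line should stay under max_sentence_length
chunks = [c.strip().replace("\n", " ") for c in text.split("\n\n") if c.strip()]

with open("train_sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(chunks) + "\n")
```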
|
|
|
Look into the file `tinystories.py`, where we train the vocab in the same way but using the Python bindings instead.
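
For reference, a rough sketch of what that looks like with the Python bindings (the input file, model prefix, and vocab size below are placeholders; see `tinystories.py` for the actual call):

```python
import os
import sentencepiece as spm

# same settings as the spm_train command above, via the Python API
# (input file, model_prefix, and vocab_size are placeholders)
spm.SentencePieceTrainer.train(
    input="train_sentences.txt",
    model_prefix="tok32000",
    model_type="bpe",
    vocab_size=32000,
    self_test_sample_size=0,
    input_format="text",
    character_coverage=1.0,
    num_threads=os.cpu_count(),
    split_digits=True,
    allow_whitespace_only_pieces=True,
    byte_fallback=True,
    unk_surface=r" \342\201\207 ",
    normalization_rule_name="identity",
)
```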
|
|