|
--- |
|
library_name: transformers |
|
tags: |
|
- tokenizer |
|
license: mit |
|
datasets: |
|
- DKYoon/SlimPajama-6B |
|
--- |
|
|
|
# sail-slimpajama-6B-32768-BPE-tokenizer |
|
|
|
|
|
This is a standalone, directly loadable copy of the 32768-vocab-size BPE tokenizer from the [sail vocab scaling laws study](https://huggingface.co/sail/scaling-with-vocab-trained-tokenizers/tree/main/hf_slimpajama-6B-32768-BPE).
|
|
|
|
|
## Usage |
|
|
|
|
|
```py |
|
from transformers import AutoTokenizer
|
tk = AutoTokenizer.from_pretrained('pszemraj/sail-slimpajama-6B-32768-BPE-tokenizer') |
|
``` |
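
A quick round-trip sanity check, continuing from the snippet above (the sample sentence is arbitrary; the exact token ids depend on this tokenizer's merges):

```py
# encode a sample sentence and decode it back
ids = tk("Scaling laws for vocabulary size.")["input_ids"]
print(len(ids))        # number of tokens for this sentence
print(tk.decode(ids))  # should reproduce the input text (modulo special tokens)
```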
|
|
|
details: |
|
|
|
``` |
|
LlamaTokenizerFast(name_or_path='pszemraj/sail-slimpajama-6B-32768-BPE-tokenizer', vocab_size=32768, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={ |
|
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), |
|
1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), |
|
2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), |
|
3: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), |
|
} |
|
``` |
|
|