Problems with tokenizer

#48
by abdurnawaz - opened

Why does the tokenizer work in weird ways?

tokenizer("\n\n", add_special_tokens=False)
{'input_ids': [28705, 13, 13], 'attention_mask': [1, 1, 1]}

But when you add a "." in the beginning:

tokenizer(".\n\n", add_special_tokens=False)
{'input_ids': [842, 13, 13], 'attention_mask': [1, 1, 1]}

Shouldn't it be [842, 28705, 13, 13], since the token for "." is 842?

I'm trying to finetune the model and I want it to stop at "\n\n". I'm using a stopping criterion based on tokens, and this problem is preventing the model from stopping generation even though the characters "\n\n" are generated.

Hey! I am not sure what is wrong here. A prefix space is added as expected. See the tokenization:

>>> from transformers import AutoTokenizer
>>> tokenizer  = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
>>> tokenizer.tokenize("\n\n", add_special_tokens=False)
 ['▁', '<0x0A>', '<0x0A>']

>>> tokenizer.tokenize(".\n\n", add_special_tokens=False)
 ['▁.', '<0x0A>', '<0x0A>']

The string "\n\n" is not a single token in the vocab. I would recommend stopping based on the two token ids (13, 13) rather than taking the prefix space into account.
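To illustrate the suggestion above, here is a minimal sketch of a stopping check on the raw token ids. It looks for two consecutive occurrences of id 13 (the `<0x0A>` newline byte token in the Mistral-7B-v0.1 vocab) at the end of the generated sequence, instead of comparing against the ids of a re-tokenized "\n\n" string, which would pick up the spurious prefix-space token 28705. The function name and structure are illustrative; in practice you would wrap this logic in a `transformers.StoppingCriteria` subclass and pass it to `model.generate()` via a `StoppingCriteriaList`.

```python
# Id of the "<0x0A>" newline byte token in the Mistral-7B-v0.1 tokenizer.
NEWLINE_ID = 13

def should_stop(token_ids):
    """Return True once the generated ids end in two consecutive newline tokens."""
    return len(token_ids) >= 2 and token_ids[-2:] == [NEWLINE_ID, NEWLINE_ID]
```

For example, `should_stop([842, 13, 13])` is True (the ids for ".\n\n"), while `should_stop([28705, 13])` is False, so the check works regardless of whether the tokenizer inserted a prefix-space token earlier in the sequence.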
