Problems with tokenizer

#48
by abdurnawaz - opened

Why does the tokenizer work in weird ways?

tokenizer("\n\n", add_special_tokens=False)
{'input_ids': [28705, 13, 13], 'attention_mask': [1, 1, 1]}

But when you add a "." in the beginning:

tokenizer(".\n\n", add_special_tokens=False)
{'input_ids': [842, 13, 13], 'attention_mask': [1, 1, 1]}

Shouldn't it be [842, 28705, 13, 13], since the token for "." is 842?

I'm trying to finetune the model and I want it to stop at "\n\n". I'm using a stopping criterion based on tokens, and this problem is preventing the model from stopping generation even though the characters "\n\n" are generated.

Hey! I am not sure what is wrong here. A prefix space is added as expected. See the tokenization:

>>> from transformers import AutoTokenizer
>>> tokenizer  = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
>>> tokenizer.tokenize("\n\n", add_special_tokens=False)
 ['▁', '<0x0A>', '<0x0A>']

>>> tokenizer.tokenize(".\n\n", add_special_tokens=False)
 ['▁.', '<0x0A>', '<0x0A>']

The string "\n\n" is not a single token in the vocab. I would recommend stopping based on the two token ids (13, 13) rather than taking the prefix space into account.
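To illustrate the suggestion above, here is a minimal sketch of a stopping check on the raw token ids. It looks for two consecutive occurrences of id 13 (the `<0x0A>` newline byte token in the Mistral-7B-v0.1 vocab) at the end of the generated sequence, instead of comparing against the ids of a re-tokenized "\n\n" string, which would pick up the spurious prefix-space token 28705. The function name and structure are illustrative; in practice you would wrap this logic in a `transformers.StoppingCriteria` subclass and pass it to `model.generate()` via a `StoppingCriteriaList`.

```python
# Id of the "<0x0A>" newline byte token in the Mistral-7B-v0.1 tokenizer.
NEWLINE_ID = 13

def should_stop(token_ids):
    """Return True once the generated ids end in two consecutive newline tokens."""
    return len(token_ids) >= 2 and token_ids[-2:] == [NEWLINE_ID, NEWLINE_ID]
```

For example, `should_stop([842, 13, 13])` is True (the ids for ".\n\n"), while `should_stop([28705, 13])` is False, so the check works regardless of whether the tokenizer inserted a prefix-space token earlier in the sequence.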
