Problems with tokenizer #48
by abdurnawaz
Why does the tokenizer work in weird ways?
tokenizer("\n\n", add_special_tokens=False)
{'input_ids': [28705, 13, 13], 'attention_mask': [1, 1, 1]}
But when you add a "." at the beginning:
tokenizer(".\n\n", add_special_tokens=False)
{'input_ids': [842, 13, 13], 'attention_mask': [1, 1, 1]}
Shouldn't it be [842, 28705, 13, 13], since the token for "." is 842?
I'm trying to finetune the model and I want it to stop at "\n\n". I'm using a stopping criterion based on token IDs, and this behavior prevents the model from stopping generation even though the characters "\n\n" are produced.
Hey! I am not sure what is wrong here. A prefix space is added as expected. See the tokenization:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
>>> tokenizer.tokenize("\n\n", add_special_tokens=False)
['▁', '<0x0A>', '<0x0A>']
>>> tokenizer.tokenize(".\n\n", add_special_tokens=False)
['▁.', '<0x0A>', '<0x0A>']
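For reference, mapping the IDs from your two examples back to tokens makes the difference visible: 28705 is the bare prefix space '▁', while 842 is '▁.', i.e. the prefix space fused with the period, which is why no separate 28705 shows up in the second case:
>>> tokenizer.convert_ids_to_tokens([28705, 842, 13])
['▁', '▁.', '<0x0A>']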
The string "\n\n" is not a single token in the vocab; I would recommend stopping based on the two token IDs (13, 13) rather than taking the prefix space into account.
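For example, here is a minimal sketch of such a criterion (assuming batch size 1 and the standard transformers generate() API; the class name is just illustrative):

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

NEWLINE_ID = 13  # id of '<0x0A>' in this vocab

class StopOnDoubleNewline(StoppingCriteria):
    # Stop once the last two generated ids are both 13, i.e. "\n\n" was produced.
    def __call__(self, input_ids: torch.LongTensor, scores, **kwargs) -> bool:
        if input_ids.shape[1] < 2:
            return False
        return (input_ids[0, -1].item() == NEWLINE_ID
                and input_ids[0, -2].item() == NEWLINE_ID)

# usage, assuming `model` and `inputs` are already set up:
# output = model.generate(**inputs,
#     stopping_criteria=StoppingCriteriaList([StopOnDoubleNewline()]))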