Can anyone tell me why there are so many empty-space tokens added to the tokenizer?

#98
by parikshit1619 - opened

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
[Screenshot: the tokenizer's added tokens, many of which are runs of spaces and tabs]
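In case the screenshot does not load, the added tokens can also be listed directly from the tokenizer. A minimal sketch, assuming a recent transformers release:

from transformers import AutoTokenizer

# List the tokens that were added on top of the base vocabulary
# (roughly what the screenshot above shows).
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

for token, token_id in sorted(tokenizer.get_added_vocab().items(), key=lambda kv: kv[1]):
    print(token_id, repr(token))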

Microsoft org

Hello @parikshit1619 !

This was inherited from the Salesforce/codegen-350M-mono tokenizer, which addresses one of the main issues when tokenizing code: long runs of spaces and tabs. Tokens were added to represent different run lengths of whitespace, which reduces the total number of tokens needed to encode indented code.
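For instance, here is a minimal sketch (behaviour may differ slightly across transformers versions; the input line is only illustrative) that encodes an indented line and inspects the resulting tokens:

from transformers import AutoTokenizer

# Encode an indented line of code and look at the resulting tokens.
# The run of leading spaces should map to a single added whitespace token
# rather than eight separate single-space tokens.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

line = "        return x"  # eight leading spaces
ids = tokenizer.encode(line)
print(tokenizer.convert_ids_to_tokens(ids))
print(f"{len(ids)} tokens")

This keeps deeply indented code from consuming a disproportionate share of the context window.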

Regards,
Gustavo.

gugarosa changed discussion status to closed
