Can anyone tell me why there are so many empty-space tokens added to the tokenizer?

#98
by parikshit1619 - opened

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
[Screenshot: the tokenizer's added tokens, many of which are runs of spaces and tabs]
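In case the screenshot does not load, the added tokens can also be listed directly from the tokenizer. A minimal sketch, assuming a recent transformers release:

from transformers import AutoTokenizer

# List the tokens that were added on top of the base vocabulary
# (roughly what the screenshot above shows).
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

for token, token_id in sorted(tokenizer.get_added_vocab().items(), key=lambda kv: kv[1]):
    print(token_id, repr(token))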

Microsoft org

Hello @parikshit1619 !

This was inherited from the Salesforce/codegen-350M-mono tokenizer, which addresses one of the main issues when tokenizing code: long runs of spaces and tabs. Tokens were added to represent different run lengths of whitespace, which reduces the total number of tokens needed to encode indented code.
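For instance, here is a minimal sketch (behaviour may differ slightly across transformers versions; the input line is only illustrative) that encodes an indented line and inspects the resulting tokens:

from transformers import AutoTokenizer

# Encode an indented line of code and look at the resulting tokens.
# The run of leading spaces should map to a single added whitespace token
# rather than eight separate single-space tokens.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

line = "        return x"  # eight leading spaces
ids = tokenizer.encode(line)
print(tokenizer.convert_ids_to_tokens(ids))
print(f"{len(ids)} tokens")

This keeps deeply indented code from consuming a disproportionate share of the context window.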

Regards,
Gustavo.

gugarosa changed discussion status to closed
