Can anyone tell me why there are so many empty-space tokens added to the tokenizer?
#98, opened by parikshit1619
Hello @parikshit1619!
This was inherited from the Salesforce/codegen-350M-mono tokenizer, which addressed one of the issues that arises when tokenizing code: long runs of spaces and tabs. It adds dedicated tokens that each represent a different number of consecutive spaces or tabs, which reduces the number of tokens needed to encode the same text.
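As a rough illustration of the effect, here is a toy sketch in Python. It is not the real tokenizer: the run lengths (2 to 8 spaces, 2 to 4 tabs), the `WHITESPACE_RUNS` list, and the `count_whitespace_tokens` helper are all hypothetical, and a real BPE tokenizer merges whitespace differently. It only shows why single tokens for whitespace runs shrink the token count for indented code.

```python
# Toy illustration only: this is NOT the real CodeGen tokenizer.
# Hypothetical vocabulary: runs of 2..8 spaces and 2..4 tabs are single tokens.
WHITESPACE_RUNS = [" " * n for n in range(8, 1, -1)] + ["\t" * n for n in range(4, 1, -1)]


def count_whitespace_tokens(text, run_tokens=()):
    """Count the tokens needed for the spaces and tabs in `text`.

    Greedy longest match: at each whitespace position, consume the longest
    run token that matches, otherwise consume one character as one token.
    Non-whitespace characters are skipped, since only whitespace is compared.
    """
    runs = sorted(run_tokens, key=len, reverse=True)
    count = 0
    i = 0
    while i < len(text):
        if text[i] not in " \t":
            i += 1
            continue
        for run in runs:
            if text.startswith(run, i):
                i += len(run)
                break
        else:
            # No run token matched here: a single space or tab is one token.
            i += 1
        count += 1
    return count


snippet = "def f(x):\n        if x:\n                return x\n"
baseline = count_whitespace_tokens(snippet)                    # one token per space/tab
with_runs = count_whitespace_tokens(snippet, WHITESPACE_RUNS)  # run tokens available
print(baseline, with_runs)  # prints "27 6"
```

To see the actual added tokens, you could load the tokenizer with `AutoTokenizer.from_pretrained` from `transformers` and inspect `get_added_vocab()`.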
Regards,
Gustavo.
gugarosa changed discussion status to closed