🚨🚨 Big bug in Tokenizer!! 🚨🚨
#4 by JoaoLages - opened
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m-py")
code = """
# this is a code comment
<extra_id_0>
"""
print(tokenizer.decode(tokenizer(code)["input_ids"]))
output:
<s>
# this is a code comment<extra_id_0>
</s>
It seems that \n and \t are not being encoded (or decoded) properly :(
I just found out that \n and \t have the exact same token id 😐
tokenizer.convert_tokens_to_ids(["\n", "\t"])
Out[35]: [3, 3]
Edit: yes, they are both the UNK id
tokenizer.unk_token_id
Out[39]: 3
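For what it's worth, this is consistent with "\n" and "\t" simply being absent from the vocabulary, so convert_tokens_to_ids falls back to the UNK id for both. A minimal sketch of that fallback behaviour (the toy vocab and ids below are made up for illustration, not the real CodeT5+ vocab):

```python
UNK_ID = 3  # matches the tokenizer.unk_token_id reported above

# hypothetical toy vocab standing in for the real one; note "\n" and "\t" are missing
vocab = {"<s>": 1, "</s>": 2, "#": 7, "comment": 8}

def convert_tokens_to_ids(tokens):
    # any token not in the vocab collapses to the single UNK id
    return [vocab.get(tok, UNK_ID) for tok in tokens]

print(convert_tokens_to_ids(["\n", "\t"]))  # → [3, 3], exactly like the real output above
```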
It seems that the problem is with \n and \t before the special tokens:
aux
Out[58]: '\t\n# this is a code comment\n\t<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
Out[59]: '<s>\t\n# this is a code comment<extra_id_0></s>'
aux
Out[62]: '\n# this is a code comment\n<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
Out[63]: '<s>\n# this is a code comment<extra_id_0></s>'