🚨🚨 Big bug in Tokenizer!! 🚨🚨
#4 by JoaoLages - opened
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m-py")
code = """
# this is a code comment
<extra_id_0>
"""
print(tokenizer.decode(tokenizer(code)["input_ids"]))
output:
<s>
# this is a code comment<extra_id_0>
</s>
It seems that \n and \t are not being encoded (or decoded) properly :(
I just found out that \n and \t have the exact same token id 😐
tokenizer.convert_tokens_to_ids(["\n", "\t"])
Out[35]: [3, 3]
Edit: yes, they are both the UNK id
tokenizer.unk_token_id
Out[39]: 3
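For what it's worth, this is consistent with "\n" and "\t" simply being absent from the vocabulary, so convert_tokens_to_ids falls back to the UNK id for both. A minimal sketch of that fallback behaviour (the toy vocab and ids below are made up for illustration, not the real CodeT5+ vocab):

```python
UNK_ID = 3  # matches the tokenizer.unk_token_id reported above

# hypothetical toy vocab standing in for the real one; note "\n" and "\t" are missing
vocab = {"<s>": 1, "</s>": 2, "#": 7, "comment": 8}

def convert_tokens_to_ids(tokens):
    # any token not in the vocab collapses to the single UNK id
    return [vocab.get(tok, UNK_ID) for tok in tokens]

print(convert_tokens_to_ids(["\n", "\t"]))  # → [3, 3], exactly like the real output above
```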
It seems that the problem is with \n and \t before the special tokens:
aux
Out[58]: '\t\n# this is a code comment\n\t<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
Out[59]: '<s>\t\n# this is a code comment<extra_id_0></s>'
aux
Out[62]: '\n# this is a code comment\n<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
Out[63]: '<s>\n# this is a code comment<extra_id_0></s>'