Llama3 Instruct Tokenizers.Encoding.offsets is wrong

#180
by AlignLearner - opened

Script

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(t("are you ok?", add_special_tokens=False)[0].offsets)

Output

[(0, 0), (3, 3), (7, 7), (10, 10)]

Expected Output

[(0, 3), (3, 7), (7, 10), (10, 11)]
Script

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(t("今天天气好", add_special_tokens=False)[0].offsets)

Output

[(0, 2), (2, 3), (3, 4), (4, 5)]

When encoding Chinese characters, its output is correct.
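As a workaround until the offsets are fixed, one could reconstruct them by greedily matching each decoded token piece against the original input string. This is a minimal, model-independent sketch; it assumes the decoded pieces (e.g. obtained via `tokenizer.batch_decode` of the individual token IDs) concatenate back to the input text, and the `pieces` list below is a hypothetical example of Llama 3's BPE pieces for "are you ok?":

```python
def reconstruct_offsets(text, pieces):
    """Greedily align each decoded token piece with the input text,
    returning (start, end) character offsets for every piece."""
    offsets = []
    pos = 0
    for piece in pieces:
        # Find the piece at or after the current position.
        start = text.index(piece, pos)
        end = start + len(piece)
        offsets.append((start, end))
        pos = end
    return offsets

# Hypothetical decoded pieces for "are you ok?" (spaces kept with the piece)
pieces = ["are", " you", " ok", "?"]
print(reconstruct_offsets("are you ok?", pieces))
# [(0, 3), (3, 7), (7, 10), (10, 11)]
```

This matches the expected output above; it will fail with `ValueError` if a piece does not occur verbatim in the text (e.g. after normalization), so it is only a stopgap, not a replacement for correct offsets from the tokenizer.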

AlignLearner changed discussion status to closed
