Llama3 Instruct Tokenizers.Encoding.offsets is wrong

#180
by AlignLearner - opened

Script

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(t("are you ok?", add_special_tokens=False)[0].offsets)

Output

[(0, 0), (3, 3), (7, 7), (10, 10)]

Expected Output

[(0, 3), (3, 7), (7, 10), (10, 11)]
Script

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(t("今天天气好", add_special_tokens=False)[0].offsets)

Output

[(0, 2), (2, 3), (3, 4), (4, 5)]

When encoding Chinese characters, its output is correct.
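As a workaround until the offsets are fixed, one could reconstruct them by greedily matching each decoded token piece against the original input string. This is a minimal, model-independent sketch; it assumes the decoded pieces (e.g. obtained via `tokenizer.batch_decode` of the individual token IDs) concatenate back to the input text, and the `pieces` list below is a hypothetical example of Llama 3's BPE pieces for "are you ok?":

```python
def reconstruct_offsets(text, pieces):
    """Greedily align each decoded token piece with the input text,
    returning (start, end) character offsets for every piece."""
    offsets = []
    pos = 0
    for piece in pieces:
        # Find the piece at or after the current position.
        start = text.index(piece, pos)
        end = start + len(piece)
        offsets.append((start, end))
        pos = end
    return offsets

# Hypothetical decoded pieces for "are you ok?" (spaces kept with the piece)
pieces = ["are", " you", " ok", "?"]
print(reconstruct_offsets("are you ok?", pieces))
# [(0, 3), (3, 7), (7, 10), (10, 11)]
```

This matches the expected output above; it will fail with `ValueError` if a piece does not occur verbatim in the text (e.g. after normalization), so it is only a stopgap, not a replacement for correct offsets from the tokenizer.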

AlignLearner changed discussion status to closed
