Update tokenizer_config.json to prepend the bos token

#35
by eduagarcia - opened

As discussed in #9, the current HF tokenizer does not prepend the bos token (id: 128000), unlike the reference implementation:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/generation.py#L256

and in their test cases:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/test_tokenizer.py#L23

This commit changes the tokenizer_class from "PreTrainedTokenizerFast" to "LlamaTokenizer", because PreTrainedTokenizerFast doesn't seem to support the add_bos_token flag.

Before the fix:

!git clone https://github.com/meta-llama/llama3.git

from llama3.llama import Tokenizer
from transformers import AutoTokenizer
llama_tokenizer = Tokenizer("llama3/Meta-Llama-3-8B/tokenizer.model")
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "This is a test sentence"

orig_enc = llama_tokenizer.encode(text, bos=True, eos=False)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
hf_enc = hf_tokenizer.encode(text)
# [2028, 374, 264, 720, 1296, 271, 52989]

After the fix:

from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", revision="refs/pr/35")

text = "This is a test sentence"

hf_enc = hf_tokenizer.encode(text)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
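For code that has to run against a tokenizer revision without the fix, one possible stopgap is to prepend the id by hand. This is only a sketch: `ensure_bos` is a hypothetical helper name (not part of transformers), and it assumes the bos id is 128000 as quoted above.

```python
BOS_ID = 128000  # bos token id quoted in this thread


def ensure_bos(ids, bos_id=BOS_ID):
    """Return a copy of ids that starts with bos_id, without duplicating it."""
    ids = list(ids)
    if ids and ids[0] == bos_id:
        return ids
    return [bos_id] + ids


# Ids copied from the "before the fix" example above.
hf_enc_before = [2028, 374, 264, 720, 1296, 271, 52989]
print(ensure_bos(hf_enc_before))
```

The helper is idempotent, so it is safe to leave in place after switching to a fixed tokenizer revision.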

@eduagarcia does this fix also apply to the meta-llama/Meta-Llama-3-8B-Instruct model?

see:

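from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# `content` and `self.parameters` come from the surrounding application code.
payload = {
    "inputs": tokenizer.apply_chat_template(
        [
            {
                "role": "user",
                "content": content,
            }
        ],
        tokenize=False,
    ),
    "parameters": self.parameters,
}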

If you are using the chat_template, it makes no difference: the chat_template already prepends the BOS token. This problem only applies if you are not using the template, as with this base model.

From my tests, tokenizer.apply_chat_template(dialog, add_generation_prompt=True) produces the same output as ChatFormat(tokenizer).encode_dialog_prompt(dialog) from the reference implementation.

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
test = hf_tokenizer.apply_chat_template(
    [
        {
            "role": "system",
            "content": "This is a test sentence.",
        },
        {
            "role": "user",
            "content": "This is a response.",
        },
    ],
    add_generation_prompt=True,
)
print(test)
# [128000, 128006, 9125, 128007, 271, 2028, 374, 264, 1296, 11914, 13, 128009, 128006, 882, 128007, 271, 2028, 374, 264, 2077, 13, 128009, 128006, 78191, 128007, 271]
#  ^ bos_token
# These are the same ids as in the test in the official repo: https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/test_tokenizer.py#L68
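To make the layout of those ids easier to read, here is a rough pure-Python approximation of what the Llama 3 chat template renders. It is a hand-written sketch, not the Jinja template shipped with the model: `render_llama3_chat` is a hypothetical name, and the special-token strings are assumed to map to ids 128000 (`<|begin_of_text|>`), 128006 (`<|start_header_id|>`), 128007 (`<|end_header_id|>`) and 128009 (`<|eot_id|>`). Returning a string corresponds to calling apply_chat_template with tokenize=False, and the trailing assistant header is what add_generation_prompt=True adds.

```python
BOS = "<|begin_of_text|>"


def render_llama3_chat(messages, add_generation_prompt=False):
    """Approximate the Llama 3 chat format as a plain string."""
    parts = []
    for index, message in enumerate(messages):
        block = (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
        if index == 0:
            # The bos token is emitted once, in front of the first message.
            block = BOS + block
        parts.append(block)
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


dialog = [
    {"role": "system", "content": "This is a test sentence."},
    {"role": "user", "content": "This is a response."},
]
print(render_llama3_chat(dialog, add_generation_prompt=True))
```

Because the rendered string already starts with the bos token, tokenizing it again with special tokens enabled could, under this sketch's assumptions, yield a second bos token.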

@eduagarcia so it looks like I can use meta-llama/Meta-Llama-3-8B-Instruct for chat.

@eduagarcia what do tokenize=False and add_generation_prompt=True do?

Although add_bos should be used, what we need to update here is tokenizer.json: the template processor is what has to add the BOS token. I'll update it.
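As an illustration of how a template processor adds the token, here is a minimal sketch using the tokenizers library with a toy word-level vocabulary. Everything about the vocabulary and its ids is made up for the example (the real bos id is 128000), and this is not the actual tokenizer.json change.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocabulary; these ids are hypothetical and unrelated to Llama 3.
vocab = {
    "<|begin_of_text|>": 0,
    "[UNK]": 1,
    "This": 2,
    "is": 3,
    "a": 4,
    "test": 5,
    "sentence": 6,
}

tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# The post-processor prepends the bos token to every encoded sequence.
tokenizer.post_processor = TemplateProcessing(
    single="<|begin_of_text|> $A",
    pair="<|begin_of_text|> $A <|begin_of_text|> $B:1",
    special_tokens=[("<|begin_of_text|>", 0)],
)

text = "This is a test sentence"
with_bos = tokenizer.encode(text).ids
without_bos = tokenizer.encode(text, add_special_tokens=False).ids
print(with_bos)     # [0, 2, 3, 4, 5, 6]
print(without_bos)  # [2, 3, 4, 5, 6]
```

With the post-processor in place, the token is added by default and callers can still opt out per call with add_special_tokens=False.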

Meta Llama org

This PR addresses @ArthurZ's comments.

pcuenq changed pull request status to closed
