Allow loading via AutoTokenizer
#3
by tomaarsen - opened
Hello!
Pull Request overview
- Allow loading a tokenizer via `AutoTokenizer.from_pretrained("togethercomputer/m2-bert-80M-8k-retrieval", trust_remote_code=True)`
Details
I wanted to load the tokenizer for this model via `AutoTokenizer`:
```python
from transformers import AutoTokenizer

testing_string = "Every morning, I make a cup of coffee to start my day."

tokenizer = AutoTokenizer.from_pretrained(
    "togethercomputer/m2-bert-80M-8k-retrieval",
    trust_remote_code=True,
)

input_ids = tokenizer(
    [testing_string],
    return_tensors="pt",
    padding="max_length",
    return_token_type_ids=False,
    truncation=True,
)
print(input_ids)
```
But that isn't possible, I'm afraid:
```
You are using a model of type m2_bert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "c:\code\m2-bert-80M-8k-retrieval\demo.py", line 6, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 710, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\transformers\dynamic_module_utils.py", line 480, in get_class_from_dynamic_module
    module_file, class_name = class_reference.split(".")
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 1)
```
The root cause is this line in the config.json: https://huggingface.co/togethercomputer/m2-bert-80M-8k-retrieval/blob/main/config.json#L12

The problem is two-fold:
- The `AutoTokenizer` entry in the `auto_map` expects a tuple: the first element should be a "slow" tokenizer class, and the second a "fast" tokenizer class.
- The values there should be class references, not model names (see the sketch after this list).
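For context: `transformers` resolves such a reference by splitting it on a dot into a module file and a class name, which is exactly where the traceback above fails. A minimal sketch of the difference; the `tokenization_m2_bert.M2BertTokenizer` reference and the model name are both purely illustrative, not taken from the actual config:

```python
# transformers expects an auto_map reference of the form "module.ClassName"
valid_reference = "tokenization_m2_bert.M2BertTokenizer"  # hypothetical class reference
module_file, class_name = valid_reference.split(".")  # OK: splits into two values

broken_reference = "bert-base-uncased"  # a model name: there is no dot to split on
try:
    module_file, class_name = broken_reference.split(".")
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 1), as in the traceback above
```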
The normal approach is to fully include the correct tokenizer files in this repository as well (cc @osanseviero to make sure). So in this PR I've included the bert-base-cased tokenizer, but with `model_max_length` set to 8192 in the `tokenizer_config.json`. This prevents users from having to specify it themselves.
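For anyone reproducing this, files like these can be generated along the following lines; a sketch, not necessarily the exact steps used for this PR:

```python
from transformers import AutoTokenizer

# Load the base tokenizer, overriding its maximum sequence length
# (assumption: this mirrors what the PR bundles; the exact steps may differ)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", model_max_length=8192)

# Writes tokenizer_config.json (with model_max_length set to 8192), vocab.txt,
# tokenizer.json, and special_tokens_map.json into the target directory
tokenizer.save_pretrained("m2-bert-80M-8k-retrieval")
```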
After this PR
The above script now returns:

```
{'input_ids': tensor([[ 101, 4081, 2106,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0]])}
```
🎉
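For a quick sanity check, the padded length can be verified to match the new `model_max_length`; this is a sketch following on from the script above, and the expected shape is an assumption based on `padding="max_length"` with `truncation=True`:

```python
# The sequence should now be padded out to the 8192 set in tokenizer_config.json
print(input_ids["input_ids"].shape)  # expected: torch.Size([1, 8192])
```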
- Tom Aarsen
tomaarsen changed pull request status to open
Thank you for these PRs, they're really helpful! I saw you referenced Slack messages in one of them; I'm trying to find that message so we can chat more!
danfu09 changed pull request status to merged