Allow loading via AutoTokenizer

#3
by tomaarsen - opened

Hello!

Pull Request overview

  • Allow loading a tokenizer via AutoTokenizer.from_pretrained("togethercomputer/m2-bert-80M-8k-retrieval", trust_remote_code=True)

Details

I wanted to load the tokenizer for this model via AutoTokenizer:

from transformers import AutoTokenizer

testing_string = "Every morning, I make a cup of coffee to start my day."

tokenizer = AutoTokenizer.from_pretrained(
    "togethercomputer/m2-bert-80M-8k-retrieval",
    trust_remote_code=True,
)
input_ids = tokenizer(
    [testing_string],
    return_tensors="pt",
    padding="max_length",
    return_token_type_ids=False,
    truncation=True,
)
print(input_ids)

But that isn't possible, I'm afraid:

You are using a model of type m2_bert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "c:\code\m2-bert-80M-8k-retrieval\demo.py", line 6, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 710, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\transformers\dynamic_module_utils.py", line 480, in get_class_from_dynamic_module
    module_file, class_name = class_reference.split(".")
    ^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 1)

The reason is that this line in the config.json does not work: https://huggingface.co/togethercomputer/m2-bert-80M-8k-retrieval/blob/main/config.json#L12
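The failing line boils down to transformers expecting each auto_map entry to be of the form "module_file.ClassName", so a value without a dot cannot be unpacked into two parts. A minimal reproduction of just that unpacking step (the value "bert" here is only a stand-in for a malformed entry):

```python
# transformers' get_class_from_dynamic_module splits an auto_map entry
# into a module file and a class name. A value without a "." fails:
class_reference = "bert"  # stand-in for a malformed entry lacking the "module.Class" form
try:
    module_file, class_name = class_reference.split(".")
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 1)
```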

This is two-fold:

  1. The AutoTokenizer entry in the auto_map expects a pair: the first element should be a "slow" tokenizer class, and the second a "fast" tokenizer class.
  2. The values there should be tokenizer classes (in "module_file.ClassName" form), not model names.
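For reference, a well-formed auto_map entry would look roughly like the sketch below; the module and class names are hypothetical and only illustrate the expected "module_file.ClassName" pair of (slow, fast) tokenizer classes:

```json
{
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_m2_bert.M2BertTokenizer",
      "tokenization_m2_bert.M2BertTokenizerFast"
    ]
  }
}
```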

The usual approach is to include the correct tokenizer files directly in this repository as well (cc: @osanseviero to make sure). So in this PR I've included the bert-base-cased tokenizer, but with model_max_length set to 8192 in the tokenizer_config.json. This saves users from having to specify it themselves.
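Concretely, the relevant part of the tokenizer_config.json looks roughly like this (other fields, such as the tokenizer class and special tokens inherited from bert-base-cased, are omitted here):

```json
{
  "model_max_length": 8192
}
```

With this in place, padding="max_length" and truncation=True in the script above automatically use the 8192-token limit.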

After this PR

The above script now returns:

{'input_ids': tensor([[ 101, 4081, 2106,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0]])}

🎉

  • Tom Aarsen
tomaarsen changed pull request status to open
Together org

Thank you for these PRs, they’re really helpful! I saw you referenced Slack messages in one of them, trying to find that message to chat more!

danfu09 changed pull request status to merged
