Allow loading via AutoTokenizer
#3
by tomaarsen - opened
Hello!
Pull Request overview
- Allow loading a tokenizer via `AutoTokenizer.from_pretrained("togethercomputer/m2-bert-80M-8k-retrieval", trust_remote_code=True)`
Details
I wanted to load the tokenizer for this model via `AutoTokenizer`:
```python
from transformers import AutoTokenizer

testing_string = "Every morning, I make a cup of coffee to start my day."

tokenizer = AutoTokenizer.from_pretrained(
    "togethercomputer/m2-bert-80M-8k-retrieval",
    trust_remote_code=True,
)

input_ids = tokenizer(
    [testing_string],
    return_tensors="pt",
    padding="max_length",
    return_token_type_ids=False,
    truncation=True,
)
print(input_ids)
```
But that isn't possible, I'm afraid:
```
You are using a model of type m2_bert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "c:\code\m2-bert-80M-8k-retrieval\demo.py", line 6, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 710, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\transformers\dynamic_module_utils.py", line 480, in get_class_from_dynamic_module
    module_file, class_name = class_reference.split(".")
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 1)
```
The root cause is this line in the config.json: https://huggingface.co/togethercomputer/m2-bert-80M-8k-retrieval/blob/main/config.json#L12

The problem is two-fold:
- The `AutoTokenizer` entry in the `auto_map` expects a tuple: the first element should be a "slow" tokenizer class, and the second a "fast" tokenizer class.
- The values there should be class references, not model names (see the sketch after this list).
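For context: `transformers` resolves such a reference by splitting it on a dot into a module file and a class name, which is exactly where the traceback above fails. A minimal sketch of the difference; the `tokenization_m2_bert.M2BertTokenizer` reference and the model name are both purely illustrative, not taken from the actual config:

```python
# transformers expects an auto_map reference of the form "module.ClassName"
valid_reference = "tokenization_m2_bert.M2BertTokenizer"  # hypothetical class reference
module_file, class_name = valid_reference.split(".")  # OK: splits into two values

broken_reference = "bert-base-uncased"  # a model name: there is no dot to split on
try:
    module_file, class_name = broken_reference.split(".")
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 1), as in the traceback above
```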
The normal approach is to fully include the correct tokenizer files in this repository as well (cc @osanseviero to make sure). So in this PR I've included the bert-base-cased tokenizer, but with `model_max_length` set to 8192 in the `tokenizer_config.json`. This prevents users from having to specify it themselves.
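For anyone reproducing this, files like these can be generated along the following lines; a sketch, not necessarily the exact steps used for this PR:

```python
from transformers import AutoTokenizer

# Load the base tokenizer, overriding its maximum sequence length
# (assumption: this mirrors what the PR bundles; the exact steps may differ)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", model_max_length=8192)

# Writes tokenizer_config.json (with model_max_length set to 8192), vocab.txt,
# tokenizer.json, and special_tokens_map.json into the target directory
tokenizer.save_pretrained("m2-bert-80M-8k-retrieval")
```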
After this PR
The above script now returns:

```
{'input_ids': tensor([[ 101, 4081, 2106,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0]])}
```
🎉
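For a quick sanity check, the padded length can be verified to match the new `model_max_length`; this is a sketch following on from the script above, and the expected shape is an assumption based on `padding="max_length"` with `truncation=True`:

```python
# The sequence should now be padded out to the 8192 set in tokenizer_config.json
print(input_ids["input_ids"].shape)  # expected: torch.Size([1, 8192])
```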
- Tom Aarsen
tomaarsen changed pull request status to open
Thank you for these PRs, they're really helpful! I saw you referenced Slack messages in one of them; I'm trying to find that message so we can chat more!
danfu09 changed pull request status to merged