Fix tokenizer reloading

#42

Purpose

  • Fixes a bug where the processor cannot be saved to disk and then loaded again
    • Not all tokenizer kwargs are passed to the parent class, PreTrainedTokenizerBase. As a result, some tokenizer kwargs never appear in self.init_kwargs and are therefore not written out by PreTrainedTokenizerBase.save_pretrained

Related Issues

Changes

  • Pass encode_special_tokens and image_size kwargs into super().__init__

Testing

  • Confirmed that newly written tokenizer_config.json contains the image_size and encode_special_tokens fields which were previously missing
    from transformers import AutoTokenizer

    # Confirm the kwargs survive a save/load round trip
    processor = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b")
    assert processor.image_size is not None

    processor.save_pretrained("test")
    processor = AutoTokenizer.from_pretrained("test")
    assert processor.image_size is not None
kylesayrs changed pull request status to open
zRzRzRzRzRzRzR changed pull request status to merged