Fix tokenizer reloading
#42
opened by kylesayrs
Purpose
- Fixes a bug where the processor cannot be saved to disk and then loaded again
- Not all tokenizer kwargs are passed to the parent class, `PreTrainedTokenizerBase`. This means that some tokenizer kwargs are not in `self.init_kwargs`, which results in them not being saved by `PreTrainedTokenizerBase.save_pretrained`
Related Issues
Changes
- Pass the `encode_special_tokens` and `image_size` kwargs into `super().__init__`
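The mechanism behind the fix can be sketched without transformers: the base class records whatever kwargs reach its `__init__` and serializes exactly that set, so any kwarg a subclass consumes without forwarding silently disappears on save. The classes below are stand-ins (not the real `PreTrainedTokenizerBase`), kept minimal to show only this behavior:

```python
# Sketch of why kwargs must be forwarded to the parent __init__.
# FakeTokenizerBase stands in for PreTrainedTokenizerBase: it records the
# kwargs it receives in init_kwargs, which is what gets serialized on save.
import json


class FakeTokenizerBase:
    def __init__(self, **kwargs):
        self.init_kwargs = dict(kwargs)  # what save_pretrained would write

    def save_pretrained(self):
        return json.dumps(self.init_kwargs)  # stand-in for tokenizer_config.json


class BuggyTokenizer(FakeTokenizerBase):
    def __init__(self, image_size=None, encode_special_tokens=False, **kwargs):
        self.image_size = image_size
        self.encode_special_tokens = encode_special_tokens
        super().__init__(**kwargs)  # image_size never reaches init_kwargs


class FixedTokenizer(FakeTokenizerBase):
    def __init__(self, image_size=None, encode_special_tokens=False, **kwargs):
        self.image_size = image_size
        self.encode_special_tokens = encode_special_tokens
        super().__init__(
            image_size=image_size,
            encode_special_tokens=encode_special_tokens,
            **kwargs,
        )  # forwarded, so the fields survive a save/load round trip


buggy = json.loads(BuggyTokenizer(image_size=1120).save_pretrained())
fixed = json.loads(FixedTokenizer(image_size=1120).save_pretrained())
assert "image_size" not in buggy          # dropped by the old code path
assert fixed["image_size"] == 1120        # persisted by the fixed code path
```

With the buggy subclass the saved config is empty, so reloading reconstructs the tokenizer without `image_size`; forwarding the kwargs makes the round trip lossless.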
Testing
- Confirmed that the newly written `tokenizer_config.json` contains the `image_size` and `encode_special_tokens` fields, which were previously missing
```python
from transformers import AutoTokenizer

processor = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
assert processor.image_size is not None

processor.save_pretrained("test")
processor = AutoTokenizer.from_pretrained("test", trust_remote_code=True)
assert processor.image_size is not None
```
kylesayrs changed pull request status to open
zRzRzRzRzRzRzR changed pull request status to merged