cyrilzhang/gpt2-numfix

---
license: mit
---

## GPT-2 Tokenizer with unmerged digits

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
```

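A fork of the GPT-2 tokenizer that removes multi-digit tokens, so numbers are encoded one digit at a time:
```python
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')  # original GPT-2 tokenizer, for comparison
tokenizer('123.45')['input_ids'] # [16, 17, 18, 13, 19, 20]
gpt2_tokenizer('123.45')['input_ids'] # [10163, 13, 2231]
```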
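
The digit-level encoding shown above can be sanity-checked without downloading anything. The sketch below assumes that the digits `0` to `9` occupy the consecutive IDs 15 to 24 and that `.` has ID 13, which is consistent with the example output. `expected_digit_ids` is a hypothetical helper, not part of `transformers` or of this repository.

```python
def expected_digit_ids(number_text):
    """Return the IDs a digit-level GPT-2 vocabulary should give a plain decimal string.

    Assumption: '0'..'9' map to IDs 15..24 and '.' maps to ID 13.
    """
    ids = []
    for ch in number_text:
        if ch in "0123456789":
            ids.append(15 + int(ch))  # one token per digit
        elif ch == ".":
            ids.append(13)
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return ids


print(expected_digit_ids("123.45"))  # [16, 17, 18, 13, 19, 20]
```

Comparing this list with the `input_ids` returned by the tokenizer is a quick way to confirm that a number was split into one token per character.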

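Backwards-compatible: token IDs keep their GPT-2 meaning, except that the IDs of removed multi-digit tokens decode to `<unused…>` placeholders:
```python
tokenizer.decode([10163, 46387]) # '<unused123> pigeon'
gpt2_tokenizer.decode([10163, 46387]) # '123 pigeon'
```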
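
One way a fork like this could be built is to rename every multi-digit vocabulary entry to a placeholder while keeping its ID, and to drop the BPE merges that would produce such entries. The sketch below is a guess at the general approach on a toy vocabulary, not the author's confirmed procedure. `unmerge_digits`, the toy merge list, and the `<unused…>` naming rule (inferred from the single decode example above) are all assumptions.

```python
ASCII_DIGITS = set("0123456789")


def digit_count(token):
    """Count the ASCII digits in a token string."""
    return sum(ch in ASCII_DIGITS for ch in token)


def unmerge_digits(vocab, merges):
    """Rename multi-digit tokens to placeholders and drop the merges that create them.

    vocab:  dict mapping token string -> ID
    merges: list of (left, right) BPE merge pairs
    Every ID is preserved, so existing ID sequences still decode.
    """
    new_vocab = {}
    for token, idx in vocab.items():
        if digit_count(token) >= 2:
            new_vocab[f"<unused{token}>"] = idx  # hypothetical naming rule
        else:
            new_vocab[token] = idx
    new_merges = [(a, b) for a, b in merges if digit_count(a + b) < 2]
    return new_vocab, new_merges


# Toy data for illustration only; the IDs are the ones quoted in the examples above.
vocab = {"1": 16, "2": 17, "3": 18, ".": 13, "4": 19, "5": 20,
         "123": 10163, "45": 2231, "Ġpigeon": 46387}
merges = [("4", "5"), ("Ġpige", "on")]

new_vocab, new_merges = unmerge_digits(vocab, merges)
id_to_token = {idx: token for token, idx in new_vocab.items()}
print(id_to_token[10163])  # <unused123>
print(new_merges)          # [('Ġpige', 'on')]
```

Because no ID is reassigned, the vocabulary size stays the same, and a model trained with the original IDs would still see every non-numeric token unchanged.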

- This tokenizer supports my investigations into the arithmetic capabilities of large language models. This repository contains no model, only a tokenizer.
- [PaLM](https://arxiv.org/abs/2204.02311) also splits numbers into individual digit tokens.
- Many models (notably [GPT-3](https://arxiv.org/abs/2005.14165)) use the GPT-2 tokenizer, which keeps multi-digit tokens.