|
# bpetokenizer
|
|
|
|
A Byte Pair Encoding (BPE) tokenizer that algorithmically follows the GPT tokenizer. It handles special tokens, uses a customizable regex pattern for tokenization (the GPT-4 split pattern is included), and supports saving and loading tokenizers in both `json` and `file` formats.
|
|
|
|
|
|
### Overview
|
|
|
|
The Byte Pair Encoding (BPE) algorithm is a simple yet powerful method for building a vocabulary of subword units from a text corpus. You can use this package to train a tokenizer for your LLM on text corpora in various languages.
|
|
|
|
The algorithm was first applied to subword tokenization in the paper [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909) and later used in the GPT-2 tokenizer ([Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)).
|
|
|
|
The [notebook](notebooks/tokenization.ipynb) walks through the BPE algorithm in detail and shows how the tokenizers work internally.
|
|
|
|
Every LLM (LLaMA, Gemini, Mistral, ...) uses its own tokenizer, trained on its own text dataset.
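
At its core, BPE repeatedly finds the most frequent pair of adjacent token ids and merges it into a new id. The snippet below is a minimal, self-contained sketch of that loop for illustration only; it mirrors the idea behind the `get_stats` and `merge` helpers described later, but it is not the package's implementation.

```py
# Minimal BPE sketch for illustration only -- not the bpetokenizer implementation.

def get_stats(ids):
    """Count occurrences of each adjacent pair of token ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
num_merges = 3
for new_id in range(256, 256 + num_merges):
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)  # most frequent adjacent pair
    ids = merge(ids, top_pair, new_id)
    print(f"merged {top_pair} -> {new_id}, sequence length {len(ids)}")
```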
|
|
|
|
|
|
### Features
|
|
|
|
- Implements Byte Pair Encoding (BPE) algorithm.
|
|
- Handles special tokens.
|
|
- Uses a customizable regex pattern for tokenization.
|
|
- Compatible with Python 3.9 and above
|
|
|
|
|
|
#### This repository provides two tokenizers:
|
|
- `BPETokenizer`
|
|
- `Tokenizer`
|
|
|
|
1. [Tokenizer](bpetokenizer/base.py): The base class. It provides `train`, `encode`, and `decode`, plus `save` and `load` functionality. It also contains the helper functions `get_stats`, `merge`, and `replace_control_characters` used to perform the BPE algorithm.
|
|
|
|
2. [BPETokenizer](bpetokenizer/tokenizer.py): The class that shows the real power of the tokenizer, in the spirit of the GPT-4 tokenizer ([tiktoken](https://github.com/openai/tiktoken)). It uses the `GPT4_SPLIT_PATTERN` to split text as in the GPT-4 tokenizer (see the sketch after this list), handles `special_tokens` (refer to [sample_bpetokenizer](sample/bpetokenizer/sample_bpetokenizer.py)), and inherits the `save` and `load` functionality from the base class.
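
To get a feel for what the split pattern does before any merges happen, the sketch below breaks a string into the chunks that BPE is then applied to chunk by chunk. The regex string shown is the GPT-4 split pattern as published in minbpe/tiktoken, reproduced here purely for illustration; the constant shipped in this package as `GPT4_SPLIT_PATTERN` may be defined slightly differently. It needs the third-party `regex` module, since the pattern uses possessive quantifiers that the standard `re` module does not support.

```py
import regex  # third-party `regex` module (pip install regex)

# GPT-4 split pattern as published in minbpe/tiktoken; shown here for illustration.
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

text = "Hello, World! It's the tokenizer."
chunks = regex.findall(GPT4_SPLIT_PATTERN, text)
print(chunks)
# ['Hello', ',', ' World', '!', ' It', "'s", ' the', ' tokenizer', '.']
```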
|
|
|
|
|
|
### Usage
|
|
|
|
This walkthrough demonstrates how to use `special_tokens` with the tokenizer.
|
|
|
|
Install the package
|
|
|
|
```shell
|
|
pip install bpetokenizer
|
|
```
|
|
|
|
|
|
```py
|
|
from bpetokenizer import BPETokenizer
|
|
|
|
special_tokens = {
|
|
"<|endoftext|>": 1001,
|
|
"<|startoftext|>": 1002,
|
|
"[SPECIAL1]": 1003,
|
|
"[SPECIAL2]": 1004,
|
|
}
|
|
|
|
tokenizer = BPETokenizer(special_tokens=special_tokens)  # you can also register the special tokens with the _special_tokens method if they were not passed when initializing
|
|
texts = "<|startoftext|> Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.<|endoftext|>"
|
|
|
|
tokenizer.train(texts, vocab_size=310, verbose=True)
|
|
# tokenizer._special_tokens(special_tokens)  # use this if special tokens were not passed when initializing the BPETokenizer
|
|
|
|
encode_text = """
|
|
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
|
|
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
|
|
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.
|
|
Hello, World! This is yet another sample text, with [SPECIAL1] and [SPECIAL2] making an appearance.
|
|
Hey there, World! Testing the tokenizer with [SPECIAL1] and [SPECIAL2] to see if it handles special tokens properly.
|
|
Salutations, Planet! The tokenizer should recognize [SPECIAL1] and [SPECIAL2] in this long string of text.
|
|
Hello again, World! [SPECIAL1] and [SPECIAL2] are special tokens that need to be handled correctly by the tokenizer.
|
|
Welcome, World! Including [SPECIAL1] and [SPECIAL2] multiple times in this large text to ensure proper encoding.
|
|
Hi, World! Let's add [SPECIAL1] and [SPECIAL2] in various parts of this long sentence to test the tokenizer thoroughly.
|
|
<|endoftext|>
|
|
"""
|
|
ids = tokenizer.encode(encode_text, special_tokens="all")
|
|
print(ids)
|
|
|
|
decode_text = tokenizer.decode(ids)
|
|
print(decode_text)
|
|
|
|
tokenizer.save("sample_bpetokenizer", mode="json")  # mode defaults to "file"
|
|
```
|
|
|
|
Refer to [sample_bpetokenizer](sample/bpetokenizer) to see the `vocab` and `model` files of a tokenizer trained on the text above.
|
|
|
|
|
|
#### To Load the Tokenizer
|
|
|
|
```py
|
|
from bpetokenizer import BPETokenizer
|
|
|
|
tokenizer = BPETokenizer()
|
|
|
|
tokenizer.load("sample_bpetokenizer.json", mode="json")
|
|
|
|
encode_text = """
|
|
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
|
|
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
|
|
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.<|endoftext|>"""
|
|
|
|
print("vocab: ", tokenizer.vocab)
|
|
print('---')
|
|
print("merges: ", tokenizer.merges)
|
|
print('---')
|
|
print("special tokens: ", tokenizer.special_tokens)
|
|
|
|
ids = tokenizer.encode(encode_text, special_tokens="all")
|
|
print('---')
|
|
print(ids)
|
|
|
|
decode_text = tokenizer.decode(ids)
|
|
print('---')
|
|
print(decode_text)
|
|
|
|
# you can also print the tokens and the text chunks that the pattern splits out.
|
|
tokens = tokenizer.tokens(encode_text, verbose=True)  # if verbose, also prints the text chunks and the pattern used to split them
|
|
print('---')
|
|
print("tokens: ", tokens)
|
|
|
|
```
|
|
Refer to [load_json_vocab](sample/load_json_vocab/) and run `bpetokenizer_json` for an overview of the `vocab`, `merges`, and `special_tokens`. To view the tokens that the tokenizer splits out using the pattern, look at [tokens](sample/load_json_vocab/tokens.py).
|
|
|
|
### Run Tests
|
|
|
|
The `tests/` folder contains the tokenizer tests, which use pytest.
|
|
|
|
```shell
|
|
python3 -m pytest
|
|
```
|
|
|
|
Additionally, the CI workflows are set up to run the tests whenever a PR is opened.
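
If you add a feature, a small round-trip test in the following style is usually enough to exercise it. This is an illustrative sketch, not one of the repository's existing test files; the training text, vocab size, and file name are made up for the example.

```py
# test_roundtrip.py -- illustrative sketch of a round-trip test, not an existing test file
from bpetokenizer import BPETokenizer

def test_encode_decode_roundtrip():
    special_tokens = {"<|endoftext|>": 1001}
    tokenizer = BPETokenizer(special_tokens=special_tokens)
    # small corpus and vocab size chosen only to keep the test fast
    tokenizer.train("hello world, hello tokenizer, hello tests", vocab_size=260)
    text = "hello world <|endoftext|>"
    ids = tokenizer.encode(text, special_tokens="all")
    # decoding the encoded ids should reproduce the original text exactly
    assert tokenizer.decode(ids) == text
```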
|
|
|
|
|
|
### Contributing
|
|
|
|
Contributions to the BPE Tokenizer are most welcome! If you would like to contribute, please follow these steps:
|
|
|
|
- Star and Fork the repository.
|
|
- Create a new branch (`git checkout -b feature/your-feature`).
|
|
- Commit your changes (`git commit -am 'Add some feature'`).
|
|
- Push to the branch (`git push origin feature/your-feature`).
|
|
- Create a new Pull Request.
|
|
|
|
Please ensure your code follows the project's coding standards and includes appropriate tests. Also, update the documentation as necessary.
|
|
|
|
|
|
### License
|
|
|
|
This project is licensed under the MIT License.
|
|
|
|
----
|
|
|
|
*This tokenizer is inspired by [minbpe](https://github.com/karpathy/minbpe), but more optimized.*