---
tags:
- greek
- tokenization
- bpe
license: mit
language:
- el
---
# Greek Tokenizer
A tokenizer trained from scratch with the BPE (byte-pair encoding) algorithm on a Greek corpus.
### Usage:
To use this tokenizer, you can load it from the Hugging Face Hub:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
```
### Example:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
# Tokenize input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")
# Print the token IDs
print("Token IDs:", inputs["input_ids"].tolist())
# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)
# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```
It can also serve as a head start for pretraining a GPT-2 base model on Greek text, as sketched below.
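A minimal sketch of that setup, using the standard `transformers` GPT-2 classes; the token-ID mappings below are illustrative assumptions (this card does not prescribe a pretraining recipe):
```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

config = GPT2Config(
    vocab_size=tokenizer.vocab_size,      # matches this tokenizer (52000)
    bos_token_id=tokenizer.cls_token_id,  # assumption: reuse [CLS] as BOS
    eos_token_id=tokenizer.sep_token_id,  # assumption: reuse [SEP] as EOS
    pad_token_id=tokenizer.pad_token_id,
)
model = GPT2LMHeadModel(config)  # randomly initialized, ready for pretraining
```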
## Training Details
- Vocabulary size: 52,000
- Special tokens: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
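Both values can be checked directly on the loaded tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
print(tokenizer.vocab_size)          # expected: 52000
print(tokenizer.all_special_tokens)  # expected: [PAD], [UNK], [CLS], [SEP], [MASK]
```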
## Benefits and Why to Use:
Many generic tokenizers split Greek words into multiple tokens. This tokenizer is more efficient: it only splits words it was not trained on.
In the example above, the output of this tokenizer is only five tokens, while another tokenizer, e.g. *Llama-3*, produces nine or more.
This can have a real impact on inference costs and downstream applications.
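For a rough side-by-side comparison, the snippet below counts tokens from this tokenizer and an openly available baseline (`gpt2` is used here only as a stand-in, since the Llama-3 tokenizer requires gated access):
```python
from transformers import AutoTokenizer

greek_tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
baseline = AutoTokenizer.from_pretrained("gpt2")  # arbitrary open baseline

text = "Αυτό είναι ένα παράδειγμα."
print("Greek_Tokenizer:", len(greek_tokenizer.tokenize(text)), "tokens")
print("gpt2 baseline:  ", len(baseline.tokenize(text)), "tokens")
```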
## Update
July 2024:
A new version is imminent that further decreases fertility (the average number of tokens produced per word) for both Greek and English.
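For reference, fertility can be estimated as tokens per whitespace-delimited word; a minimal sketch on placeholder sentences (not an official benchmark):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Placeholder sentences; swap in a real held-out corpus for meaningful numbers.
corpus = ["Αυτό είναι ένα παράδειγμα.", "Καλημέρα σε όλους."]
n_tokens = sum(len(tokenizer.tokenize(s)) for s in corpus)
n_words = sum(len(s.split()) for s in corpus)
print("Fertility (tokens per word):", n_tokens / n_words)
```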