---
tags:
- greek
- tokenization
- bpe
license: mit
language:
- el
---
# Greek Tokenizer

A tokenizer trained from scratch with the BPE (byte-pair encoding) algorithm on a Greek corpus.

### Usage:

To use this tokenizer, load it from the Hugging Face Hub:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
```

### Example:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Tokenize input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")

# Print the tokenized input (IDs)
print("Token IDs:", inputs["input_ids"].tolist())

# Convert token IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```
It can also serve as a head start for pretraining a GPT-2 base model on Greek text.
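
As a rough, illustrative sketch (not part of this repository), a fresh GPT-2 configuration could be sized to match this tokenizer's vocabulary; the choice of reusing [CLS] and [SEP] as BOS and EOS tokens below is an assumption, not something stated in this card:

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

# Illustrative sketch only: size a fresh GPT-2 model to this tokenizer's vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

config = GPT2Config(
    vocab_size=len(tokenizer),            # 52,000 according to this card
    bos_token_id=tokenizer.cls_token_id,  # assumption: reuse [CLS] / [SEP] as BOS / EOS
    eos_token_id=tokenizer.sep_token_id,
)
model = GPT2LMHeadModel(config)
print("Parameters:", model.num_parameters())
```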
## Training Details

Vocabulary Size: 52,000

Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
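
For reference, below is a minimal sketch of how a BPE tokenizer with this vocabulary size and these special tokens could be trained with the Hugging Face `tokenizers` library. The whitespace pre-tokenizer and the corpus file name are placeholders; this is not necessarily the exact recipe used for this tokenizer:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical recipe: "greek_corpus.txt" is a placeholder file name.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=52000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["greek_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```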
## Benefits and Why to Use

Many generic tokenizers split Greek words into multiple tokens. This tokenizer is more efficient: it only splits words it was not trained on. In the example above, its output is only five tokens, while another tokenizer (e.g. Llama-3) produces nine or more. This can have an impact on inference costs and downstream applications.
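
A quick way to check this on your own text is to count tokens with this tokenizer and with any other tokenizer you want to compare against; in the sketch below, "gpt2" is only a freely available stand-in for the second tokenizer, not the one used for the comparison above:

```python
from transformers import AutoTokenizer

text = "Αυτό είναι ένα παράδειγμα."

greek_tok = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
# "gpt2" is a stand-in; swap in any tokenizer you want to compare against.
other_tok = AutoTokenizer.from_pretrained("gpt2")

print("Greek tokenizer:", len(greek_tok.tokenize(text)), "tokens")
print("Other tokenizer:", len(other_tok.tokenize(text)), "tokens")
```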
## Update

July 2024:

A new version is imminent that further decreases fertility (the average number of tokens produced per word) for both Greek and English.
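
A rough way to estimate fertility on a sample sentence is shown below; naive whitespace word splitting is used as a simplification and may differ from the evaluation setup behind the update above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Rough fertility estimate: tokens per whitespace-separated word on a sample sentence.
sample = "Αυτό είναι ένα παράδειγμα."
fertility = len(tokenizer.tokenize(sample)) / len(sample.split())
print("Fertility:", fertility)
```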