Model Card for tokenizer_BPE_en_el
English & Greek Tokenizer trained from scratch
Direct Use
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/tokenizer_BPE_en_el")
# Tokenize input text
input_text = "This is a game"
inputs = tokenizer(input_text, return_tensors="pt")
# Print the tokenized input (IDs and tokens)
print("Token IDs:", inputs["input_ids"].tolist())
# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)
# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
# Output:
Token IDs: [[2967, 317, 220, 1325]]
Tokens: ['This', 'Ġis', 'Ġa', 'Ġgame']
Tokenized String: This Ġis Ġa Ġgame
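The `Ġ` prefix in the tokens above is not garbage: in byte-level BPE (the GPT-2 convention), every byte is mapped to a printable Unicode character, and the space byte (0x20) maps to `Ġ` (U+0120), so `Ġis` simply means "is preceded by a space". A minimal sketch of that byte-to-unicode mapping (reconstructed from the GPT-2 convention, not copied from this tokenizer's code):

```python
def bytes_to_unicode():
    # GPT-2 style byte-to-unicode table: printable bytes map to themselves,
    # all remaining bytes are shifted into a printable range starting at 256.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift non-printable bytes past the Latin-1 range
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # 'Ġ' — the space byte rendered as a visible token prefix
```

This is why `tokenizer.decode(...)` restores normal spaces even though the printed tokens carry `Ġ` markers.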
Recommendations
When tokenizing Greek text, the printed tokens may look like gibberish. This is an artifact of the byte-level representation, not data corruption, and it does not affect downstream model pretraining.
(An improved version of this tokenizer, without the gibberish-looking Greek tokens, can be found here: gsar78/Greek_Tokenizer)
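The round-trip behavior can be checked without downloading this tokenizer. The sketch below trains a tiny byte-level BPE tokenizer with the `tokenizers` library on a toy bilingual corpus (this is an illustration of the phenomenon, not this tokenizer's actual training recipe): the Greek tokens print as mojibake-like strings, yet decoding the IDs reproduces the original text exactly.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Toy byte-level BPE tokenizer; vocab_size and corpus are illustrative only.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=500,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
corpus = ["This is a game", "Αυτό είναι ένα παιχνίδι"] * 10
tokenizer.train_from_iterator(corpus, trainer=trainer)

text = "Αυτό είναι ένα παιχνίδι"
enc = tokenizer.encode(text)
print(enc.tokens)                 # Greek tokens look like gibberish (byte-level view)
print(tokenizer.decode(enc.ids))  # decodes back to the original Greek text
```

Because each Greek character is stored as its UTF-8 bytes remapped to printable characters, the token strings are unreadable, but the ID sequence losslessly encodes the input.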
Can be used as a good starting point for pretraining a GPT-based model, or any other model that uses BPE tokenization.