---
tags:
- greek
- tokenization
- bpe
license: mit
language:
- el
---
# Greek Tokenizer
A tokenizer trained from scratch with the BPE (byte-pair encoding) algorithm on a Greek corpus.
### Usage:
To use this tokenizer, you can load it from the Hugging Face Hub:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
```
### Example:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
# Tokenize input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")
# Print the token IDs
print("Token IDs:", inputs["input_ids"].tolist())
# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)
# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```
It can also serve as a head start for pretraining a GPT-2 base model on Greek text, as sketched below.
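A minimal sketch of that setup, using the standard `transformers` GPT-2 classes; the token-ID mappings below are illustrative assumptions (this card does not prescribe a pretraining recipe):
```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

config = GPT2Config(
    vocab_size=tokenizer.vocab_size,      # matches this tokenizer (52000)
    bos_token_id=tokenizer.cls_token_id,  # assumption: reuse [CLS] as BOS
    eos_token_id=tokenizer.sep_token_id,  # assumption: reuse [SEP] as EOS
    pad_token_id=tokenizer.pad_token_id,
)
model = GPT2LMHeadModel(config)  # randomly initialized, ready for pretraining
```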
## Training Details
- Vocabulary size: 52,000
- Special tokens: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
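Both values can be checked directly on the loaded tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
print(tokenizer.vocab_size)          # expected: 52000
print(tokenizer.all_special_tokens)  # expected: [PAD], [UNK], [CLS], [SEP], [MASK]
```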
## Benefits and Why to Use:
Many generic tokenizers split Greek words into multiple tokens. This tokenizer is more efficient: it only splits words it was not trained on.
In the example above, the output of this tokenizer is only five tokens, while another tokenizer, e.g. *Llama-3*, produces nine or more.
This can have a real impact on inference costs and downstream applications.
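For a rough side-by-side comparison, the snippet below counts tokens from this tokenizer and an openly available baseline (`gpt2` is used here only as a stand-in, since the Llama-3 tokenizer requires gated access):
```python
from transformers import AutoTokenizer

greek_tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
baseline = AutoTokenizer.from_pretrained("gpt2")  # arbitrary open baseline

text = "Αυτό είναι ένα παράδειγμα."
print("Greek_Tokenizer:", len(greek_tokenizer.tokenize(text)), "tokens")
print("gpt2 baseline:  ", len(baseline.tokenize(text)), "tokens")
```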
## Update
July 2024:
A new version is imminent that further decreases fertility (the average number of tokens produced per word) for both Greek and English.
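For reference, fertility can be estimated as tokens per whitespace-delimited word; a minimal sketch on placeholder sentences (not an official benchmark):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Placeholder sentences; swap in a real held-out corpus for meaningful numbers.
corpus = ["Αυτό είναι ένα παράδειγμα.", "Καλημέρα σε όλους."]
n_tokens = sum(len(tokenizer.tokenize(s)) for s in corpus)
n_words = sum(len(s.split()) for s in corpus)
print("Fertility (tokens per word):", n_tokens / n_words)
```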