|
---
license: cc-by-nc-sa-4.0
datasets:
- OdiaGenAIdata/pre_train_odia_data_processed
language:
- or
---
|
|
|
# Odia SentencePiece Tokenizer Model |
|
|
|
This repository hosts a SentencePiece tokenizer model for the Odia language, built to support efficient tokenization of Odia text in NLP applications. The tokenizer was trained on a diverse corpus of Odia text for broad coverage of the language.
|
|
|
## Model Details |
|
|
|
- **Model Prefix**: `odia_tokenizers_test` |
|
- **Model Type**: BPE (Byte-Pair Encoding) |
|
- **Vocabulary Size**: 50,000 tokens |
|
|
|
## File Structure |
|
|
|
- **`odia_tokenizers_test.model`**: SentencePiece tokenizer model file.
- **`odia_tokenizers_test.vocab`**: Vocabulary file listing every token with its score; the snippet below shows how to inspect it.
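
The `.vocab` file is plain text with one token per line, tab-separated from its score, so it can be inspected without loading the model. A minimal sketch, assuming the file has been downloaded next to the script:

```python
# Peek at the first few entries of the SentencePiece vocabulary file.
# Each line has the form "<piece>\t<score>".
with open("odia_tokenizers_test.vocab", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 10:
            break
        piece, score = line.rstrip("\n").split("\t")
        print(i, piece, score)
```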
|
|
|
## Installation and Usage |
|
|
|
To download and load the tokenizer, make sure both the `sentencepiece` and `huggingface_hub` packages are installed:

```bash
pip install sentencepiece huggingface_hub
```
|
|
|
```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the model file from Hugging Face
model_path = hf_hub_download(repo_id="shantipriya/OdiaTokenizer", filename="odia_tokenizers_test.model")

# Load the tokenizer model
sp = spm.SentencePieceProcessor()
sp.load(model_path)

# Sample text for tokenization
text = "ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।"

# Tokenize the text into pieces (subwords or tokens)
tokens = sp.encode_as_pieces(text)

# Tokenize the text into token IDs (integer representations of the tokens)
token_ids = sp.encode_as_ids(text)

# Print the tokenized output
print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
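
Continuing from the example above, either representation can be decoded back into text; for ordinary text the round trip is lossless (up to SentencePiece's input normalization):

```python
# Decode back to text from either representation.
print("From pieces:", sp.decode_pieces(tokens))
print("From IDs:", sp.decode_ids(token_ids))
```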
|
|
|
## Sample Tokenization
|
|
|
Here is an example of how the tokenizer segments an Odia sentence:
|
|
|
**Input:** ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ । |
|
|
|
**Tokens:** `['▁ଦୀପାବଳି', '▁ଏକ', '▁ଭାରତୀୟ', '▁ପର୍ବ', '▁।']` |
|
|
|
**Token IDs:** `[1234, 5678, 9101, 12131, 1516]` (illustrative IDs; actual values depend on the trained vocabulary)
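
Because the tokenizer was trained with `byte_fallback` enabled (see the training details below), characters that never appeared in the training data decompose into byte-level pieces such as `<0xF0>` rather than collapsing into a single unknown token. A quick check, reusing `sp` from the usage example; the emoji here is an arbitrary out-of-vocabulary character:

```python
# Out-of-vocabulary characters are split into byte pieces
# (e.g. "<0xF0>") instead of becoming <unk>.
print(sp.encode_as_pieces("ଦୀପାବଳି 🎆"))
```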
|
|
|
## Vocabulary Coverage |
|
|
|
The vocabulary size was chosen to balance memory efficiency with language coverage, making it suitable for applications ranging from language modeling to text classification. |
|
|
|
### Vocabulary Statistics |
|
|
|
- **Total Tokens:** 50,000
- **Average Token Length:** 6.46 characters
- **Max Token Length:** 16 characters
- **Min Token Length:** 1 character
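
Figures like these can be recomputed directly from the loaded model. A minimal sketch, reusing the `sp` processor from the usage example; whether the published numbers count the `▁` word-boundary marker is an assumption here:

```python
# Recompute vocabulary statistics from the loaded model.
pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
lengths = [len(p) for p in pieces]  # lengths include the "▁" marker

print("Total tokens:", len(pieces))
print("Average token length:", round(sum(lengths) / len(lengths), 2))
print("Max token length:", max(lengths))
print("Min token length:", min(lengths))
```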
|
|
|
## Training and Configuration Details |
|
|
|
The tokenizer was trained with the SentencePiece library using the following configuration:

- **Character Coverage:** 99.995% (`character_coverage=0.99995`)
- **Input Sentence Size:** 200 million sentences (`input_sentence_size=200000000`)
- **Maximum Sentence Length:** 4192 bytes (`max_sentence_length=4192`; SentencePiece measures sentence length in bytes, not characters)
|
|
|
**Model Training Parameters:**

- `shuffle_input_sentence=True` (shuffle input sentences before sampling)
- `split_by_unicode_script=True` (split pieces at Unicode script boundaries)
- `split_by_whitespace=True` (pieces never cross whitespace)
- `byte_fallback=True` (unknown characters decompose into byte pieces instead of `<unk>`)
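
Put together, the training invocation would have looked roughly like the sketch below. This is a reconstruction from the settings above, not the original script, and the corpus path is a hypothetical placeholder:

```python
import sentencepiece as spm

# Reconstruction of the training call from the documented settings.
# "odia_corpus.txt" is a placeholder; the original corpus path is not published.
spm.SentencePieceTrainer.train(
    input="odia_corpus.txt",
    model_prefix="odia_tokenizers_test",
    model_type="bpe",
    vocab_size=50000,
    character_coverage=0.99995,
    input_sentence_size=200000000,
    max_sentence_length=4192,
    shuffle_input_sentence=True,
    split_by_unicode_script=True,
    split_by_whitespace=True,
    byte_fallback=True,
)
```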
|
|
|
## Intended Use |
|
|
|
This model is intended for use in various NLP applications involving the Odia language, such as: |
|
|
|
- Language Modeling |
|
- Text Classification |
|
- Named Entity Recognition (NER) |
|
- Translation tasks involving Odia |
|
|
|
## License |
|
|
|
This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
|
|
|
## Acknowledgments |
|
|
|
This model was developed as part of a project to support low-resource language processing. |
|
Thanks to OdiaGenAI for providing the initial training data, which made this model possible. |
|
|
|
## Contributors |
|
|
|
- **Shantipriya Parida** |
|
- **Sambit Sekhar** |
|
- **Sahil Khan** |