|
---
license: cc-by-nc-sa-4.0
datasets:
- OdiaGenAIdata/pre_train_odia_data_processed
language:
- or
---
|
|
|
# Odia SentencePiece Tokenizer Model |
|
|
|
This repository hosts a SentencePiece tokenizer model for the Odia language, built to support efficient tokenization of Odia text in NLP applications. The tokenizer was trained on a diverse corpus of Odia text for broad coverage of the language.
|
|
|
## Model Details |
|
|
|
- **Model Prefix**: `odia_tokenizers_test` |
|
- **Model Type**: BPE (Byte-Pair Encoding) |
|
- **Vocabulary Size**: 50,000 tokens |
|
|
|
## File Structure |
|
|
|
- **`odia_tokenizers_test.model`**: SentencePiece tokenizer model file.
- **`odia_tokenizers_test.vocab`**: Vocabulary file listing every token with its score; the snippet below shows how to inspect it.
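
The `.vocab` file is plain text with one token per line, tab-separated from its score, so it can be inspected without loading the model. A minimal sketch, assuming the file has been downloaded next to the script:

```python
# Peek at the first few entries of the SentencePiece vocabulary file.
# Each line has the form "<piece>\t<score>".
with open("odia_tokenizers_test.vocab", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 10:
            break
        piece, score = line.rstrip("\n").split("\t")
        print(i, piece, score)
```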
|
|
|
## Installation and Usage |
|
|
|
To download and load the tokenizer, make sure both the `sentencepiece` and `huggingface_hub` packages are installed:

```bash
pip install sentencepiece huggingface_hub
```
|
|
|
```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the model file from Hugging Face
model_path = hf_hub_download(repo_id="shantipriya/OdiaTokenizer", filename="odia_tokenizers_test.model")

# Load the tokenizer model
sp = spm.SentencePieceProcessor()
sp.load(model_path)

# Sample text for tokenization
text = "ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।"

# Tokenize the text into pieces (subwords or tokens)
tokens = sp.encode_as_pieces(text)

# Tokenize the text into token IDs (integer representations of the tokens)
token_ids = sp.encode_as_ids(text)

# Print the tokenized output
print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
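
Continuing from the example above, either representation can be decoded back into text; for ordinary text the round trip is lossless (up to SentencePiece's input normalization):

```python
# Decode back to text from either representation.
print("From pieces:", sp.decode_pieces(tokens))
print("From IDs:", sp.decode_ids(token_ids))
```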
|
|
|
## Sample Tokenization
|
|
|
Here is an example of how the tokenizer segments an Odia sentence:
|
|
|
**Input:** ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ । |
|
|
|
**Tokens:** `['▁ଦୀପାବଳି', '▁ଏକ', '▁ଭାରତୀୟ', '▁ପର୍ବ', '▁।']` |
|
|
|
**Token IDs:** `[1234, 5678, 9101, 12131, 1516]` (illustrative IDs; actual values depend on the trained vocabulary)
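
Because the tokenizer was trained with `byte_fallback` enabled (see the training details below), characters that never appeared in the training data decompose into byte-level pieces such as `<0xF0>` rather than collapsing into a single unknown token. A quick check, reusing `sp` from the usage example; the emoji here is an arbitrary out-of-vocabulary character:

```python
# Out-of-vocabulary characters are split into byte pieces
# (e.g. "<0xF0>") instead of becoming <unk>.
print(sp.encode_as_pieces("ଦୀପାବଳି 🎆"))
```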
|
|
|
## Vocabulary Coverage |
|
|
|
The vocabulary size was chosen to balance memory efficiency with language coverage, making it suitable for applications ranging from language modeling to text classification. |
|
|
|
### Vocabulary Statistics |
|
|
|
- **Total Tokens:** 50,000
- **Average Token Length:** 6.46 characters
- **Max Token Length:** 16 characters
- **Min Token Length:** 1 character
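
Figures like these can be recomputed directly from the loaded model. A minimal sketch, reusing the `sp` processor from the usage example; whether the published numbers count the `▁` word-boundary marker is an assumption here:

```python
# Recompute vocabulary statistics from the loaded model.
pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
lengths = [len(p) for p in pieces]  # lengths include the "▁" marker

print("Total tokens:", len(pieces))
print("Average token length:", round(sum(lengths) / len(lengths), 2))
print("Max token length:", max(lengths))
print("Min token length:", min(lengths))
```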
|
|
|
## Training and Configuration Details |
|
|
|
The tokenizer was trained with the SentencePiece library using the following configuration:

- **Character Coverage:** 99.995% (`character_coverage=0.99995`)
- **Input Sentence Size:** 200 million sentences (`input_sentence_size=200000000`)
- **Maximum Sentence Length:** 4192 bytes (`max_sentence_length=4192`; SentencePiece measures sentence length in bytes, not characters)
|
|
|
**Model Training Parameters:**

- `shuffle_input_sentence=True` (shuffle input sentences before sampling)
- `split_by_unicode_script=True` (split pieces at Unicode script boundaries)
- `split_by_whitespace=True` (pieces never cross whitespace)
- `byte_fallback=True` (unknown characters decompose into byte pieces instead of `<unk>`)
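
Put together, the training invocation would have looked roughly like the sketch below. This is a reconstruction from the settings above, not the original script, and the corpus path is a hypothetical placeholder:

```python
import sentencepiece as spm

# Reconstruction of the training call from the documented settings.
# "odia_corpus.txt" is a placeholder; the original corpus path is not published.
spm.SentencePieceTrainer.train(
    input="odia_corpus.txt",
    model_prefix="odia_tokenizers_test",
    model_type="bpe",
    vocab_size=50000,
    character_coverage=0.99995,
    input_sentence_size=200000000,
    max_sentence_length=4192,
    shuffle_input_sentence=True,
    split_by_unicode_script=True,
    split_by_whitespace=True,
    byte_fallback=True,
)
```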
|
|
|
## Intended Use |
|
|
|
This model is intended for use in various NLP applications involving the Odia language, such as: |
|
|
|
- Language Modeling |
|
- Text Classification |
|
- Named Entity Recognition (NER) |
|
- Translation tasks involving Odia |
|
|
|
## License |
|
|
|
This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
|
|
|
## Acknowledgments |
|
|
|
This model was developed as part of a project to support low-resource language processing. |
|
Thanks to OdiaGenAI for providing the initial training data, which made this model possible. |
|
|
|
## Contributors |
|
|
|
- **Shantipriya Parida** |
|
- **Sambit Sekhar** |
|
- **Sahil Khan** |