# Fine-Tuned mBERT for Enhanced Tamil NLP

Optimized with 100K OSCAR Tamil data points.
## Model Overview

This model is a fine-tuned version of Multilingual BERT (mBERT) trained on 100,000 samples from the Tamil subset of the OSCAR corpus. The fine-tuning improves the model's handling of Tamil text, making it suitable for downstream NLP tasks such as text classification, named entity recognition, and sentiment analysis.
## Dataset Details

- Dataset Name: OSCAR (Open Super-large Crawled Aggregated coRpus), Tamil subset
- Size: 100,000 samples
- Preprocessing: text normalization, removal of noisy samples, and tokenization with the mBERT tokenizer (see the loading sketch below)
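The exact cleaning rules are not published with this model; the snippet below is a minimal sketch, assuming the deduplicated Tamil OSCAR split available through the Hugging Face `datasets` library, with a placeholder whitespace-normalization step standing in for the actual noise removal.

```python
import re
from datasets import load_dataset

# Load 100K samples from the deduplicated Tamil subset of OSCAR
# (config names vary across OSCAR releases).
dataset = load_dataset("oscar", "unshuffled_deduplicated_ta", split="train[:100000]")

def clean(example):
    # Placeholder normalization: collapse runs of whitespace.
    # The actual noise-removal rules used for this model are not documented.
    example["text"] = re.sub(r"\s+", " ", example["text"]).strip()
    return example

dataset = dataset.map(clean)
```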
## Model Specifications

- Base Model: `bert-base-multilingual-cased`
- Training: masked-language-model (MLM) fine-tuning on Tamil text
- Tokenizer: the original mBERT WordPiece tokenizer
- Batch Size: tuned for training throughput on the available hardware
- Objective: improve Tamil language representation in mBERT for downstream NLP tasks
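The original training script is not included here; the following is a minimal sketch of MLM fine-tuning with the Transformers `Trainer`, assuming `dataset` is the cleaned corpus from the snippet above. The hyperparameters shown are illustrative assumptions, not the values actually used.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Drop raw columns so only model inputs remain for the collator.
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Standard BERT-style masking: 15% of tokens are selected for prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tamil-mlm",
    per_device_train_batch_size=16,  # illustrative value
    num_train_epochs=3,              # illustrative value
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```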
## Usage
This model can be used for multiple NLP tasks in Tamil, such as:
✅ Text Classification
✅ Named Entity Recognition (NER)
✅ Sentiment Analysis
✅ Question Answering
✅ Sentence Embeddings
## How to Use the Model

To load the model in Python using Hugging Face Transformers, use the following snippet:
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "viswadarshan06/Tamil-MLM"  # Replace with your model path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a sample Tamil sentence
# ("Natural language processing is important in the Tamil language!")
text = "தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"
tokens = tokenizer(text, return_tensors="pt")

# Forward pass without gradient tracking to get contextual embeddings
with torch.no_grad():
    outputs = model(**tokens)

print(outputs.last_hidden_state.shape)  # (batch_size, seq_length, hidden_size)
```
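For the tasks listed under Usage, a task-specific head can be attached on top of the fine-tuned encoder. Below is a minimal text-classification sketch; `num_labels=3` is a hypothetical label count, and the newly added head is randomly initialized, so it must be trained on labeled Tamil data before use.

```python
from transformers import AutoModelForSequenceClassification

# Loads the fine-tuned encoder and attaches a fresh (untrained) classification head.
# num_labels=3 is a hypothetical example; set it to your task's label count.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "viswadarshan06/Tamil-MLM", num_labels=3
)

# Reuses the tokenizer loaded above.
# ("This movie was very good.")
inputs = tokenizer("இந்த திரைப்படம் மிகவும் நன்றாக இருந்தது.", return_tensors="pt")
logits = classifier(**inputs).logits
print(logits.shape)  # (1, num_labels)
```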
## Performance & Evaluation

The model was evaluated on downstream Tamil tasks to validate the improved language representation, and it shows better contextual understanding of Tamil text than the base mBERT model.
## Conclusion

This fine-tuned mBERT model helps close the resource gap in Tamil NLP by combining mBERT's large-scale multilingual pretraining with language-specific MLM fine-tuning, making it a useful resource for researchers and developers building Tamil NLP applications.