Fine-Tuned mBERT for Enhanced Tamil NLP

Optimized with 100K OSCAR Tamil Data Points

Model Overview

This model is a fine-tuned version of Multilingual BERT (mBERT), trained on 100,000 samples from the Tamil subset of the OSCAR corpus to strengthen Tamil language understanding. Fine-tuning improves the model's handling of Tamil text, making it a stronger starting point for downstream NLP tasks such as text classification, named entity recognition, and masked-token prediction.

Dataset Details

  • Dataset Name: OSCAR (Open Super-large Crawled Aggregated coRpus) – Tamil subset
  • Size: 100K samples
  • Preprocessing: Text normalization, tokenization with the mBERT tokenizer, and noise removal for improved data quality (a sketch follows this list).
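
The exact preprocessing pipeline is not published in detail. A minimal sketch of what such a step could look like, assuming NFC Unicode normalization and a simple whitespace-based noise filter (both are assumptions, not documented choices):

import re
import unicodedata

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def preprocess(text: str) -> str:
    # Normalize Unicode so visually identical Tamil glyph sequences share one encoding
    text = unicodedata.normalize("NFC", text)
    # Collapse whitespace runs; a stand-in for the unspecified noise removal
    return re.sub(r"\s+", " ", text).strip()

sample = preprocess("தமிழ்  மொழி   செயலாக்கம்")
encoded = tokenizer(sample, truncation=True, max_length=512)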

Model Specifications

  • Base Model: bert-base-multilingual-cased
  • Model Size: ~278M parameters (F32, stored as safetensors)
  • Training: Continued masked language modeling (MLM) fine-tuning on Tamil text (a sketch follows this list)
  • Tokenizer Used: mBERT tokenizer (bert-base-multilingual-cased)
  • Batch Size: Not published; tuned to the available hardware
  • Objective: Improve Tamil language representation in mBERT for downstream NLP tasks
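
The training script and hyperparameters are not published. Below is a hedged sketch of continued MLM fine-tuning with the Hugging Face Trainer; the OSCAR config name, sequence length, batch size, epoch count, and learning rate are all assumptions, not values from the card.

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Tamil subset of OSCAR, limited to 100K samples as described in the card
raw = load_dataset("oscar", "unshuffled_deduplicated_ta", split="train[:100000]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Randomly mask 15% of tokens for the masked language modeling objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tamil-mlm",
    per_device_train_batch_size=16,  # assumed; the card does not state a batch size
    num_train_epochs=3,              # assumed
    learning_rate=5e-5,              # assumed
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()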

Usage

This model can be used for multiple Tamil NLP tasks (a sketch of attaching task-specific heads follows the list), such as:
✅ Text Classification
✅ Named Entity Recognition (NER)
✅ Sentiment Analysis
✅ Question Answering
✅ Sentence Embeddings
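
Note that this checkpoint provides the fine-tuned encoder; for the tasks above, a task-specific head is newly initialized on top of it and still needs supervised training on labeled Tamil data. A sketch (the label counts are placeholders):

from transformers import (
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

# Classification head is randomly initialized on top of the Tamil-tuned encoder
clf = AutoModelForSequenceClassification.from_pretrained(
    "viswadarshan06/Tamil-MLM", num_labels=2
)

# Same pattern for NER-style token classification
ner = AutoModelForTokenClassification.from_pretrained(
    "viswadarshan06/Tamil-MLM", num_labels=9
)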

How to Use the Model

To load the model in Python using Hugging Face Transformers, use the following code snippet:

from transformers import AutoTokenizer, AutoModel

model_name = "viswadarshan06/Tamil-MLM"  # Hugging Face Hub repository for this model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenizing a sample Tamil text
text = "தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"
tokens = tokenizer(text, return_tensors="pt")

# Getting model embeddings
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)  # Output shape: (batch_size, seq_length, hidden_size)
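
For sentence embeddings, one common approach (an assumption here, not something the card prescribes) is mean pooling over the last hidden state, continuing from the snippet above:

import torch

# Mean-pool token embeddings into one sentence vector, ignoring padding
# positions via the attention mask.
with torch.no_grad():
    outputs = model(**tokens)

mask = tokens["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
sentence_embedding = summed / mask.sum(dim=1)           # average over real tokens
print(sentence_embedding.shape)  # (batch_size, hidden_size)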

Performance & Evaluation

The model was evaluated on downstream Tamil tasks to validate the improved language representation, and it shows better contextual understanding of Tamil text than the base mBERT model.
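
No concrete metrics are published in the card. One way to quantify the claim, sketched under the assumption that masked-LM loss on held-out Tamil text is a reasonable proxy for contextual understanding, is to score both checkpoints with identical masks:

import torch
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

def masked_lm_loss(model_name: str, texts: list[str], seed: int = 0) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    mlm = AutoModelForMaskedLM.from_pretrained(model_name).eval()
    collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
    features = [tok(t, truncation=True, max_length=128) for t in texts]
    # Both checkpoints share the mBERT vocabulary, so fixing the seed
    # yields identical token masks and a fair comparison.
    torch.manual_seed(seed)
    batch = collator(features)
    with torch.no_grad():
        return mlm(**batch).loss.item()

held_out = ["தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"]  # stand-in for a real eval set
print(masked_lm_loss("bert-base-multilingual-cased", held_out))  # base mBERT
print(masked_lm_loss("viswadarshan06/Tamil-MLM", held_out))      # fine-tuned model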

Conclusion

This fine-tuned mBERT model helps bridge the gap in Tamil NLP by combining mBERT's large-scale multilingual pretraining with Tamil-specific fine-tuning, making it a valuable resource for researchers and developers working on Tamil NLP applications.
