Fine-Tuned mBERT for Enhanced Tamil NLP

Optimized with 100K OSCAR Tamil Data Points

Model Overview

This model is a fine-tuned version of Multilingual BERT (mBERT), trained on 100,000 samples from the Tamil subset of the OSCAR corpus to strengthen Tamil language understanding. Fine-tuning improves the model's handling of Tamil text, making it a stronger starting point for downstream NLP tasks such as text classification, named entity recognition, and masked-token prediction.

Dataset Details

  • Dataset Name: OSCAR (Open Super-large Crawled Aggregated coRpus) – Tamil subset
  • Size: 100K samples
  • Preprocessing: Text normalization, tokenization with the mBERT tokenizer, and noise removal for improved data quality (a sketch follows this list).
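
The exact preprocessing pipeline is not published in detail. A minimal sketch of what such a step could look like, assuming NFC Unicode normalization and a simple whitespace-based noise filter (both are assumptions, not documented choices):

import re
import unicodedata

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def preprocess(text: str) -> str:
    # Normalize Unicode so visually identical Tamil glyph sequences share one encoding
    text = unicodedata.normalize("NFC", text)
    # Collapse whitespace runs; a stand-in for the unspecified noise removal
    return re.sub(r"\s+", " ", text).strip()

sample = preprocess("தமிழ்  மொழி   செயலாக்கம்")
encoded = tokenizer(sample, truncation=True, max_length=512)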

Model Specifications

  • Base Model: bert-base-multilingual-cased
  • Model Size: ~278M parameters (F32, stored as safetensors)
  • Training: Continued masked language modeling (MLM) fine-tuning on Tamil text (a sketch follows this list)
  • Tokenizer Used: mBERT tokenizer (bert-base-multilingual-cased)
  • Batch Size: Not published; tuned to the available hardware
  • Objective: Improve Tamil language representation in mBERT for downstream NLP tasks
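
The training script and hyperparameters are not published. Below is a hedged sketch of continued MLM fine-tuning with the Hugging Face Trainer; the OSCAR config name, sequence length, batch size, epoch count, and learning rate are all assumptions, not values from the card.

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Tamil subset of OSCAR, limited to 100K samples as described in the card
raw = load_dataset("oscar", "unshuffled_deduplicated_ta", split="train[:100000]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Randomly mask 15% of tokens for the masked language modeling objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tamil-mlm",
    per_device_train_batch_size=16,  # assumed; the card does not state a batch size
    num_train_epochs=3,              # assumed
    learning_rate=5e-5,              # assumed
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()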

Usage

This model can be used for multiple Tamil NLP tasks (a sketch of attaching task-specific heads follows the list), such as:
✅ Text Classification
✅ Named Entity Recognition (NER)
✅ Sentiment Analysis
✅ Question Answering
✅ Sentence Embeddings
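
Note that this checkpoint provides the fine-tuned encoder; for the tasks above, a task-specific head is newly initialized on top of it and still needs supervised training on labeled Tamil data. A sketch (the label counts are placeholders):

from transformers import (
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

# Classification head is randomly initialized on top of the Tamil-tuned encoder
clf = AutoModelForSequenceClassification.from_pretrained(
    "viswadarshan06/Tamil-MLM", num_labels=2
)

# Same pattern for NER-style token classification
ner = AutoModelForTokenClassification.from_pretrained(
    "viswadarshan06/Tamil-MLM", num_labels=9
)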

How to Use the Model

To load the model in Python using Hugging Face Transformers, use the following code snippet:

from transformers import AutoTokenizer, AutoModel

model_name = "viswadarshan06/Tamil-MLM"  # Hugging Face Hub repository for this model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenizing a sample Tamil text
text = "தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"
tokens = tokenizer(text, return_tensors="pt")

# Getting model embeddings
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)  # Output shape: (batch_size, seq_length, hidden_size)
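
For sentence embeddings, one common approach (an assumption here, not something the card prescribes) is mean pooling over the last hidden state, continuing from the snippet above:

import torch

# Mean-pool token embeddings into one sentence vector, ignoring padding
# positions via the attention mask.
with torch.no_grad():
    outputs = model(**tokens)

mask = tokens["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
sentence_embedding = summed / mask.sum(dim=1)           # average over real tokens
print(sentence_embedding.shape)  # (batch_size, hidden_size)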

Performance & Evaluation

The model was evaluated on downstream Tamil tasks to validate the improved language representation, and it shows better contextual understanding of Tamil text than the base mBERT model.
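
No concrete metrics are published in the card. One way to quantify the claim, sketched under the assumption that masked-LM loss on held-out Tamil text is a reasonable proxy for contextual understanding, is to score both checkpoints with identical masks:

import torch
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

def masked_lm_loss(model_name: str, texts: list[str], seed: int = 0) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    mlm = AutoModelForMaskedLM.from_pretrained(model_name).eval()
    collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
    features = [tok(t, truncation=True, max_length=128) for t in texts]
    # Both checkpoints share the mBERT vocabulary, so fixing the seed
    # yields identical token masks and a fair comparison.
    torch.manual_seed(seed)
    batch = collator(features)
    with torch.no_grad():
        return mlm(**batch).loss.item()

held_out = ["தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"]  # stand-in for a real eval set
print(masked_lm_loss("bert-base-multilingual-cased", held_out))  # base mBERT
print(masked_lm_loss("viswadarshan06/Tamil-MLM", held_out))      # fine-tuned model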

Conclusion

This fine-tuned mBERT model helps bridge the gap in Tamil NLP by combining mBERT's large-scale multilingual pretraining with Tamil-specific fine-tuning, making it a valuable resource for researchers and developers working on Tamil NLP applications.
