---
license: mit
datasets:
- oscar-corpus/OSCAR-2301
language:
- ta
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: fill-mask
library_name: transformers
---

# **Fine-Tuned mBERT for Enhanced Tamil NLP**

### *Optimized with 100K OSCAR Tamil Data Points*

## **Model Overview**

This model is a fine-tuned version of **Multilingual BERT (mBERT)** trained on 100,000 samples from the Tamil subset of the **OSCAR** corpus. Fine-tuning with a masked language modeling objective improves the model's handling of Tamil text, making it a stronger starting point for downstream NLP tasks such as text classification, named entity recognition, and question answering.

## **Dataset Details**

- **Dataset Name**: OSCAR (Open Super-large Crawled ALMAnaCH coRpus) – Tamil subset
- **Size**: 100K samples
- **Preprocessing**: Text normalization, tokenization with the mBERT tokenizer, and noise removal to improve data quality

## **Model Specifications**

- **Base Model**: `bert-base-multilingual-cased`
- **Training**: Custom fine-tuning on Tamil text with a masked language modeling (fill-mask) objective
- **Tokenizer**: mBERT tokenizer (`bert-base-multilingual-cased`)
- **Batch Size**: Tuned for throughput on the available hardware
- **Objective**: Improve Tamil language representation in mBERT for downstream NLP tasks

## **Usage**

This model can be used for multiple Tamil NLP tasks, including:

- ✅ Text Classification
- ✅ Named Entity Recognition (NER)
- ✅ Sentiment Analysis
- ✅ Question Answering
- ✅ Sentence Embeddings

## **How to Use the Model**

To load the model in Python with **Hugging Face Transformers**:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "viswadarshan06/Tamil-MLM"  # Replace with your model path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a sample Tamil sentence
# ("Natural language processing is important in the Tamil language!")
text = "தமிழ் மொழியில் இயற்கை மொழி செயலாக்கம் முக்கியம்!"
tokens = tokenizer(text, return_tensors="pt")

# Run the encoder without tracking gradients and inspect the embeddings
with torch.no_grad():
    outputs = model(**tokens)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_length, hidden_size)
```

Further example sketches for fill-mask inference, sentence embeddings, and downstream fine-tuning appear at the end of this card.

## **Performance & Evaluation**

The model was evaluated on downstream tasks to validate its improved Tamil language representation, and it shows better contextual understanding of Tamil text than the base mBERT model.

## **Conclusion**

This fine-tuned mBERT model helps close the gap in Tamil NLP by combining large-scale multilingual pretraining with Tamil-specific fine-tuning, making it a valuable resource for researchers and developers working on Tamil NLP applications.
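## **Example: Fill-Mask Inference**

Since this card's pipeline tag is `fill-mask`, the model can be queried directly for masked-token prediction. Below is a minimal sketch; the masked Tamil sentence is an illustrative example, not taken from the training data.

```python
from transformers import pipeline

# Load the fine-tuned model through the fill-mask pipeline
# (matches the pipeline_tag declared in this card's metadata).
fill_mask = pipeline("fill-mask", model="viswadarshan06/Tamil-MLM")

# mBERT uses [MASK] as its mask token. The illustrative sentence asks the
# model to complete "Tamil [MASK] processing is important!"
predictions = fill_mask("தமிழ் [MASK] செயலாக்கம் முக்கியம்!")

# Each prediction contains the filled-in token and its probability score.
for pred in predictions:
    print(pred["token_str"], f"(score: {pred['score']:.4f})")
```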
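## **Example: Sentence Embeddings**

The usage list above mentions sentence embeddings, but the card does not prescribe a pooling strategy. The sketch below uses attention-mask-aware mean pooling over the last hidden states, which is one common recipe; the `embed` helper and the sample sentences are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "viswadarshan06/Tamil-MLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(sentences):
    # Tokenize a batch with padding/truncation so all sentences share one tensor.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
    # Mean-pool over real tokens only, masking out padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Two illustrative Tamil sentences:
# "Tamil is a beautiful language." / "Natural language processing is important."
embeddings = embed(["தமிழ் ஒரு அழகான மொழி.", "இயற்கை மொழி செயலாக்கம் முக்கியம்."])
print(embeddings.shape)  # (2, 768) for a BERT-base encoder
```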
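## **Example: Downstream Fine-Tuning**

For tasks such as text classification, the encoder can be loaded with a task head via the standard Transformers classes. Note that the head is randomly initialized and still needs task-specific training; `num_labels=2` here is an illustrative choice.

```python
from transformers import AutoModelForSequenceClassification

# Attach a (randomly initialized) classification head on top of the
# fine-tuned Tamil encoder; train it on labeled data before use.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "viswadarshan06/Tamil-MLM",
    num_labels=2,  # illustrative: e.g., binary sentiment
)
```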