---
license: apache-2.0
language:
- hi
tags:
- hindi
- sentiment-analysis
- text-classification
- bert
datasets:
- OdiaGenAI/sentiment_analysis_hindi
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- FacebookAI/xlm-roberta-large
model-index:
- name: hindi-sentiment-analysis
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Hindi Sentiment
      type: OdiaGenAI/sentiment_analysis_hindi
    metrics:
    - type: accuracy
      value: 81.3
      name: Test Accuracy
    - type: f1
      value: 0.82
      name: F1 Score
---

# Hindi Sentiment Analysis Model

This repository contains a Hindi sentiment analysis model that classifies text into three categories: negative (neg), neutral (neu), and positive (pos). The model was trained and evaluated across several BERT-based architectures, with XLM-RoBERTa showing the best performance.

## Model Performance

### Test Accuracy Comparison

![Test Accuracy Comparison](./test_accuracy_comparison.png)

Our evaluation produced the following test accuracies:

- XLM-RoBERTa: 81.3%
- mBERT: 76.5%
- Custom-BERT-Attention: 74.9%
- IndicBERT: 69.9%

### Detailed Results

#### Confusion Matrices

![Confusion Matrices](./confusion_matrices.png)

The confusion matrices show the prediction performance of each model:

- XLM-RoBERTa shows the strongest performance, with 82.1% accuracy on the positive class
- mBERT demonstrates balanced performance across classes
- Custom-BERT-Attention maintains consistent performance across classes
- IndicBERT shows room for improvement in negative class detection

#### Per-class Metrics

![Per-class Metrics](./per_class_metrics.png)

The detailed per-class metrics show:

1. Precision:
   - Positive class: Best performance across all models (~0.80-0.85)
   - Neutral class: Consistent performance (~0.75-0.80)
   - Negative class: More varied performance (~0.40-0.70)

2. Recall:
   - Positive class: High recall across models (~0.85-0.90)
   - Neutral class: Moderate recall (~0.65-0.85)
   - Negative class: Lowest recall, varying widely across models (~0.25-0.60)

3. F1-Score:
   - Positive class: Best overall performance (~0.80-0.85)
   - Neutral class: Good balance (~0.70-0.80)
   - Negative class: Main area for improvement (~0.30-0.65)

### Training Progress

![Training Progress](./training_progress.png)

The training curves show:

- Consistent loss reduction across epochs
- Stable improvement in validation accuracy
- No significant overfitting
- XLM-RoBERTa achieving the best validation accuracy
- Custom-BERT-Attention showing rapid initial learning

## Model Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("madhav112/hindi-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("madhav112/hindi-sentiment-analysis")

# Example: classify a Hindi sentence ("This movie is very good")
text = "यह फिल्म बहुत अच्छी है"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# The index of the highest logit is the predicted class (neg / neu / pos);
# model.config.id2label maps it back to a label name if it is set.
predictions = outputs.logits.argmax(-1)
```

## Model Architecture

The repository contains experiments with multiple BERT-based architectures:

1. XLM-RoBERTa (best performing)
   - Highest overall accuracy
   - Best performance on positive sentiment
   - Strong cross-lingual capabilities

2. mBERT
   - Good balanced performance
   - Strong on neutral class detection
   - Consistent across all metrics

3. Custom-BERT-Attention
   - Competitive performance
   - Quick convergence during training
   - Good precision on the positive class

4. IndicBERT
   - Baseline performance
   - Room for improvement
   - Better suited for specific Indian language tasks

## Dataset

The model was trained on a Hindi sentiment analysis dataset with three classes:

- Positive (pos)
- Neutral (neu)
- Negative (neg)

The confusion matrices above indicate a broadly balanced class distribution, with the strongest performance on the positive and neutral classes.
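As a convenience, below is a minimal loading sketch, assuming the Hub dataset `OdiaGenAI/sentiment_analysis_hindi` listed in the metadata above; the `train` split and the column names are assumptions and may need adjusting to the dataset's actual schema.

```python
from datasets import load_dataset

# Load the Hindi sentiment dataset from the Hugging Face Hub.
# The "train" split name is an assumption, not verified against the dataset.
dataset = load_dataset("OdiaGenAI/sentiment_analysis_hindi", split="train")

# Hypothetical mapping between this card's three classes and integer ids
label2id = {"neg": 0, "neu": 1, "pos": 2}

# Inspect one raw example to confirm the actual column names before training
print(dataset[0])
```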
## Training Details

The model was trained for 7 epochs with the following characteristics:

- Learning rate: Tuned separately for each architecture
- Batch size: Adjusted per architecture for best performance
- Validation split: Used for regular evaluation during training
- Early stopping: Monitored to select the best checkpoint
- Loss function: Cross-entropy loss

A hedged fine-tuning sketch consistent with these settings is included at the end of this card.

## Limitations

- Lower performance on negative sentiment detection compared to positive
- The neutral class shows moderate confusion with both the positive and negative classes
- Performance may vary on domain-specific text
- Best suited for standard Hindi text; may have reduced performance on heavily colloquial or dialectal variations

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{madhav2024hindisentiment,
  author = {Madhav},
  title = {Hindi Sentiment Analysis Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/madhav112/hindi-sentiment-analysis}}
}
```

## Author

**Madhav**

- HuggingFace: [madhav112](https://huggingface.co/madhav112)

## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

## Acknowledgments

Special thanks to the HuggingFace team and the open-source community for providing the tools and frameworks that made this model possible.
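## Fine-tuning Sketch

For reference, here is a minimal fine-tuning sketch consistent with the Training Details above, using the `transformers` `Trainer` API. The base checkpoint, epoch count, label set, and cross-entropy objective come from this card; the learning rate, batch size, split names, and column names are illustrative assumptions, not the exact settings used.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Base checkpoint and the three-class label set come from this card
model_name = "FacebookAI/xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,  # neg / neu / pos; cross-entropy loss is applied internally
    id2label={0: "neg", 1: "neu", 2: "pos"},
    label2id={"neg": 0, "neu": 1, "pos": 2},
)

# Split and column names below are assumptions; the Trainer expects the
# integer class column to be named "label" (or "labels")
dataset = load_dataset("OdiaGenAI/sentiment_analysis_hindi")

def tokenize(batch):
    # "text" is an assumed column name; adjust to the dataset's real schema
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="hindi-sentiment-analysis",
    num_train_epochs=7,              # from the Training Details above
    learning_rate=2e-5,              # assumed value; tuned per architecture in practice
    per_device_train_batch_size=16,  # assumed value
    eval_strategy="epoch",           # evaluate on the validation split each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the best checkpoint, as described above
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],  # "validation" split name is an assumption
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```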