+# **AST Fine-Tuned Model for Emotion Classification**
+This is an Audio Spectrogram Transformer (AST) model fine-tuned to classify emotions in speech audio. It was fine-tuned on the **CREMA-D dataset** to predict one of six emotion categories, starting from **MIT's pre-trained AST checkpoint**.
+---
+## **Model Details**
+- **Base Model**: `MIT/ast-finetuned-audioset-10-10-0.4593`
+- **Fine-Tuned Dataset**: CREMA-D
+- **Architecture**: Audio Spectrogram Transformer (AST)
+- **Model Type**: Single-label classification
+- **Input Features**: Log-Mel Spectrograms (128 mel bins)
+- **Output Classes**:
+  - **ANG**: Anger
+  - **DIS**: Disgust
+  - **FEA**: Fear
+  - **HAP**: Happiness
+  - **NEU**: Neutral
+  - **SAD**: Sadness
+---
+## **Model Configuration**
+- **Hidden Size**: 768
+- **Number of Attention Heads**: 12
+- **Number of Hidden Layers**: 12
+- **Patch Size**: 16
+- **Maximum Length**: 1024
+- **Dropout Probability**: 0.0
+- **Activation Function**: GELU (Gaussian Error Linear Unit)
+- **Optimizer**: Adam
+- **Learning Rate**: 1e-4
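
To make the configuration values above concrete, the following sketch estimates how many tokens the transformer sees per clip. It assumes the patch strides of 10 in both frequency and time implied by the `10-10` in the base checkpoint name, and two special tokens (classification and distillation) as in the Hugging Face AST implementation. Neither value is stated in this card, and the function name `ast_sequence_length` is hypothetical.

```python
def ast_sequence_length(
    num_mel_bins=128,       # from "Input Features" in this card
    max_length=1024,        # from "Maximum Length" in this card
    patch_size=16,          # from "Patch Size" in this card
    frequency_stride=10,    # assumption: taken from the base checkpoint name
    time_stride=10,         # assumption: taken from the base checkpoint name
    num_special_tokens=2,   # assumption: CLS + distillation tokens
):
    """Return (frequency patches, time patches, total sequence length)."""
    # Overlapping patches: slide a patch_size window with the given stride.
    freq_patches = (num_mel_bins - patch_size) // frequency_stride + 1
    time_patches = (max_length - patch_size) // time_stride + 1
    total_tokens = freq_patches * time_patches + num_special_tokens
    return freq_patches, time_patches, total_tokens


freq_patches, time_patches, total_tokens = ast_sequence_length()
print(freq_patches, time_patches, total_tokens)
```

Under these assumptions each spectrogram is split into 12 × 101 = 1212 overlapping patches, giving a sequence of 1214 tokens.
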
+---
+## **Training Details**
+- **Dataset**: CREMA-D (Emotion-Labeled Speech Data)
+- **Data Augmentation**:
+  - Noise injection
+  - Time shifting
+  - Speed perturbation
+- **Fine-Tuning Epochs**: 5
+- **Batch Size**: 16
+- **Learning Rate Scheduler**: Linear decay
+- **Best Validation Accuracy**: 60.71%
+- **Best Checkpoint**: `./results/checkpoint-1119`
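
The three augmentations listed above can be sketched on a raw waveform with NumPy. This is a minimal illustration, not the code used to train this model: the function names, the SNR-based noise level, the circular shift, and the interpolation-based speed change are all assumptions made for the example.

```python
import numpy as np


def add_noise(waveform, snr_db, rng):
    """Noise injection: add Gaussian noise at a target signal-to-noise ratio."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise


def time_shift(waveform, max_shift, rng):
    """Time shifting: circularly roll the signal by a random number of samples."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(waveform, shift)


def speed_perturb(waveform, rate):
    """Speed perturbation: resample by linear interpolation.

    rate > 1 shortens the clip (faster speech); this also shifts the pitch.
    """
    n_out = int(round(len(waveform) / rate))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)


# Example on one second of a synthetic 16 kHz signal (placeholder for real audio).
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 100, 16000))
noisy = add_noise(clean, snr_db=20.0, rng=rng)
shifted = time_shift(clean, max_shift=1600, rng=rng)
faster = speed_perturb(clean, rate=1.1)
```

Augmentations like these are typically applied to training clips only, before the feature extractor converts the waveform to a log-mel spectrogram.
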
+---
+## **How to Use**
+### **Load the Model**
+```python
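+import librosa  # assumption: any loader that returns a 1-D float waveform works
+import torch
+from transformers import AutoModelForAudioClassification, AutoProcessor
+# Load the model and processor
+model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
+processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")
+# Load the audio as a 16 kHz mono waveform; the processor expects samples, not a file path
+waveform, _ = librosa.load("path_to_audio.wav", sr=16000)
+# Convert the waveform to log-mel spectrogram features
+inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
+# Make predictions without tracking gradients
+with torch.no_grad():
+    outputs = model(**inputs)
+predicted_class = outputs.logits.argmax(-1).item()
+# id2label is keyed by integer class index once the config is loaded
+print(f"Predicted emotion: {model.config.id2label[predicted_class]}")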
+```
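
Because the reported accuracy is moderate, it can be useful to inspect the full probability distribution rather than only the top class. The sketch below converts a logits vector into ranked per-class probabilities with a softmax. The label order in `ID2LABEL` and the example logits are made up for illustration; in practice the mapping should come from `model.config.id2label` and the logits from `outputs.logits[0]`.

```python
import numpy as np

# Assumption: illustrative label order only; read the real one from model.config.id2label.
ID2LABEL = {0: "ANG", 1: "DIS", 2: "FEA", 3: "HAP", 4: "NEU", 5: "SAD"}


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def rank_emotions(logits, id2label=ID2LABEL):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    order = np.argsort(probs)[::-1]
    return [(id2label[int(i)], float(probs[i])) for i in order]


# Made-up logits standing in for outputs.logits[0].tolist()
example_logits = [0.2, -1.0, 0.1, 2.3, 0.5, -0.4]
ranking = rank_emotions(example_logits)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

A small gap between the top two probabilities is a reasonable signal to treat a prediction as uncertain.
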
+---
+## **Metrics**
+### **Validation Results**
+- **Best Validation Accuracy**: 60.71%
+- **Validation Loss**: 1.1126
+### **Evaluation Details**
+- **Eval Dataset**: CREMA-D test split
+- **Batch Size**: 16
+- **Number of Steps**: 94
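
The step count and batch size above bound the size of the evaluation set. The sketch below works out that range, assuming evaluation ran on a single device and kept the last, possibly partial, batch; the card does not confirm either point, and the helper name `eval_sample_bounds` is hypothetical.

```python
def eval_sample_bounds(num_steps, batch_size):
    """Return the (minimum, maximum) number of samples covered by the steps.

    Assumes every batch except possibly the last one is full.
    """
    min_samples = (num_steps - 1) * batch_size + 1  # last batch holds one sample
    max_samples = num_steps * batch_size            # last batch is full
    return min_samples, max_samples


low, high = eval_sample_bounds(num_steps=94, batch_size=16)
print(f"Evaluation set size is between {low} and {high} clips")
```

Under these assumptions the evaluation set contains between 1489 and 1504 clips.
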
+---
+## **Limitations**
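+- The model was trained only on CREMA-D, which consists of acted English speech. It may not generalize well to spontaneous speech or to data with different accents, speech styles, recording conditions, or languages.
+- The best validation accuracy is 60.71%. This is well above the roughly 16.7% expected from uniform random guessing over six classes, but it still leaves substantial room for improvement before real-world deployment.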
+---
+## **Acknowledgments**
+This work is based on the **Audio Spectrogram Transformer (AST)** model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.
+---
+## **License**
+The model is shared under the MIT License. Refer to the licensing details in the repository.
+---
+## **Citation**
+If you use this model in your work, please cite:
+```
+@misc{ast-finetuned-model,
+  author = {forwarder1121},
+  title = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
+  year = {2024},
+  url = {https://huggingface.co/forwarder1121/ast-finetuned-model},
+}
+```
+---
+## **Contact**
+For questions, reach out to `[email protected]`.