---
language:
  - en
datasets:
  - CREMA-D
library_name: transformers
tags:
  - emotion-classification
  - audio-classification
  - audio-spectrogram
  - transformer
  - fine-tuned
license: apache-2.0
pipeline_tag: audio-classification
base_model: "MIT/ast-finetuned-audioset-10-10-0.4593"
metrics:
  - accuracy
  - f1
task_categories:
  - audio-classification
---


# AST Fine-Tuned Model for Emotion Classification


This is a fine-tuned Audio Spectrogram Transformer (AST) model for classifying emotions in speech audio. It was fine-tuned on the **CREMA-D dataset** to predict one of six emotion categories, starting from **MIT's pre-trained AST checkpoint** (`MIT/ast-finetuned-audioset-10-10-0.4593`).

---

## **Model Details**
- **Base Model**: `MIT/ast-finetuned-audioset-10-10-0.4593`
- **Fine-Tuned Dataset**: CREMA-D
- **Architecture**: Audio Spectrogram Transformer (AST)
- **Model Type**: Single-label classification
- **Input Features**: Log-Mel Spectrograms (128 mel bins)
- **Output Classes**:
  - **ANG**: Anger
  - **DIS**: Disgust
  - **FEA**: Fear
  - **HAP**: Happiness
  - **NEU**: Neutral
  - **SAD**: Sadness

---

## **Model Configuration**
- **Hidden Size**: 768
- **Number of Attention Heads**: 12
- **Number of Hidden Layers**: 12
- **Patch Size**: 16
- **Maximum Length**: 1024
- **Dropout Probability**: 0.0
- **Activation Function**: GELU (Gaussian Error Linear Unit)
- **Optimizer**: Adam
- **Learning Rate**: 1e-4
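
To make the configuration values concrete, the sketch below estimates how many tokens the transformer sees for one input. It combines the patch size (16) and maximum length (1024) listed above with the 128 mel bins from the model details. The frequency and time strides of 10 and the two special tokens are assumptions taken from the base checkpoint's naming and the usual AST setup, not values stated in this card, and `ast_sequence_length` is a hypothetical helper.

```python
def ast_sequence_length(
    num_mel_bins=128,      # from this card (input features)
    max_length=1024,       # from this card (time frames)
    patch_size=16,         # from this card
    frequency_stride=10,   # ASSUMPTION: not stated in this card
    time_stride=10,        # ASSUMPTION: not stated in this card
    num_special_tokens=2,  # ASSUMPTION: classification + distillation tokens
):
    """Return (frequency patches, time patches, total sequence length)."""
    # Overlapping patches: slide a 16x16 window with the given strides.
    freq_patches = (num_mel_bins - patch_size) // frequency_stride + 1
    time_patches = (max_length - patch_size) // time_stride + 1
    total = freq_patches * time_patches + num_special_tokens
    return freq_patches, time_patches, total


freq_patches, time_patches, seq_len = ast_sequence_length()
print(f"{freq_patches} x {time_patches} patches, sequence length {seq_len}")
```

Under these assumptions the spectrogram is split into 12 x 101 overlapping patches, which is why inference on full-length inputs is comparatively memory-hungry.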

---

## **Training Details**
- **Dataset**: CREMA-D (Emotion-Labeled Speech Data)
- **Data Augmentation**:
  - Noise injection
  - Time shifting
  - Speed perturbation
- **Fine-Tuning Epochs**: 5
- **Batch Size**: 16
- **Learning Rate Scheduler**: Linear decay
- **Best Validation Accuracy**: 60.71%
- **Best Checkpoint**: `./results/checkpoint-1119`
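
The three augmentations listed above can be sketched on a raw waveform as follows. This is a minimal NumPy illustration, not the training code used for this model: the function names, the 20 dB signal-to-noise ratio, the 0.1 s maximum shift, and the 1.1 speed factor are all hypothetical choices.

```python
import numpy as np


def inject_noise(wave, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def time_shift(wave, max_shift, rng):
    """Circularly shift the waveform by a random offset of up to max_shift samples."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(wave, shift)


def speed_perturb(wave, rate):
    """Resample by linear interpolation; rate > 1 shortens (speeds up) the clip."""
    n_out = max(1, int(round(len(wave) / rate)))
    src_positions = np.linspace(0.0, len(wave) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(wave)), wave)


# Hypothetical example: one second of a 220 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = 0.5 * np.sin(2 * np.pi * 220 * t)

noisy = inject_noise(wave, snr_db=20.0, rng=rng)
shifted = time_shift(noisy, max_shift=sample_rate // 10, rng=rng)
augmented = speed_perturb(shifted, rate=1.1)
print(len(wave), len(augmented))
```

Augmentations like these are normally applied to the waveform before feature extraction, so the log-mel spectrogram is computed from the perturbed audio.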

---

## **How to Use**

### **Load the Model**
```python
import librosa
import torch
from transformers import AutoModelForAudioClassification, AutoProcessor

# Load the model and processor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio as a 16 kHz mono waveform; the processor expects raw samples,
# not a file path, and converts them to a log-mel spectrogram
waveform, _ = librosa.load("path_to_audio.wav", sr=16000, mono=True)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

# Make predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()

# id2label is keyed by integer class index once the config is loaded
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```
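
Because this is a single-label classifier, the logits can also be turned into a probability for each emotion with a softmax instead of keeping only the top class. The sketch below uses plain Python with made-up logits. The label order shown is an assumption (it follows the order in this card); in practice, read the order from `model.config.id2label`.

```python
import math

# ASSUMPTION: class order follows the list in this card; check model.config.id2label.
LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_emotions(logits, labels=LABELS):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for one clip, for illustration only.
example_logits = [0.2, -1.0, 0.1, 2.3, 0.5, -0.4]
ranking = rank_emotions(example_logits)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

With real model output, `outputs.logits[0].tolist()` would take the place of `example_logits`.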

---

## **Metrics**

### **Validation Results**
- **Best Validation Accuracy**: 60.71%
- **Validation Loss**: 1.1126

### **Evaluation Details**
- **Eval Dataset**: CREMA-D test split
- **Batch Size**: 16
- **Number of Steps**: 94
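
The step count and batch size above bound the size of the evaluation set. The small sketch below works out that range, assuming evaluation ran on a single device and the last batch was allowed to be partial; neither detail is stated in this card, and `sample_count_range` is a hypothetical helper.

```python
def sample_count_range(num_steps, batch_size):
    """Return the (min, max) number of samples consistent with a step count.

    ASSUMPTION: single device, and the final batch may be partially filled.
    """
    smallest = (num_steps - 1) * batch_size + 1  # last batch holds one sample
    largest = num_steps * batch_size             # last batch is full
    return smallest, largest


low, high = sample_count_range(num_steps=94, batch_size=16)
print(f"Evaluation set size: between {low} and {high} clips")
```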

---

## **Limitations**
- The model was trained only on CREMA-D, which consists of acted English speech. It may not generalize well to spontaneous speech or to data with different accents, speech styles, or languages.
- Validation accuracy is 60.71%, indicating room for improvement for real-world deployment.

---

## **Acknowledgments**
This work is based on the **Audio Spectrogram Transformer (AST)** model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.

---

## **License**
The model is shared under the Apache 2.0 license (`apache-2.0`), as declared in the model card metadata. Refer to the licensing details in the repository.

---

## **Citation**
If you use this model in your work, please cite:
```
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year = {2024},
  url = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
```

---

## **Contact**
For questions, reach out to `[email protected]`.