---
language:
- en
datasets:
- CREMA-D
library_name: transformers
tags:
- emotion-classification
- audio-classification
- audio-spectrogram
- transformer
- fine-tuned
license: apache-2.0
pipeline_tag: audio-classification
base_model: "MIT/ast-finetuned-audioset-10-10-0.4593"
metrics:
- accuracy
- f1
task_categories:
- audio-classification
---

# AST Fine-Tuned Model for Emotion Classification

This is a fine-tuned Audio Spectrogram Transformer (AST) model for classifying emotions in speech audio. It was fine-tuned on the **CREMA-D dataset**, which covers six emotion categories, starting from **MIT's pre-trained AST model**.


---

## **Model Details**

- **Base Model**: `MIT/ast-finetuned-audioset-10-10-0.4593`
- **Fine-Tuned Dataset**: CREMA-D
- **Architecture**: Audio Spectrogram Transformer (AST)
- **Model Type**: Single-label classification
- **Input Features**: Log-Mel Spectrograms (128 mel bins)
- **Output Classes**:
  - **ANG**: Anger
  - **DIS**: Disgust
  - **FEA**: Fear
  - **HAP**: Happiness
  - **NEU**: Neutral
  - **SAD**: Sadness

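
To make the class list above concrete, the following sketch maps a vector of six logits to a label code, a readable name, and a softmax probability. The index order used here (alphabetical by code) and the helper name `decode_logits` are assumptions for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
import math

# Label codes from the model card. The index order is an ASSUMPTION
# (alphabetical); check model.config.id2label for the real mapping.
EMOTIONS = {
    "ANG": "Anger",
    "DIS": "Disgust",
    "FEA": "Fear",
    "HAP": "Happiness",
    "NEU": "Neutral",
    "SAD": "Sadness",
}
ID2CODE = dict(enumerate(sorted(EMOTIONS)))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_logits(logits):
    """Return (code, name, probability) for the highest-scoring class."""
    if len(logits) != len(ID2CODE):
        raise ValueError(f"expected {len(ID2CODE)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    code = ID2CODE[best]
    return code, EMOTIONS[code], probs[best]


# Hypothetical logits, for illustration only
code, name, prob = decode_logits([0.1, -1.2, 0.3, 2.5, 0.0, -0.4])
print(f"{code} ({name}): {prob:.2f}")
```

Returning the probability alongside the label makes it easier to flag low-confidence predictions rather than trusting every argmax.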

---

## **Model Configuration**

- **Hidden Size**: 768
- **Number of Attention Heads**: 12
- **Number of Hidden Layers**: 12
- **Patch Size**: 16
- **Maximum Length**: 1024
- **Dropout Probability**: 0.0
- **Activation Function**: GELU (Gaussian Error Linear Unit)
- **Optimizer**: Adam
- **Learning Rate**: 1e-4

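
The configuration values above determine how many tokens the transformer sees per clip. As a rough worked example, the sketch below counts patches for a 1024-frame, 128-bin spectrogram split into 16×16 patches. The stride of 10 in both directions is an assumption read off the base model's name (`10-10`), and the two extra tokens assume AST's classification and distillation tokens; neither is stated in this card.

```python
def patches_along(size: int, patch: int, stride: int) -> int:
    """Number of patch positions along one axis, without padding."""
    return (size - patch) // stride + 1


NUM_MEL_BINS = 128  # from the model card
MAX_LENGTH = 1024   # from the model card (time frames)
PATCH_SIZE = 16     # from the model card
HIDDEN_SIZE = 768   # from the model card
NUM_HEADS = 12      # from the model card
STRIDE = 10         # ASSUMPTION: inferred from "10-10" in the base model name
EXTRA_TOKENS = 2    # ASSUMPTION: classification + distillation tokens

freq_patches = patches_along(NUM_MEL_BINS, PATCH_SIZE, STRIDE)  # 12
time_patches = patches_along(MAX_LENGTH, PATCH_SIZE, STRIDE)    # 101
num_patches = freq_patches * time_patches                       # 1212
sequence_length = num_patches + EXTRA_TOKENS                    # 1214
head_dim = HIDDEN_SIZE // NUM_HEADS                             # 64

print(f"{freq_patches} x {time_patches} = {num_patches} patches")
print(f"sequence length = {sequence_length}, per-head dimension = {head_dim}")
```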

---

## **Training Details**

- **Dataset**: CREMA-D (Emotion-Labeled Speech Data)
- **Data Augmentation**:
  - Noise injection
  - Time shifting
  - Speed perturbation
- **Fine-Tuning Epochs**: 5
- **Batch Size**: 16
- **Learning Rate Scheduler**: Linear decay
- **Best Validation Accuracy**: 60.71%
- **Best Checkpoint**: `./results/checkpoint-1119`

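
The card lists three augmentations but not how they were implemented. The sketch below shows one common way to realize each on a raw waveform, using only the Python standard library. The function names, parameter values, and the choice of circular shifting and linear-interpolation resampling are all assumptions for illustration, not the training code behind this model.

```python
import random


def add_noise(wave, noise_std=0.005, rng=random):
    """Noise injection: add zero-mean Gaussian noise to every sample."""
    return [s + rng.gauss(0.0, noise_std) for s in wave]


def time_shift(wave, shift):
    """Time shifting: rotate the signal right by `shift` samples (circular)."""
    if not wave:
        return []
    shift %= len(wave)
    return wave[-shift:] + wave[:-shift] if shift else list(wave)


def speed_perturb(wave, rate):
    """Speed perturbation: resample by linear interpolation.

    rate > 1 shortens the clip (faster); rate < 1 lengthens it (slower).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if len(wave) < 2:
        return list(wave)
    out_len = max(1, int(len(wave) / rate))
    out = []
    for i in range(out_len):
        pos = min(i * rate, len(wave) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(wave) - 1)
        frac = pos - lo
        out.append(wave[lo] * (1 - frac) + wave[hi] * frac)
    return out


# Toy 8-sample "waveform", for illustration only
wave = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
rng = random.Random(0)  # seeded so the noise is reproducible
noisy = add_noise(wave, rng=rng)
shifted = time_shift(wave, 2)
faster = speed_perturb(wave, 2.0)
print(len(noisy), shifted, faster)
```

In practice these transforms would be applied to the waveform before feature extraction, so the log-mel spectrogram reflects the augmented audio.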

---

## **How to Use**

### **Load the Model**

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load the model and its feature extractor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
feature_extractor = AutoFeatureExtractor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio as a 16 kHz mono waveform; the feature extractor expects raw
# samples, not a file path (librosa is one option, any equivalent loader works)
waveform, _ = librosa.load("path_to_audio.wav", sr=16000, mono=True)

# Convert the waveform to log-mel spectrogram features
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

# Make predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()

# id2label is keyed by integer class index once the config is loaded
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```
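
Because the example above calls the model's preprocessing with `sampling_rate=16000`, it can help to confirm a file's format before running inference. The sketch below uses only Python's standard `wave` module to read a 16-bit PCM WAV file and check its sample rate. The helper name `read_pcm16_mono` and the decision to reject mismatched audio instead of resampling it are assumptions for illustration.

```python
import struct
import wave

EXPECTED_RATE = 16000  # matches sampling_rate=16000 in the usage example


def read_pcm16_mono(path, expected_rate=EXPECTED_RATE):
    """Read a 16-bit PCM WAV file and return mono float samples in [-1, 1).

    Raises ValueError if the sample rate or sample width is unexpected.
    Multi-channel audio is averaged down to mono.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())
    if rate != expected_rate:
        raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz; resample first")
    if width != 2:
        raise ValueError(f"expected 16-bit samples, got {8 * width}-bit")
    count = len(frames) // 2
    ints = struct.unpack(f"<{count}h", frames[: 2 * count])
    # Average interleaved channels into one mono stream, then scale to floats
    return [
        sum(ints[i : i + channels]) / channels / 32768.0
        for i in range(0, count - channels + 1, channels)
    ]
```

The returned list of floats is a raw waveform, which is the kind of input the preprocessing step expects in place of a file path.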

---

## **Metrics**

### **Validation Results**

- **Best Validation Accuracy**: 60.71%
- **Validation Loss**: 1.1126

### **Evaluation Details**

- **Eval Dataset**: CREMA-D test split
- **Batch Size**: 16
- **Number of Steps**: 94

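
The metadata lists both accuracy and F1 as metrics, although only accuracy is reported above. For reference, here is a minimal pure-Python sketch of how both can be computed from true and predicted label codes. Macro-averaging the F1 score is an assumption (the card does not say which average was used), and the labels in the example are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores.

    A class with no true samples and no predictions scores 0.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]

# Made-up labels, for illustration only
y_true = ["ANG", "HAP", "SAD", "NEU", "FEA", "DIS", "HAP", "SAD"]
y_pred = ["ANG", "HAP", "NEU", "NEU", "FEA", "ANG", "HAP", "SAD"]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred, LABELS):.4f}")
```

Macro F1 weights every emotion equally, so a class the model rarely gets right pulls the score down even when overall accuracy looks acceptable.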

---

## **Limitations**

- The model was trained only on CREMA-D, a single English-language speech corpus. It may not generalize well to data with different accents, speech styles, or languages.
- The best validation accuracy is 60.71% on a six-class task, which leaves considerable room for improvement before real-world deployment.


---

## **Acknowledgments**

This work is based on the **Audio Spectrogram Transformer (AST)** model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.


---

## **License**

The model is shared under the Apache 2.0 license, as declared in the model card metadata (`license: apache-2.0`). Refer to the licensing details in the repository.


---

## **Citation**

If you use this model in your work, please cite:

```
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year = {2024},
  url = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
```

---

## **Contact**

For questions, reach out to `[email protected]`.