---
language:
  - en
datasets:
  - CREMA-D
library_name: transformers
tags:
  - emotion-classification
  - audio-classification
  - audio-spectrogram
  - transformer
  - fine-tuned
license: apache-2.0
pipeline_tag: audio-classification
base_model: "MIT/ast-finetuned-audioset-10-10-0.4593"
metrics:
  - accuracy
  - f1
task_categories:
  - audio-classification
---


# AST Fine-Tuned Model for Emotion Classification


This is a fine-tuned Audio Spectrogram Transformer (AST) model for classifying emotions in speech audio. It was fine-tuned on the **CREMA-D dataset** across six emotion categories, starting from **MIT's pre-trained AST model** (`MIT/ast-finetuned-audioset-10-10-0.4593`).

---

## **Model Details**
- **Base Model**: `MIT/ast-finetuned-audioset-10-10-0.4593`
- **Fine-Tuned Dataset**: CREMA-D
- **Architecture**: Audio Spectrogram Transformer (AST)
- **Model Type**: Single-label classification
- **Input Features**: Log-Mel Spectrograms (128 mel bins)
- **Output Classes**:
  - **ANG**: Anger
  - **DIS**: Disgust
  - **FEA**: Fear
  - **HAP**: Happiness
  - **NEU**: Neutral
  - **SAD**: Sadness
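
The six class codes above match the emotion field in CREMA-D file names, which follow the pattern `ActorID_Sentence_Emotion_Level.wav` (for example `1001_DFA_ANG_XX.wav`). The following is a minimal sketch of recovering a label from a file name; the helper name `emotion_from_filename` and the example path are hypothetical and not part of this repository.

```python
from pathlib import Path

# Class codes and names as listed in this model card
EMOTION_NAMES = {
    "ANG": "Anger",
    "DIS": "Disgust",
    "FEA": "Fear",
    "HAP": "Happiness",
    "NEU": "Neutral",
    "SAD": "Sadness",
}


def emotion_from_filename(path):
    """Return (code, name) parsed from a CREMA-D style file name.

    Assumes the ActorID_Sentence_Emotion_Level naming pattern;
    this helper is illustrative only.
    """
    parts = Path(path).stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"Unexpected file name format: {path}")
    code = parts[2]
    if code not in EMOTION_NAMES:
        raise ValueError(f"Unknown emotion code: {code}")
    return code, EMOTION_NAMES[code]


# Hypothetical path used only for illustration
code, name = emotion_from_filename("AudioWAV/1001_DFA_ANG_XX.wav")
print(code, name)
```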

---

## **Model Configuration**
- **Hidden Size**: 768
- **Number of Attention Heads**: 12
- **Number of Hidden Layers**: 12
- **Patch Size**: 16
- **Maximum Length**: 1024
- **Dropout Probability**: 0.0
- **Activation Function**: GELU (Gaussian Error Linear Unit)
- **Optimizer**: Adam
- **Learning Rate**: 1e-4
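
To see how the patch size and maximum length translate into the transformer's sequence length, the sketch below counts the overlapping spectrogram patches. It assumes frequency and time strides of 10 (suggested by the `10-10` in the base model name, and not stated in this card) and two extra tokens (classification and distillation) prepended to the patch sequence, as in the Transformers AST implementation. The function name is hypothetical.

```python
def ast_sequence_length(num_mel_bins=128, max_length=1024, patch_size=16,
                        frequency_stride=10, time_stride=10):
    """Count AST patches for one spectrogram.

    The stride values are assumptions, not taken from this model card.
    """
    # Patches along the mel-frequency axis
    freq_patches = (num_mel_bins - patch_size) // frequency_stride + 1
    # Patches along the time axis
    time_patches = (max_length - patch_size) // time_stride + 1
    # Two extra tokens: classification and distillation
    sequence_length = freq_patches * time_patches + 2
    return freq_patches, time_patches, sequence_length


print(ast_sequence_length())  # (12, 101, 1214)
```

Under these assumptions each input becomes 12 x 101 = 1212 patches, so the encoder attends over 1214 tokens of hidden size 768.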

---

## **Training Details**
- **Dataset**: CREMA-D (Emotion-Labeled Speech Data)
- **Data Augmentation**:
  - Noise injection
  - Time shifting
  - Speed perturbation
- **Fine-Tuning Epochs**: 5
- **Batch Size**: 16
- **Learning Rate Scheduler**: Linear decay
- **Best Validation Accuracy**: 60.71%
- **Best Checkpoint**: `./results/checkpoint-1119`
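
The three augmentations listed above can be sketched on a raw waveform with NumPy. This is an illustrative version only: the exact implementation and parameters used in training are not given in this card, so the function names, the default noise level, and the shift and rate values are all assumptions.

```python
import numpy as np


def add_noise(waveform, noise_level=0.005, rng=None):
    """Noise injection: add Gaussian noise scaled by noise_level (assumed value)."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, 1.0, size=waveform.shape)
    return (waveform + noise_level * noise).astype(waveform.dtype)


def time_shift(waveform, max_shift, rng=None):
    """Time shifting: circularly shift by a random offset in [-max_shift, max_shift]."""
    rng = rng or np.random.default_rng()
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(waveform, shift)


def speed_perturb(waveform, rate):
    """Speed perturbation by linear-interpolation resampling.

    rate > 1 shortens the clip (faster), rate < 1 lengthens it (slower).
    This simple resampling also shifts the pitch.
    """
    n_out = max(1, int(round(len(waveform) / rate)))
    old_positions = np.arange(len(waveform))
    new_positions = np.linspace(0, len(waveform) - 1, num=n_out)
    return np.interp(new_positions, old_positions, waveform).astype(waveform.dtype)


# One second of a 440 Hz tone at 16 kHz as stand-in audio
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
waveform = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

augmented = speed_perturb(time_shift(add_noise(waveform, rng=rng), 1600, rng=rng), 1.1)
print(waveform.shape, augmented.shape)
```

Applying the augmentations to the waveform before feature extraction keeps the log-mel pipeline unchanged.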

---

## **How to Use**

### **Load the Model**
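```python
import librosa  # assumed audio loader; any library that returns a 16 kHz mono waveform works
import torch
from transformers import AutoModelForAudioClassification, AutoProcessor

# Load the model and processor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio as a mono waveform resampled to 16 kHz;
# the processor converts the waveform to a log-mel spectrogram
waveform, sampling_rate = librosa.load("path_to_audio.wav", sr=16000)
inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

# Make predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()

# id2label is keyed by integer class index
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```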
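
The model's `logits` are unnormalized scores, one per class. If a ranked list with probabilities is more useful than a single label, a softmax can be applied, as in the sketch below. The logit values are made up for illustration, and the index-to-label order in `ID2LABEL` is an assumption; the actual mapping should be read from `model.config.id2label`.

```python
import math

# Hypothetical index order; read the real one from model.config.id2label
ID2LABEL = {0: "ANG", 1: "DIS", 2: "FEA", 3: "HAP", 4: "NEU", 5: "SAD"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_emotions(logits, id2label=ID2LABEL):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    pairs = [(id2label[i], p) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up logits, e.g. from outputs.logits[0].tolist()
example_logits = [0.2, -1.0, 0.1, 2.3, 0.5, -0.4]
for label, prob in rank_emotions(example_logits):
    print(f"{label}: {prob:.3f}")
```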

---

## **Metrics**

### **Validation Results**
- **Best Validation Accuracy**: 60.71%
- **Validation Loss**: 1.1126

### **Evaluation Details**
- **Eval Dataset**: CREMA-D test split
- **Batch Size**: 16
- **Number of Steps**: 94
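
The model card metadata lists accuracy and F1 as metrics. The following is a minimal pure-Python sketch of how both could be computed from predicted and reference labels. Macro averaging is an assumption here, since the card does not say which F1 variant was used, and the labels in the example are invented.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples cannot occur here,
        # but guard against division by zero anyway
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented labels for illustration only
y_true = ["ANG", "HAP", "SAD", "ANG"]
y_pred = ["ANG", "SAD", "SAD", "ANG"]
print(accuracy(y_true, y_pred))  # 0.75
print(round(macro_f1(y_true, y_pred), 4))  # 0.5556
```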

---

## **Limitations**
- The model was trained on CREMA-D, which has a specific set of speech data. It may not generalize well to datasets with different accents, speech styles, or languages.
- The best validation accuracy is 60.71% across six classes, so the model leaves substantial room for improvement before real-world deployment.

---

## **Acknowledgments**
This work is based on the **Audio Spectrogram Transformer (AST)** model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.

---

## **License**
The model card metadata declares the Apache 2.0 license (`license: apache-2.0`). Refer to the licensing details in the repository.

---

## **Citation**
If you use this model in your work, please cite:
```
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year = {2024},
  url = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
```

---

## **Contact**
For questions, reach out to `[email protected]`.