# Fine-Tuned Wav2Vec2 for Speech Emotion Recognition

# Model Details
```
Model Name: Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
Base Model: facebook/wav2vec2-base
Dataset: narad/ravdess
Quantization: Available as an optional FP16 version for optimized inference
Training Device: CUDA (GPU)
```

# Dataset Information
```
Dataset Structure:

DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'labels', 'speaker_id', 'speaker_gender'],
        num_rows: 1440
    })
})
```

**Note:** The dataset was split manually into 80% train (1,152 examples) and 20% validation (288 examples) during training, as the original dataset provides only a single "train" split; a short sketch of this preparation follows.

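A minimal sketch of that preparation with the `datasets` library (the 16 kHz resampling target comes from the preprocessing described below; the split seed is an illustrative assumption, since the exact seed is not recorded here):

```python
from datasets import Audio, load_dataset

# RAVDESS ships as a single "train" split of 1,440 examples.
dataset = load_dataset("narad/ravdess")

# Resample the 48 kHz recordings to the 16 kHz that wav2vec2 expects.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Manual 80/20 train/validation split; the seed here is illustrative.
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), len(val_ds))  # 1152 288
```
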
# Available Splits
- **Train:** 1,152 examples (after the 80/20 split)
- **Validation:** 288 examples (after the 80/20 split)
- **Test:** Not provided; external audio was used for testing

# Feature Representation
- **audio:** Raw waveform (48 kHz, resampled to 16 kHz during preprocessing; see the sketch after this list)
- **text:** Spoken sentence (e.g., "Dogs are sitting by the door")
- **labels:** Integer emotion labels (0–7)
- **speaker_id:** Actor identifier (e.g., "9")
- **speaker_gender:** Gender of the speaker (e.g., "male")

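A rough sketch of how these features could be turned into model inputs for training, assuming a `Wav2Vec2FeatureExtractor` from the base checkpoint and the `train_ds`/`val_ds` splits from the earlier sketch; padding everything to 10 s is a simplification so the default collator can batch the examples:

```python
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess(example):
    # "audio" holds the decoded 16 kHz waveform; "labels" is already an integer 0-7.
    audio = example["audio"]
    inputs = feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        max_length=160_000,        # 10 s at 16 kHz, matching the inference example
        padding="max_length",
        truncation=True,
    )
    example["input_values"] = inputs.input_values[0]
    return example

# Keep only what the classifier needs: input_values and labels.
drop = ["audio", "text", "speaker_id", "speaker_gender"]
encoded_train = train_ds.map(preprocess, remove_columns=drop)
encoded_val = val_ds.map(preprocess, remove_columns=drop)
```
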
# Training Details
- **Number of Classes:** 8
- **Class Names:** neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **Training Process:** Fine-tuned for 10 epochs (initially 3, revised to 10 for better convergence); a configuration sketch follows this list
- **Learning Rate:** 3e-5, with 100 warmup steps and weight decay of 0.1
- **Batch Size:** 4, with gradient accumulation (effective batch size 8)
- **Regularization:** Dropout added (attention_dropout=0.1, hidden_dropout=0.1)

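A hedged sketch of a `Trainer` setup that matches these hyperparameters (the output path, per-epoch evaluation, and `gradient_accumulation_steps=2` to reach the effective batch size of 8 are assumptions, not a record of the exact run; `encoded_train`/`encoded_val` come from the preprocessing sketch above):

```python
import numpy as np
import evaluate
from transformers import Trainer, TrainingArguments, Wav2Vec2ForSequenceClassification

labels = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    attention_dropout=0.1,
    hidden_dropout=0.1,
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",   # illustrative path
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_steps=100,
    weight_decay=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,           # effective batch size 8
    eval_strategy="epoch",                   # evaluation_strategy on older transformers
)

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=eval_pred.label_ids)["accuracy"],
        # The averaging choice below is an assumption.
        "f1": f1.compute(predictions=preds, references=eval_pred.label_ids, average="weighted")["f1"],
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train,
    eval_dataset=encoded_val,
    compute_metrics=compute_metrics,
)
trainer.train()
```
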
# Performance Metrics
- **Epochs:** 10
- **Training Loss:** ~0.8
- **Validation Loss:** ~1.2
- **Accuracy:** ~0.65
- **F1 Score:** ~0.63

# Inference Example
```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import librosa

def load_model(model_path):
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    return model, processor, device

def predict_emotion(model_path, audio_path):
    model, processor, device = load_model(model_path)

    # Load and preprocess audio
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
        max_length=160000,
        truncation=True,
    )
    input_values = inputs["input_values"].to(device)

    # Inference
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).item()
        probabilities = torch.softmax(logits, dim=1).squeeze().cpu().numpy()

    emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
    return emotions[predicted_label], {emotion: prob for emotion, prob in zip(emotions, probabilities)}

# Example usage
if __name__ == "__main__":
    model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # Update with your HF username/repo
    audio_path = "path/to/audio.wav"
    emotion, probs = predict_emotion(model_path, audio_path)
    print(f"Predicted Emotion: {emotion}")
    print("Probabilities:", probs)
```

# Quantization & Optimization
- **Quantization:** An optional FP16 version created with PyTorch's `.half()` for faster inference and a reduced memory footprint.
- **Optimization:** Suitable for deployment on GPU-enabled devices; the FP16 version reduces model size by ~50%.

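A minimal sketch of how the FP16 copy could be produced (the paths are placeholders, reusing the layout from the inference example):

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification

# Load the fine-tuned FP32 checkpoint and cast its weights to half precision.
model = Wav2Vec2ForSequenceClassification.from_pretrained("path/to/wav2vec2-ravdess-emotion/final_model")
model = model.half()

# Save the FP16 copy next to the original; this roughly halves the on-disk size.
model.save_pretrained("path/to/wav2vec2-ravdess-emotion/final_model_fp16")
```

For FP16 inference, keep the model on a GPU and cast the input tensor to `torch.float16` before the forward pass.
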
# Usage
- **Input:** Raw audio files (.wav), resampled to 16 kHz
- **Output:** Predicted emotion label (one of 8 classes) with confidence probabilities

# Limitations
- **Generalization:** Trained on acted speech (RAVDESS); it may underperform on spontaneous or noisy real-world audio.
- **Dataset Size:** Limited to 1,440 samples, which may be insufficient for robust emotion recognition across diverse conditions.
- **Accuracy:** Performance on external audio varies; retraining with augmentation or larger datasets may be needed.

# Future Improvements
- **Data Augmentation:** Incorporate noise, pitch shift, or speed changes to improve robustness.
- **Larger Dataset:** Combine with additional SER datasets (e.g., IEMOCAP, CREMA-D) for diversity.
- **Model Tuning:** Experiment with freezing lower layers or using a model pre-trained for SER (e.g., facebook/wav2vec2-large-robust).