# Fine-Tuned Wav2Vec2 for Speech Emotion Recognition

# Model Details

```
Model Name: Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
Base Model: facebook/wav2vec2-base
Dataset: narad/ravdess
Quantization: Available as an optional FP16 version for optimized inference
Training Device: CUDA (GPU)
```

# Dataset Information

```
Dataset Structure:

DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'labels', 'speaker_id', 'speaker_gender'],
        num_rows: 1440
    })
})
```

**Note:** Split manually into 80% train (1,152 examples) and 20% validation (288 examples) during training, as the original dataset provides only a single "train" split.
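
The 80/20 split therefore has to be recreated at training time. A minimal sketch using the `datasets` library is shown below; the `seed` value is an illustrative assumption, not necessarily the one used for this model.

```python
# Sketch of the manual 80/20 train/validation split described above.
from datasets import load_dataset

dataset = load_dataset("narad/ravdess")                              # single "train" split, 1,440 rows
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)   # seed is an assumed example
train_ds, val_ds = splits["train"], splits["test"]                   # 1,152 / 288 examples
```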

# Available Splits:

- **Train:** 1,152 examples (after 80/20 split)
- **Validation:** 288 examples (after 80/20 split)
- **Test:** Not provided; external audio used for testing
# Feature Representation:

- **audio:** Raw waveform (48 kHz, resampled to 16 kHz during preprocessing; see the sketch after this list)
- **text:** Spoken sentence (e.g., "Dogs are sitting by the door")
- **labels:** Integer labels for emotions (0–7)
- **speaker_id:** Actor identifier (e.g., "9")
- **speaker_gender:** Gender of the speaker (e.g., "male")
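
Since the RAVDESS recordings are stored at 48 kHz, the audio column must be resampled to 16 kHz before feature extraction. The sketch below, continuing from the split sketch above, shows one way to do this with the `datasets` `Audio` feature and the base model's feature extractor; it is illustrative and not necessarily the exact preprocessing code used for this model.

```python
# Sketch of the 48 kHz -> 16 kHz resampling and feature-extraction step.
from datasets import Audio
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Resample the audio column on access via the datasets Audio feature
train_ds = train_ds.cast_column("audio", Audio(sampling_rate=16000))

def preprocess(batch):
    audio = batch["audio"]
    batch["input_values"] = feature_extractor(
        audio["array"], sampling_rate=16000
    ).input_values[0]
    return batch

train_ds = train_ds.map(preprocess)
```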

# Training Details

- **Number of Classes:** 8
- **Class Names:** neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **Training Process:** Fine-tuned for 10 epochs (initially 3, revised to 10 for better convergence); a configuration sketch follows this list
- **Learning rate:** 3e-5, with 100 warmup steps and a weight decay of 0.1
- **Batch size:** 4, with gradient accumulation (effective batch size 8)
- **Dropout:** attention_dropout=0.1 and hidden_dropout=0.1, added for regularization
- **Performance Metrics:**
  - **Epochs:** 10
  - **Training Loss:** ~0.8
  - **Validation Loss:** ~1.2
  - **Accuracy:** ~0.65
  - **F1 Score:** ~0.63
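
A hedged sketch of a `Trainer` configuration matching the hyperparameters above; the output directory is an assumed name, and the data collator and metric computation are omitted for brevity.

```python
# Illustrative Trainer setup for the hyperparameters listed above
# (not the exact training script used for this model).
from transformers import Wav2Vec2ForSequenceClassification, TrainingArguments, Trainer

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=8,            # 8 emotion classes
    attention_dropout=0.1,
    hidden_dropout=0.1,
)

training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",  # assumed output directory
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_steps=100,
    weight_decay=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,          # effective batch size 8
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # 80% split from the sketch above
    eval_dataset=val_ds,     # 20% validation split
)
trainer.train()
```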

# Inference Example

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import librosa


def load_model(model_path):
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    return model, processor, device


def predict_emotion(model_path, audio_path):
    model, processor, device = load_model(model_path)

    # Load and preprocess audio
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
        max_length=160000,
        truncation=True,
    )
    input_values = inputs["input_values"].to(device)

    # Inference
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).item()
        probabilities = torch.softmax(logits, dim=1).squeeze().cpu().numpy()

    emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
    return emotions[predicted_label], {emotion: prob for emotion, prob in zip(emotions, probabilities)}


# Example usage
if __name__ == "__main__":
    model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # Update with your HF username/repo
    audio_path = "path/to/audio.wav"
    emotion, probs = predict_emotion(model_path, audio_path)
    print(f"Predicted Emotion: {emotion}")
    print("Probabilities:", probs)
```

# Quantization & Optimization

- **Quantization:** An optional FP16 version was created with PyTorch’s .half() for faster inference and a reduced memory footprint.
- **Optimization:** Suitable for deployment on GPU-enabled devices; the FP16 version reduces model size by ~50%.
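
A minimal sketch of that conversion, assuming the model path from the inference example and an illustrative output directory:

```python
# Convert the fine-tuned model to FP16 with .half() and save it (paths are examples).
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained("path/to/wav2vec2-ravdess-emotion/final_model")
model = model.half().to("cuda")                         # FP16 weights for GPU inference
model.save_pretrained("wav2vec2-ravdess-emotion-fp16")  # assumed output directory

# At inference time, cast inputs to half precision as well:
# input_values = input_values.half().to("cuda")
```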

# Usage

- **Input:** Raw audio files (.wav), resampled to 16 kHz
- **Output:** Predicted emotion label (one of 8 classes) with confidence probabilities

# Limitations

- **Generalization:** Trained on acted speech (RAVDESS); may underperform on spontaneous or noisy real-world audio.
- **Dataset Size:** Limited to 1,440 samples, which may be insufficient for robust emotion recognition across diverse conditions.
- **Accuracy:** Performance on external audio varies; retraining with augmentation or larger datasets may be needed.

# Future Improvements

- **Data Augmentation:** Incorporate noise, pitch shifting, or speed changes to improve robustness (a small sketch follows this list).
- **Larger Dataset:** Combine with additional SER datasets (e.g., IEMOCAP, CREMA-D) for diversity.
- **Model Tuning:** Experiment with freezing lower layers or using a model pre-trained for SER (e.g., facebook/wav2vec2-large-robust).
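
As a starting point for the augmentation idea above, here is a small, illustrative sketch of waveform-level transforms using librosa and NumPy; the parameter values are arbitrary examples, not settings used for this model.

```python
# Illustrative waveform-level augmentations (values are examples only).
import numpy as np
import librosa

def add_noise(audio, noise_factor=0.005):
    """Add low-level Gaussian background noise."""
    return audio + noise_factor * np.random.randn(len(audio))

def pitch_shift(audio, sr=16000, n_steps=2):
    """Shift pitch by a number of semitones."""
    return librosa.effects.pitch_shift(y=audio, sr=sr, n_steps=n_steps)

def speed_change(audio, rate=1.1):
    """Time-stretch the audio (changes speed without changing pitch)."""
    return librosa.effects.time_stretch(y=audio, rate=rate)
```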