# Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
# Model Details
```
Model Name: Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
Base Model: facebook/wav2vec2-base
Dataset: narad/ravdess
Quantization: Available as an optional FP16 version for optimized inference
Training Device: CUDA (GPU)
```
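The exact training script is not included in this card. As a rough, hedged sketch of how `facebook/wav2vec2-base` might be set up for this 8-class task (label order and dropout values are taken from the Training Details section below; everything else is an illustrative assumption):

```python
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

# Label set used by this model (see "Training Details" below).
EMOTIONS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

# Assumed setup: a classification head on top of the base checkpoint, with the
# dropout values listed in the training details.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(EMOTIONS),
    label2id={e: i for i, e in enumerate(EMOTIONS)},
    id2label={i: e for i, e in enumerate(EMOTIONS)},
    attention_dropout=0.1,
    hidden_dropout=0.1,
)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```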
# Dataset Information
```
Dataset Structure:
DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'labels', 'speaker_id', 'speaker_gender'],
        num_rows: 1440
    })
})
```
**Note:** The dataset was split manually into 80% train (1,152 examples) and 20% validation (288 examples) during training, since the original dataset provides only a single "train" split (see the split sketch below).
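A minimal sketch of reproducing that split with the `datasets` library (the `seed` value and variable names are assumptions; the card does not state them):

```python
from datasets import load_dataset

dataset = load_dataset("narad/ravdess")  # single "train" split with 1,440 rows

# 80/20 train/validation split; seed=42 is an illustrative assumption.
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]  # 1,152 / 288 examples
```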
# Available Splits
- **Train:** 1,152 examples (after 80/20 split)
- **Validation:** 288 examples (after 80/20 split)
- **Test:** Not provided; external audio used for testing
# Feature Representation
- **audio:** Raw waveform (48 kHz, resampled to 16 kHz during preprocessing; see the resampling sketch after this list)
- **text:** Spoken sentence (e.g., "Dogs are sitting by the door")
- **labels:** Integer labels for emotions (0–7)
- **speaker_id:** Actor identifier (e.g., "9")
- **speaker_gender:** Gender of speaker (e.g., "male")
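The preprocessing script is not included in this card; a minimal sketch of the 48 kHz to 16 kHz resampling step using the `datasets` `Audio` feature (column names follow the feature list above):

```python
from datasets import Audio, load_dataset

ds = load_dataset("narad/ravdess", split="train")

# Decode and resample the 48 kHz recordings to 16 kHz on the fly.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = ds[0]["audio"]
waveform, sr = sample["array"], sample["sampling_rate"]  # sr == 16000
```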
# Training Details
- **Number of Classes:** 8
- **Class Names:** neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **Training Process:** Fine-tuned for 10 epochs (initially 3, revised to 10 for better convergence); a configuration sketch follows this list
- **Learning rate:** 3e-5, with warmup steps (100) and weight decay (0.1)
- **Batch size:** 4 with gradient accumulation (effective batch size 8)
- **Regularization:** Dropout added (attention_dropout=0.1, hidden_dropout=0.1)
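A hedged sketch of how these hyperparameters map onto `transformers.TrainingArguments` (the output directory is an assumption, not taken from this card; the actual training script may differ):

```python
from transformers import TrainingArguments

# Values mirror the list above; output_dir is an illustrative assumption.
training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_steps=100,
    weight_decay=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
)
```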
# Performance Metrics
- **Epochs:** 10
- **Training Loss:** ~0.8
- **Validation Loss:** ~1.2
- **Accuracy:** ~0.65
- **F1 Score:** ~0.63
# Inference Example
```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import librosa
def load_model(model_path):
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    return model, processor, device

def predict_emotion(model_path, audio_path):
    model, processor, device = load_model(model_path)

    # Load and preprocess audio
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
        max_length=160000,
        truncation=True,
    )
    input_values = inputs["input_values"].to(device)

    # Inference
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits
    predicted_label = torch.argmax(logits, dim=1).item()
    probabilities = torch.softmax(logits, dim=1).squeeze().cpu().numpy()

    emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
    return emotions[predicted_label], {emotion: prob for emotion, prob in zip(emotions, probabilities)}

# Example usage
if __name__ == "__main__":
    model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # Update with your HF username/repo
    audio_path = "path/to/audio.wav"
    emotion, probs = predict_emotion(model_path, audio_path)
    print(f"Predicted Emotion: {emotion}")
    print("Probabilities:", probs)
```
# Quantization & Optimization
- **Quantization:** Optional FP16 version created using PyTorch’s .half() for faster inference with reduced memory footprint.
- **Deployment:** Suitable for GPU-enabled devices; the FP16 version reduces model size by ~50%.
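A minimal sketch of producing the optional FP16 variant with PyTorch's `.half()` (the input and output paths are assumptions):

```python
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "path/to/wav2vec2-ravdess-emotion/final_model"  # assumed path
)

# Cast all weights to float16 for faster GPU inference and a ~50% smaller checkpoint.
model_fp16 = model.half()
model_fp16.save_pretrained("wav2vec2-ravdess-emotion-fp16")  # assumed output path

# At inference time on a CUDA device, inputs must also be cast to float16, e.g.:
# logits = model_fp16(input_values.to("cuda").half()).logits
```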
# Usage
- **Input:** Raw audio files (.wav) resampled to 16kHz
- **Output:** Predicted emotion label (one of 8 classes) with confidence probabilities
# Limitations
- **Generalization:** Trained on acted speech (RAVDESS), so it may underperform on spontaneous or noisy real-world audio.
- **Dataset Size:** Limited to 1,440 samples, potentially insufficient for robust emotion recognition across diverse conditions.
- **Accuracy:** Performance on external audio varies; retraining with augmentation or larger datasets may be needed.
# Future Improvements
- **Data Augmentation:** Incorporate noise, pitch shift, or speed changes to improve robustness (a sketch follows this list).
- **Larger Dataset:** Combine with additional SER datasets (e.g., IEMOCAP, CREMA-D) for diversity.
- **Model Tuning:** Experiment with freezing lower layers or using a model pre-trained for SER (e.g., facebook/wav2vec2-large-robust).
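As a hedged illustration of the augmentation idea (additive noise, pitch shift, and speed change on raw waveforms with `librosa` and NumPy; the parameter values are arbitrary examples, not part of this model's training):

```python
import numpy as np
import librosa

def augment(waveform: np.ndarray, sr: int = 16000) -> list[np.ndarray]:
    """Return simple augmented variants of a waveform; parameters are illustrative."""
    noisy = waveform + 0.005 * np.random.randn(len(waveform))             # additive Gaussian noise
    pitched = librosa.effects.pitch_shift(y=waveform, sr=sr, n_steps=2)   # shift up 2 semitones
    faster = librosa.effects.time_stretch(y=waveform, rate=1.1)           # 10% speed-up
    return [noisy, pitched, faster]
```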