# Fine-Tuned Wav2Vec2 for Speech Emotion Recognition

# Model Details
```
Model Name: Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
Base Model: facebook/wav2vec2-base
Dataset: narad/ravdess
Quantization: Available as an optional FP16 version for optimized inference
Training Device: CUDA (GPU)
```

# Dataset Information
```
Dataset Structure:

DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'labels', 'speaker_id', 'speaker_gender'],
        num_rows: 1440
    })
})
```

**Note:** The dataset was split manually into 80% train (1,152 examples) and 20% validation (288 examples) during training, as the original dataset provides only a single "train" split; a short sketch of this preparation follows.

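A minimal sketch of that preparation with the `datasets` library (the 16 kHz resampling target comes from the preprocessing described below; the split seed is an illustrative assumption, since the exact seed is not recorded here):

```python
from datasets import Audio, load_dataset

# RAVDESS ships as a single "train" split of 1,440 examples.
dataset = load_dataset("narad/ravdess")

# Resample the 48 kHz recordings to the 16 kHz that wav2vec2 expects.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Manual 80/20 train/validation split; the seed here is illustrative.
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), len(val_ds))  # 1152 288
```
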
# Available Splits
- **Train:** 1,152 examples (after the 80/20 split)
- **Validation:** 288 examples (after the 80/20 split)
- **Test:** Not provided; external audio was used for testing

# Feature Representation
- **audio:** Raw waveform (48 kHz, resampled to 16 kHz during preprocessing; see the sketch after this list)
- **text:** Spoken sentence (e.g., "Dogs are sitting by the door")
- **labels:** Integer emotion labels (0–7)
- **speaker_id:** Actor identifier (e.g., "9")
- **speaker_gender:** Gender of the speaker (e.g., "male")

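A rough sketch of how these features could be turned into model inputs for training, assuming a `Wav2Vec2FeatureExtractor` from the base checkpoint and the `train_ds`/`val_ds` splits from the earlier sketch; padding everything to 10 s is a simplification so the default collator can batch the examples:

```python
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess(example):
    # "audio" holds the decoded 16 kHz waveform; "labels" is already an integer 0-7.
    audio = example["audio"]
    inputs = feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        max_length=160_000,        # 10 s at 16 kHz, matching the inference example
        padding="max_length",
        truncation=True,
    )
    example["input_values"] = inputs.input_values[0]
    return example

# Keep only what the classifier needs: input_values and labels.
drop = ["audio", "text", "speaker_id", "speaker_gender"]
encoded_train = train_ds.map(preprocess, remove_columns=drop)
encoded_val = val_ds.map(preprocess, remove_columns=drop)
```
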
# Training Details
- **Number of Classes:** 8
- **Class Names:** neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **Training Process:** Fine-tuned for 10 epochs (initially 3, revised to 10 for better convergence); a configuration sketch follows this list
- **Learning Rate:** 3e-5, with 100 warmup steps and weight decay of 0.1
- **Batch Size:** 4, with gradient accumulation (effective batch size 8)
- **Regularization:** Dropout added (attention_dropout=0.1, hidden_dropout=0.1)

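A hedged sketch of a `Trainer` setup that matches these hyperparameters (the output path, per-epoch evaluation, and `gradient_accumulation_steps=2` to reach the effective batch size of 8 are assumptions, not a record of the exact run; `encoded_train`/`encoded_val` come from the preprocessing sketch above):

```python
import numpy as np
import evaluate
from transformers import Trainer, TrainingArguments, Wav2Vec2ForSequenceClassification

labels = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    attention_dropout=0.1,
    hidden_dropout=0.1,
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",   # illustrative path
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_steps=100,
    weight_decay=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,           # effective batch size 8
    eval_strategy="epoch",                   # evaluation_strategy on older transformers
)

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=eval_pred.label_ids)["accuracy"],
        # The averaging choice below is an assumption.
        "f1": f1.compute(predictions=preds, references=eval_pred.label_ids, average="weighted")["f1"],
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train,
    eval_dataset=encoded_val,
    compute_metrics=compute_metrics,
)
trainer.train()
```
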
# Performance Metrics
- **Epochs:** 10
- **Training Loss:** ~0.8
- **Validation Loss:** ~1.2
- **Accuracy:** ~0.65
- **F1 Score:** ~0.63

# Inference Example
```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import librosa

def load_model(model_path):
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    return model, processor, device

def predict_emotion(model_path, audio_path):
    model, processor, device = load_model(model_path)

    # Load and preprocess audio
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
        max_length=160000,
        truncation=True,
    )
    input_values = inputs["input_values"].to(device)

    # Inference
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).item()
        probabilities = torch.softmax(logits, dim=1).squeeze().cpu().numpy()

    emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
    return emotions[predicted_label], {emotion: prob for emotion, prob in zip(emotions, probabilities)}

# Example usage
if __name__ == "__main__":
    model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # Update with your HF username/repo
    audio_path = "path/to/audio.wav"
    emotion, probs = predict_emotion(model_path, audio_path)
    print(f"Predicted Emotion: {emotion}")
    print("Probabilities:", probs)
```

# Quantization & Optimization
- **Quantization:** An optional FP16 version created with PyTorch's `.half()` for faster inference and a reduced memory footprint.
- **Optimization:** Suitable for deployment on GPU-enabled devices; the FP16 version reduces model size by ~50%.

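A minimal sketch of how the FP16 copy could be produced (the paths are placeholders, reusing the layout from the inference example):

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification

# Load the fine-tuned FP32 checkpoint and cast its weights to half precision.
model = Wav2Vec2ForSequenceClassification.from_pretrained("path/to/wav2vec2-ravdess-emotion/final_model")
model = model.half()

# Save the FP16 copy next to the original; this roughly halves the on-disk size.
model.save_pretrained("path/to/wav2vec2-ravdess-emotion/final_model_fp16")
```

For FP16 inference, keep the model on a GPU and cast the input tensor to `torch.float16` before the forward pass.
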
# Usage
- **Input:** Raw audio files (.wav), resampled to 16 kHz
- **Output:** Predicted emotion label (one of 8 classes) with confidence probabilities

# Limitations
- **Generalization:** Trained on acted speech (RAVDESS); it may underperform on spontaneous or noisy real-world audio.
- **Dataset Size:** Limited to 1,440 samples, which may be insufficient for robust emotion recognition across diverse conditions.
- **Accuracy:** Performance on external audio varies; retraining with augmentation or larger datasets may be needed.

# Future Improvements
- **Data Augmentation:** Incorporate noise, pitch shift, or speed changes to improve robustness.
- **Larger Dataset:** Combine with additional SER datasets (e.g., IEMOCAP, CREMA-D) for diversity.
- **Model Tuning:** Experiment with freezing lower layers or using a model pre-trained for SER (e.g., facebook/wav2vec2-large-robust).