YashikaNagpal committed 0f78217 (verified) · 1 parent: 749a7a9

Create README.md

Files changed (1): README.md (+106 -0)
# Fine-Tuned Wav2Vec2 for Speech Emotion Recognition

# Model Details
```
Model Name: Fine-Tuned Wav2Vec2 for Speech Emotion Recognition
Base Model: facebook/wav2vec2-base
Dataset: narad/ravdess
Quantization: Available as an optional FP16 version for optimized inference
Training Device: CUDA (GPU)
```
# Dataset Information
```
Dataset Structure:

DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'labels', 'speaker_id', 'speaker_gender'],
        num_rows: 1440
    })
})
```
**Note:** Split manually into 80% train (1,152 examples) and 20% validation (288 examples) during training, as the original dataset provides only a single "train" split; see the loading sketch after the list below.
# Available Splits
- **Train:** 1,152 examples (after 80/20 split)
- **Validation:** 288 examples (after 80/20 split)
- **Test:** Not provided; external audio used for testing
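For reference, the split described above can be reproduced with the `datasets` library roughly as follows. This is a minimal sketch: the `seed` value and the absence of stratification are assumptions, since the exact split procedure is not documented.

```python
from datasets import load_dataset, Audio

# Load the single "train" split of the RAVDESS dataset from the Hub
ds = load_dataset("narad/ravdess", split="train")

# Resample the 48 kHz recordings to the 16 kHz expected by wav2vec2
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Reproduce the 80/20 train/validation split (seed is an assumption, not documented)
splits = ds.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # 1152, 288
```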
# Feature Representation
- **audio:** Raw waveform (48kHz, resampled to 16kHz during preprocessing)
- **text:** Spoken sentence (e.g., "Dogs are sitting by the door")
- **labels:** Integer labels for emotions (0–7)
- **speaker_id:** Actor identifier (e.g., "9")
- **speaker_gender:** Gender of speaker (e.g., "male")
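Continuing from the loading sketch above, a minimal preprocessing sketch that turns these features into model inputs. It assumes the base checkpoint's feature extractor and the same 10-second cap (`max_length=160000`) used in the inference example further down; the exact pipeline used during training may differ.

```python
from transformers import Wav2Vec2FeatureExtractor

# Feature extractor of the base checkpoint (assumption; a saved processor would also work)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess(batch):
    audio = batch["audio"]  # dict with "array" (waveform) and "sampling_rate" (16000 after casting)
    inputs = feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        max_length=160000,     # 10 s at 16 kHz, matching the inference example below
        padding="max_length",  # fixed-length inputs; dynamic padding with a collator also works
        truncation=True,
    )
    batch["input_values"] = inputs["input_values"][0]
    return batch

# "labels" is already an integer class id (0-7), so only the audio column needs mapping
train_ds = train_ds.map(preprocess, remove_columns=["audio", "text", "speaker_id", "speaker_gender"])
val_ds = val_ds.map(preprocess, remove_columns=["audio", "text", "speaker_id", "speaker_gender"])
```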
# Training Details
- **Number of Classes:** 8
- **Class Names:** neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **Training Process:** Fine-tuned for 10 epochs (initially 3, revised to 10 for better convergence); see the configuration sketch below
- **Learning Rate:** 3e-5, with warmup steps (100) and weight decay (0.1)
- **Batch Size:** 4, with gradient accumulation (effective batch size 8)
- **Regularization:** Dropout added (attention_dropout=0.1, hidden_dropout=0.1)
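A sketch of how these hyperparameters map onto the `transformers` training API. The `output_dir` and anything not listed above are illustrative assumptions, not the exact configuration used.

```python
from transformers import TrainingArguments, Wav2Vec2ForSequenceClassification

# Base checkpoint with an 8-way classification head and the dropout values listed above
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=8,
    attention_dropout=0.1,
    hidden_dropout=0.1,
)

# Hyperparameters from the list above; output_dir is a placeholder
training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_steps=100,
    weight_decay=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
)
# The model and arguments are then passed to a Trainer together with the
# preprocessed train/validation splits from the sketches above.
```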
# Performance Metrics
- **Epochs:** 10
- **Training Loss:** ~0.8
- **Validation Loss:** ~1.2
- **Accuracy:** ~0.65
- **F1 Score:** ~0.63
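The accuracy and F1 numbers above correspond to a standard `compute_metrics` callback; a minimal sketch is shown below (weighted F1 is an assumption, as the averaging method is not documented).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Convert Trainer predictions into accuracy and F1 (weighted averaging assumed)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }

# Passed to Trainer(..., compute_metrics=compute_metrics) during fine-tuning
```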
# Inference Example
```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import librosa

def load_model(model_path):
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    return model, processor, device

def predict_emotion(model_path, audio_path):
    model, processor, device = load_model(model_path)

    # Load and preprocess audio
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True, max_length=160000, truncation=True)
    input_values = inputs["input_values"].to(device)

    # Inference
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits
    predicted_label = torch.argmax(logits, dim=1).item()
    probabilities = torch.softmax(logits, dim=1).squeeze().cpu().numpy()

    emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
    return emotions[predicted_label], {emotion: prob for emotion, prob in zip(emotions, probabilities)}

# Example usage
if __name__ == "__main__":
    model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # Update with your HF username/repo
    audio_path = "path/to/audio.wav"
    emotion, probs = predict_emotion(model_path, audio_path)
    print(f"Predicted Emotion: {emotion}")
    print("Probabilities:", probs)
```
# Quantization & Optimization
- **Quantization:** Optional FP16 version created with PyTorch’s `.half()` for faster inference and a reduced memory footprint (see the sketch below).
- **Optimization:** Suited to deployment on GPU-enabled devices; the FP16 version reduces model size by ~50%.
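A minimal sketch of the FP16 conversion with `.half()`; the model path and output directory are placeholders, and the exact export steps used are not documented.

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification

model_path = "path/to/wav2vec2-ravdess-emotion/final_model"  # placeholder, as in the inference example
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)

# Cast weights to half precision; FP16 inference is intended for GPU deployment
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.half().to(device).eval()

# Inputs must be cast to the same dtype before the forward pass, e.g.:
# logits = model(input_values.half().to(device)).logits

# Optionally save the FP16 weights (roughly half the size on disk)
model.save_pretrained("wav2vec2-ravdess-emotion-fp16")
```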
# Usage
- **Input:** Raw audio files (.wav) resampled to 16kHz
- **Output:** Predicted emotion label (one of 8 classes) with confidence probabilities
# Limitations
- **Generalization:** Trained on acted speech (RAVDESS); may underperform on spontaneous or noisy real-world audio.
- **Dataset Size:** Limited to 1,440 samples, potentially insufficient for robust emotion recognition across diverse conditions.
- **Accuracy:** Performance on external audio varies; retraining with augmentation or larger datasets may be needed.
# Future Improvements
- **Data Augmentation:** Incorporate noise, pitch shift, or speed changes to improve robustness (see the sketch after this list).
- **Larger Dataset:** Combine with additional SER datasets (e.g., IEMOCAP, CREMA-D) for diversity.
- **Model Tuning:** Experiment with freezing lower layers or using a model pre-trained for SER (e.g., facebook/wav2vec2-large-robust).
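A minimal waveform-level augmentation sketch using librosa and NumPy; the specific noise level, pitch step, and stretch rate are illustrative assumptions.

```python
import numpy as np
import librosa

def augment(waveform: np.ndarray, sr: int = 16000):
    """Return simple augmented variants of a waveform: additive noise, pitch shift, speed change."""
    noisy = waveform + 0.005 * np.random.randn(len(waveform))          # additive Gaussian noise
    pitched = librosa.effects.pitch_shift(waveform, sr=sr, n_steps=2)  # shift up by 2 semitones
    stretched = librosa.effects.time_stretch(waveform, rate=0.9)       # ~10% slower
    return [noisy, pitched, stretched]

# Example: augment one training clip loaded at 16 kHz
# y, sr = librosa.load("path/to/audio.wav", sr=16000)
# variants = augment(y, sr)
```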