yuvraj108c committed
Commit 42dbe81 · verified · 1 Parent(s): 15c2fcf

Delete wav2vec-english-speech-emotion-recognition/README.md

wav2vec-english-speech-emotion-recognition/README.md DELETED
@@ -1,86 +0,0 @@
---
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- accuracy
model_index:
  name: wav2vec-english-speech-emotion-recognition
---
# Speech Emotion Recognition By Fine-Tuning Wav2Vec 2.0
The model is a fine-tuned version of [jonatasgrosman/wav2vec2-large-xlsr-53-english](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english) for a Speech Emotion Recognition (SER) task.

Several datasets were used to fine-tune the original model (a sketch of one way to assemble them follows the list):
- Surrey Audio-Visual Expressed Emotion [(SAVEE)](http://kahlan.eps.surrey.ac.uk/savee/Database.html) - 480 audio files from 4 male actors
- Ryerson Audio-Visual Database of Emotional Speech and Song [(RAVDESS)](https://zenodo.org/record/1188976) - 1440 audio files from 24 professional actors (12 female, 12 male)
- Toronto emotional speech set [(TESS)](https://tspace.library.utoronto.ca/handle/1807/24487) - 2800 audio files from 2 female actors
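
The preprocessing used for fine-tuning is not documented in this card. The snippet below is only a rough, hypothetical sketch of collecting a combined `(path, label)` manifest, assuming the three corpora have been downloaded and reorganized locally into one folder per emotion; `DATA_ROOT`, the directory layout, and `build_manifest` are illustrative names, not part of the original training code.
```python
from pathlib import Path

# Hypothetical layout: DATA_ROOT/<emotion>/<clip>.wav after manually
# reorganizing SAVEE, RAVDESS and TESS; not the datasets' native structure.
DATA_ROOT = Path("data/combined_ser")
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def build_manifest(root: Path = DATA_ROOT) -> list[tuple[str, str]]:
    """Collect (audio_path, emotion_label) pairs for the 7 target classes."""
    manifest = []
    for emotion in EMOTIONS:
        for wav in sorted((root / emotion).glob("*.wav")):
            manifest.append((str(wav), emotion))
    return manifest

if __name__ == "__main__":
    pairs = build_manifest()
    print(f"{len(pairs)} labelled clips")  # 480 + 1440 + 2800 = 4720 if all three corpora are present
```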

7 labels/emotions were used as classification labels:
```python
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
```
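
For reference when reading the usage code below, `model.config.id2label` maps class indices back to these strings. The exact index order is not stated in this card; the mapping below is only an assumed example consistent with the list above.
```python
# Assumed index order for illustration; the authoritative mapping is
# model.config.id2label on the downloaded checkpoint.
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
id2label = dict(enumerate(emotions))            # {0: 'angry', 1: 'disgust', ...}
label2id = {v: k for k, v in id2label.items()}
```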

It achieves the following results on the evaluation set:
- Loss: 0.104075
- Accuracy: 0.97463

## Model Usage
```bash
pip install transformers librosa torch
```
```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Load the fine-tuned model and its matching feature extractor from the Hub.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion(audio_path):
    # The model expects 16 kHz mono audio.
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(inputs.input_values)
        # Average the per-frame logits over the sequence length, then softmax.
        predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)
        predicted_label = torch.argmax(predictions, dim=-1)
        emotion = model.config.id2label[predicted_label.item()]
    return emotion

emotion = predict_emotion("example_audio.wav")
print(f"Predicted emotion: {emotion}")
# >> Predicted emotion: angry
```

## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (an illustrative mapping to `TrainingArguments` follows the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- eval_steps: 500
- seed: 42
- gradient_accumulation_steps: 2
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- num_epochs: 4
- max_steps: 7500
- save_steps: 1500
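
The training script itself is not included in this card. The following is a hedged sketch of how these hyperparameters might map onto `transformers.TrainingArguments`; the output directory, the commented-out `Trainer` wiring, and `compute_metrics` are placeholders rather than the authors' actual setup.
```python
from transformers import TrainingArguments

# Sketch only: mirrors the listed hyperparameters; names and paths are placeholders.
training_args = TrainingArguments(
    output_dir="wav2vec-english-speech-emotion-recognition",  # hypothetical
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=4,
    max_steps=7500,               # overrides num_train_epochs when set
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1500,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

# trainer = Trainer(
#     model=model,                      # the fine-tuned wav2vec 2.0 model
#     args=training_args,
#     train_dataset=train_dataset,      # placeholder dataset objects
#     eval_dataset=eval_dataset,
#     compute_metrics=compute_metrics,  # e.g. accuracy
# )
# trainer.train()
```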

### Training results
| Step | Training Loss | Validation Loss | Accuracy |
| ---- | ------------- | --------------- | -------- |
| 500  | 1.8124        | 1.365212        | 0.486258 |
| 1000 | 0.8872        | 0.773145        | 0.79704  |
| 1500 | 0.7035        | 0.574954        | 0.852008 |
| 2000 | 0.6879        | 1.286738        | 0.775899 |
| 2500 | 0.6498        | 0.697455        | 0.832981 |
| 3000 | 0.5696        | 0.33724         | 0.892178 |
| 3500 | 0.4218        | 0.307072        | 0.911205 |
| 4000 | 0.3088        | 0.374443        | 0.930233 |
| 4500 | 0.2688        | 0.260444        | 0.936575 |
| 5000 | 0.2973        | 0.302985        | 0.92389  |
| 5500 | 0.1765        | 0.165439        | 0.961945 |
| 6000 | 0.1475        | 0.170199        | 0.961945 |
| 6500 | 0.1274        | 0.15531         | 0.966173 |
| 7000 | 0.0699        | 0.103882        | 0.976744 |
| 7500 | 0.083         | 0.104075        | 0.97463  |