wav2vec2-emotion-recognition

This model is a Wav2Vec2 model fine-tuned for speech emotion recognition. It classifies speech into 8 emotions and returns a confidence score for each.

Model Description

  • Model Architecture: Wav2Vec2 with sequence classification head
  • Language: English
  • Task: Speech Emotion Recognition
  • Fine-tuned from: facebook/wav2vec2-base
  • Datasets: Combined emotion datasets
      • TESS
      • CREMA-D
      • SAVEE
      • RAVDESS

Performance Metrics

  • Accuracy: 79.57%
  • F1 Score: 79.43%
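
The card does not say how the F1 score was averaged across the 8 classes. As a minimal sketch, assuming a support-weighted average (an assumption, not confirmed by the source), the two metrics can be computed from true and predicted labels like this. The function names are illustrative only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support.

    Assumption: the reported F1 uses weighted averaging; the model card
    does not state which averaging scheme was used.
    """
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Tiny illustrative example (made-up labels, not real evaluation data)
y_true = ["angry", "angry", "sad", "happy"]
y_pred = ["angry", "sad", "sad", "happy"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```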

Supported Emotions

  • 😠 Angry
  • 😌 Calm
  • 🤢 Disgust
  • 😨 Fearful
  • 😊 Happy
  • 😐 Neutral
  • 😢 Sad
  • 😲 Surprised

Training Details

The model was trained with the following configuration:

  • Epochs: 15
  • Batch Size: 16
  • Learning Rate: 5e-5
  • Optimizer: AdamW
  • Weight Decay: 0.03
  • Gradient Accumulation Steps: 2
  • Mixed Precision: fp16
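
For readers who want to reproduce a similar setup, the configuration above could be written as keyword arguments for the Hugging Face `TrainingArguments` class. This is a sketch, not the author's actual training script: it assumes the listed batch size is per device and that the optimizer is the PyTorch AdamW implementation (`adamw_torch`). It also shows the resulting effective batch size.

```python
# Hyperparameters from the list above, keyed by TrainingArguments parameter names.
training_config = {
    "num_train_epochs": 15,
    "per_device_train_batch_size": 16,  # assumption: 16 is the per-device batch size
    "learning_rate": 5e-5,
    "optim": "adamw_torch",             # assumption: PyTorch AdamW implementation
    "weight_decay": 0.03,
    "gradient_accumulation_steps": 2,
    "fp16": True,                       # mixed precision; requires a CUDA GPU
}


def effective_batch_size(config, num_devices=1):
    """Samples contributing to each optimizer step."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_config))  # 16 * 2 * 1 = 32
```

With `transformers` installed, the dictionary could be passed as `TrainingArguments(output_dir="...", **training_config)`, where `output_dir` is a path you choose.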

For the detailed training process, see the Fine-tuning Notebook.

Limitations

Audio Requirements:

  • Sampling rate: 16 kHz (audio at other rates is resampled to 16 kHz, as in the Usage code)
  • Maximum duration: 1 minute
  • Clear speech with minimal background noise recommended
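
The card does not say what happens to clips longer than one minute. One way to enforce the requirements before inference is sketched below, assuming the waveform is already at 16 kHz and that over-long clips are simply truncated (an assumption, not the documented behavior). `prepare_waveform` is a hypothetical helper, not part of any library.

```python
import numpy as np

TARGET_SAMPLE_RATE = 16_000   # Hz, from the requirements above
MAX_DURATION_SECONDS = 60     # one-minute limit from the requirements above


def prepare_waveform(waveform, sample_rate):
    """Return a mono float32 waveform of at most one minute.

    Expects a 1-D array of samples or a 2-D (channels, samples) array,
    which is the layout torchaudio.load produces.
    """
    if sample_rate != TARGET_SAMPLE_RATE:
        raise ValueError(
            f"expected {TARGET_SAMPLE_RATE} Hz audio, got {sample_rate} Hz; resample first"
        )
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 2:
        # Average the channels to get a mono signal
        waveform = waveform.mean(axis=0)
    max_samples = TARGET_SAMPLE_RATE * MAX_DURATION_SECONDS
    # Assumption: truncate rather than reject clips that are too long
    return waveform[:max_samples]


stereo_clip = np.zeros((2, 1_000_000), dtype=np.float32)  # about 62.5 s of stereo silence
print(prepare_waveform(stereo_clip, 16_000).shape)
```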

Performance Considerations:

  • Best results with clear speech audio
  • Performance may vary with different accents
  • Background noise can affect accuracy

Demo

https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition

Contact

For issues and questions, feel free to:

  1. Open an issue on the Model Repository
  2. Comment on the Demo Space

Usage

from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")

# Load audio; torchaudio returns a (channels, samples) tensor
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")

# Resample to the 16 kHz rate the model expects
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)

# Convert to mono if stereo
if speech_array.shape[0] > 1:
    speech_array = torch.mean(speech_array, dim=0, keepdim=True)

# Drop the channel dimension and convert to a 1-D NumPy array
speech_array = speech_array.squeeze(0).numpy()

# Process through model
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted emotion and its confidence score
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_index = predictions[0].argmax().item()
predicted_emotion = emotion_labels[predicted_index]
confidence = predictions[0, predicted_index].item()
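
The snippet above returns only the most likely emotion. To see the confidence score for several emotions, the logits can be converted to probabilities and ranked. The sketch below uses plain Python so it runs without the model. `top_emotions` is a hypothetical helper, and the label order is taken from the usage code above.

```python
import math

EMOTION_LABELS = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_emotions(logits, k=3):
    """Return the k most likely (label, probability) pairs, highest first."""
    probs = softmax(logits)
    ranked = sorted(zip(EMOTION_LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for illustration; with the real model, pass outputs.logits[0].tolist()
example_logits = [0.1, 0.2, 0.0, -1.0, 3.0, 0.5, 0.3, -0.2]
for label, prob in top_emotions(example_logits):
    print(f"{label}: {prob:.3f}")
```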