---
title: Multi Modal Emotion Recognition
emoji: 📈
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

# Multi Modal Emotion Recognition 📈

This application analyzes emotions in videos using pretrained models for both the audio and the visual content. Upload a video (maximum length: 2 minutes) and the app extracts emotions from both the speech and the facial expressions it contains.
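Since the metadata declares `sdk: gradio` and `app_file: app.py`, the app is served through a Gradio interface. The snippet below is only a minimal sketch of that wiring, not the actual `app.py`; `analyze_video` is a hypothetical placeholder for the real analysis pipeline.

```python
# Minimal Gradio wiring sketch; analyze_video is a hypothetical placeholder.
import gradio as gr

def analyze_video(video_path: str) -> str:
    # The real app would run the audio and visual emotion pipelines here.
    return f"Analysis placeholder for {video_path}"

demo = gr.Interface(
    fn=analyze_video,
    inputs=gr.Video(label="Upload a video (max 2 minutes)"),
    outputs=gr.Textbox(label="Detected emotions"),
    title="Multi Modal Emotion Recognition",
)

if __name__ == "__main__":
    demo.launch()
```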

## Features

- **Audio Emotion Detection**: OpenAI's Whisper model transcribes the speech, and Cardiff NLP's RoBERTa model recognizes the emotion in the transcript (see the audio-path sketch below).
- **Visual Emotion Analysis**: Salesforce's BLIP model captions video frames, and J-Hartmann's DistilRoBERTa recognizes the emotion in the captions (see the visual-path sketch under "Models Used").
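A minimal sketch of the audio path, assuming the Hugging Face `transformers` pipelines; the checkpoint names are commonly used ones for the models named above and may differ from what the app actually loads.

```python
# Audio path sketch: transcribe speech with Whisper, then classify the
# transcript's emotion with the Cardiff NLP RoBERTa model. Checkpoint
# names are assumptions, not necessarily what app.py loads.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
text_emotion = pipeline("text-classification",
                        model="cardiffnlp/twitter-roberta-base-emotion")

transcript = asr("speech.wav")["text"]
print(text_emotion(transcript))  # e.g. [{'label': 'joy', 'score': 0.93}]
```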

## Instructions

1. Upload a video file (maximum length: 2 minutes).
2. The app analyzes both the audio and the visual components of the video and displays the detected emotions; a pre-processing sketch follows below.
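The README does not describe how the video is split into those components; the following is a hedged sketch of one way to do it with `moviepy` (both the library choice and the one-frame-per-second sampling are assumptions).

```python
# Pre-processing sketch with moviepy: enforce the 2-minute limit, write out
# the audio track, and sample one frame per second for the visual path.
from moviepy.editor import VideoFileClip
from PIL import Image

clip = VideoFileClip("upload.mp4")
assert clip.duration <= 120, "videos longer than 2 minutes are rejected"

clip.audio.write_audiofile("speech.wav")  # input for the audio pipeline

frames = [Image.fromarray(clip.get_frame(t))  # inputs for the visual pipeline
          for t in range(int(clip.duration))]
```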

## Models Used

The models were handpicked after extensive trials and are optimized for this task:

1. **Cardiff NLP RoBERTa**: emotion recognition from text
2. **Salesforce BLIP**: image captioning
3. **J-Hartmann DistilRoBERTa**: emotion recognition from the generated captions
4. **OpenAI Whisper**: speech-to-text transcription
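And a matching sketch of the visual path, again with assumed checkpoint names: BLIP captions a sampled frame, then the J-Hartmann text classifier scores the caption.

```python
# Visual path sketch: caption a sampled frame with BLIP, then classify the
# caption's emotion with the J-Hartmann DistilRoBERTa model. Checkpoint
# names are assumptions.
from transformers import pipeline
from PIL import Image

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")
caption_emotion = pipeline("text-classification",
                           model="j-hartmann/emotion-english-distilroberta-base")

frame = Image.open("frame.jpg")  # a frame sampled from the video
caption = captioner(frame)[0]["generated_text"]
print(caption_emotion(caption))  # e.g. [{'label': 'surprise', 'score': 0.88}]
```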

## Access the App

You can try the app here.

## License

This project is licensed under the MIT License.