---
license: mit
datasets:
- fixie-ai/librispeech_asr
language:
- en
base_model:
- facebook/wav2vec2-base
pipeline_tag: voice-activity-detection
---
# Voice Detection AI - Real vs AI Audio Classifier

## Model Overview
This model is a fine-tuned Wav2Vec2-based audio classifier capable of distinguishing between real human voices and AI-generated voices. It has been trained on a dataset containing samples from various TTS models and real human audio recordings.
## Model Details
- Architecture: Wav2Vec2ForSequenceClassification
- Fine-tuned on: Custom dataset with real and AI-generated audio
- Classes:
  - Real Human Voice
  - AI-generated (e.g., MelGAN, DiffWave)
- Input Requirements:
  - Audio format: `.wav`, `.mp3`, etc.
  - Sample rate: 16 kHz
  - Max duration: 10 seconds (longer audio is truncated, shorter audio is padded; see the preprocessing sketch below)
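As a concrete illustration of these requirements, here is a minimal preprocessing sketch using `torchaudio`. The helper name `load_and_prepare` and the assumption of a local file path are illustrative, not part of this model's API:

```python
import torch
import torchaudio

TARGET_SR = 16_000            # model expects 16 kHz input
MAX_SAMPLES = TARGET_SR * 10  # 10-second cap

def load_and_prepare(path: str) -> torch.Tensor:
    """Load an audio file, downmix to mono, resample to 16 kHz,
    and pad or truncate to exactly 10 seconds."""
    waveform, sr = torchaudio.load(path)  # shape: (channels, samples)
    waveform = waveform.mean(dim=0)       # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    if waveform.numel() > MAX_SAMPLES:
        waveform = waveform[:MAX_SAMPLES]  # truncate long clips
    else:
        pad = MAX_SAMPLES - waveform.numel()
        waveform = torch.nn.functional.pad(waveform, (0, pad))  # zero-pad short clips
    return waveform
```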
## Performance
- Validation Accuracy: 99.8%
- Robustness: Correctly classifies audio produced by multiple AI-generation models.
- Limitations: Struggles with certain unseen AI-generation models (e.g., ElevenLabs).
## How to Use

### 1. Install Dependencies
Make sure you have `transformers`, `torch`, and `torchaudio` installed:
```bash
pip install transformers torch torchaudio
```
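### 2. Run Inference

With the dependencies installed, inference looks roughly like the sketch below. This is a minimal sketch, assuming the checkpoint is published on the Hub: the repo id `your-username/voice-detection-ai` is a placeholder for the actual checkpoint path, and the label names come from the model's `id2label` config.

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

MODEL_ID = "your-username/voice-detection-ai"  # placeholder repo id

extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Load a clip, downmix to mono, and resample to the expected 16 kHz.
waveform, sr = torchaudio.load("sample.wav")
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sr, 16_000)

# The feature extractor pads/truncates to the model's 10-second limit.
inputs = extractor(
    waveform.numpy(),
    sampling_rate=16_000,
    max_length=16_000 * 10,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # e.g. "Real Human Voice" vs. "AI-generated"
```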