---
license: mit
datasets:
- fixie-ai/librispeech_asr
language:
- en
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
metrics:
- accuracy
library_name: transformers
tags:
- voice_phishing
- audio_classification
---
# Voice Detection AI - Real vs AI Audio Classifier

### **Model Overview**
This model is a fine-tuned Wav2Vec2-based audio classifier capable of distinguishing between **real human voices** and **AI-generated voices**. It has been trained on a dataset containing samples from various TTS models and real human audio recordings.

---
### **Model Details**
- **Architecture:** Wav2Vec2ForSequenceClassification
- **Fine-tuned on:** Custom dataset with real and AI-generated audio
- **Classes:**
1. Real Human Voice
2. AI-generated Voice (e.g., MelGAN, DiffWave)
- **Input Requirements:**
- Audio format: `.wav`, `.mp3`, etc.
- Sample rate: 16kHz
- Max duration: 10 seconds (longer clips are truncated, shorter ones are padded); see the preprocessing sketch below
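
For reference, here is a minimal sketch of how these input requirements can be applied with `torchaudio` (the helper name `conform_audio` is illustrative, not part of the released code):

```python
import torch
import torchaudio

TARGET_SR = 16_000            # model expects 16 kHz audio
MAX_SAMPLES = TARGET_SR * 10  # 10-second input window

def conform_audio(path: str) -> torch.Tensor:
    """Load an audio file and conform it to the model's input requirements."""
    waveform, sr = torchaudio.load(path)
    # Mix multi-channel audio down to mono
    if waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Resample to 16 kHz if necessary
    if sr != TARGET_SR:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=TARGET_SR)(waveform)
    # Truncate clips longer than 10 s; zero-pad shorter ones
    if waveform.size(1) > MAX_SAMPLES:
        waveform = waveform[:, :MAX_SAMPLES]
    else:
        waveform = torch.nn.functional.pad(waveform, (0, MAX_SAMPLES - waveform.size(1)))
    return waveform
```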
---
### **Performance**
- **Robustness:** Reliably classifies audio generated by the AI models represented in the training data.
- **Limitations:** Struggles with some unseen AI-generation models (e.g., ElevenLabs).
---
### **How to Use**
#### **1. Install Dependencies**
Make sure you have `transformers` and `torch` installed:
```bash
pip install transformers torch torchaudio
```
#### **2. Run Inference**
Here's how to use VoiceGUARD for audio classification:
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor

# Load the model and processor from the Hugging Face Hub
model_name = "Mrkomiljon/voiceGUARD"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)
model.eval()

# Load the audio file
waveform, sample_rate = torchaudio.load("path_to_audio_file.wav")

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Preprocess: normalize the raw waveform and build model inputs
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

# Inference
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

# Map the predicted class index to a human-readable label
labels = ["Real Human Voice", "AI-generated"]
prediction = labels[predicted_ids.item()]
print(f"Prediction: {prediction}")
```
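
If you also want a confidence score alongside the label, the logits can be passed through a softmax (a small extension of the script above, not part of the original example):

```python
# Convert logits to class probabilities and report the winning class's score
probs = torch.softmax(logits, dim=-1)
confidence = probs[0, predicted_ids.item()].item()
print(f"Prediction: {prediction} (confidence: {confidence:.2%})")
```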
## Training Procedure
- **Data Collection:** Compiled a balanced dataset of real human voices and AI-generated samples from various TTS models.
- **Preprocessing:** Standardized audio formats, resampled to 16 kHz, and adjusted durations to 10 seconds.
- **Fine-Tuning:** Fine-tuned the Wav2Vec2 architecture for sequence classification over 3 epochs with a learning rate of 1e-5; a training sketch follows this list.
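
The exact training script is not included in this card; the sketch below shows what such a fine-tuning run could look like with the Hugging Face `Trainer`, where `train_dataset` is a placeholder for a dataset of preprocessed 10-second, 16 kHz clips with integer labels:

```python
from transformers import Wav2Vec2ForSequenceClassification, Trainer, TrainingArguments

# Two classes: 0 = Real Human Voice, 1 = AI-generated
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="./voiceguard-checkpoints",  # placeholder path
    num_train_epochs=3,                     # as stated above
    learning_rate=1e-5,                     # as stated above
    per_device_train_batch_size=8,          # assumed; not stated in this card
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: preprocessed 16 kHz clips with labels
)
trainer.train()
```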
## Evaluation
- **Metrics:** Accuracy, Precision, Recall (a `compute_metrics` sketch follows below)
- **Results:** Achieved 99.8% accuracy on the held-out evaluation set.
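
A typical way to compute these metrics for a binary classifier with `scikit-learn` (a sketch in the format expected by the Hugging Face `Trainer`, not the project's actual evaluation code):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def compute_metrics(eval_pred):
    """Compute accuracy, precision, and recall from Trainer predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
    }
```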
## Limitations and Future Work
- While VoiceGUARD performs robustly across known AI-generation models, it may encounter challenges with novel or unseen models.
- Future work includes expanding the training dataset with samples from emerging TTS technologies to enhance generalization.
## License
This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgements
* Special thanks to the developers of the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model and the contributors to the datasets used in this project.
* View the complete project on [GitHub](https://github.com/Mrkomiljon/VoiceGUARD2)