### **Inference**

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor

# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Mrkomiljon/voiceGUARD")
processor = Wav2Vec2Processor.from_pretrained("Mrkomiljon/voiceGUARD")

# Inference function
def classify_audio(audio_path, target_sample_rate=16000):
    # Load and resample audio to the expected sample rate
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        waveform = resampler(waveform)

    # Ensure audio is exactly 10 seconds long (truncate or pad)
    max_length = target_sample_rate * 10
    if waveform.size(1) > max_length:
        waveform = waveform[:, :max_length]
    elif waveform.size(1) < max_length:
        waveform = torch.nn.functional.pad(waveform, (0, max_length - waveform.size(1)))

    # Process input
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=target_sample_rate, return_tensors="pt")
    inputs = {key: val.to(model.device) for key, val in inputs.items()}

    # Perform inference
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_label = torch.argmax(logits, dim=-1).item()
    return predicted_label

# Example usage
audio_path = "path_to_audio_file.wav"
label = classify_audio(audio_path)
print(f"Predicted Label: {label}")
```
---
license: mit
datasets:
- fixie-ai/librispeech_asr
language:
- en
base_model:
- facebook/wav2vec2-base
pipeline_tag: voice-activity-detection
---

# Voice Detection AI - Real vs AI Audio Classifier

### **Model Overview**
This model is a fine-tuned Wav2Vec2-based audio classifier capable of distinguishing between **real human voices** and **AI-generated voices**. It has been trained on a dataset containing samples from various TTS models and real human audio recordings.

---

### **Model Details**
- **Architecture:** Wav2Vec2ForSequenceClassification
- **Fine-tuned on:** Custom dataset with real and AI-generated audio
- **Classes:**
  1. Real Human Voice
  2. AI-generated (e.g., MelGAN, DiffWave, etc.)
- **Input Requirements** (see the preprocessing sketch below):
  - Audio format: `.wav`, `.mp3`, etc.
  - Sample rate: 16 kHz
  - Max duration: 10 seconds (longer clips are truncated, shorter ones are padded)
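
Recordings that do not already match these requirements (e.g., stereo or 44.1 kHz files) need to be downmixed and resampled before classification. Below is a minimal preprocessing sketch with `torchaudio`; the `prepare_clip` helper name is hypothetical and simply mirrors the mono, 16 kHz, 10-second convention described above:

```python
import torch
import torchaudio

def prepare_clip(audio_path, target_sample_rate=16000, max_seconds=10):
    """Hypothetical helper: load a clip, downmix to mono, resample to 16 kHz,
    and pad/truncate it to a fixed 10-second window."""
    waveform, sample_rate = torchaudio.load(audio_path)    # shape: (channels, samples)
    if waveform.size(0) > 1:                               # downmix multi-channel audio to mono
        waveform = waveform.mean(dim=0, keepdim=True)
    if sample_rate != target_sample_rate:                  # resample to the expected 16 kHz
        waveform = torchaudio.transforms.Resample(sample_rate, target_sample_rate)(waveform)
    max_length = target_sample_rate * max_seconds          # enforce the fixed-length window
    if waveform.size(1) > max_length:
        waveform = waveform[:, :max_length]
    else:
        waveform = torch.nn.functional.pad(waveform, (0, max_length - waveform.size(1)))
    return waveform
```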

---

### **Performance**
- **Validation Accuracy:** 99.8%
- **Robustness:** Successfully classifies audio from multiple AI-generation models.
- **Limitations:** Struggles with certain unseen AI-generation models (e.g., ElevenLabs).

---

### **How to Use**

#### **1. Install Dependencies**
Make sure you have `transformers`, `torch`, and `torchaudio` installed:
```bash
pip install transformers torch torchaudio