---
language: multilingual
license: apache-2.0
datasets:
- voxceleb2
libraries:
- speechbrain
- librosa
tags:
- age-estimation
- speaker-characteristics
- speaker-recognition
- audio-regression
- voice-analysis
---
# Age Estimation Model
This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with a Support Vector Regression (SVR) model to predict speaker age from audio input. It takes ECAPA embeddings and Librosa acoustic features as input and was trained on the VoxCeleb2 dataset.
## Model Performance Comparison
We provide several pre-trained models with different architectures and feature sets. The table below compares their test MAE:
| Model | Architecture | Features | Training Data | Test MAE | Best For |
|-------|-------------|----------|---------------|-----------|----------|
| VoxCeleb2 SVR (223) | SVR | ECAPA + Librosa (223-dim) | VoxCeleb2 | 7.88 years | Best performance on VoxCeleb2 |
| VoxCeleb2 SVR (192) | SVR | ECAPA only (192-dim) | VoxCeleb2 | 7.89 years | Lightweight deployment |
| TIMIT ANN (192) | ANN | ECAPA only (192-dim) | TIMIT | 4.95 years | Clean studio recordings |
| Combined ANN (223) | ANN | ECAPA + Librosa (223-dim) | VoxCeleb2 + TIMIT | 6.93 years | Best general performance |
Other models are available [here](https://huggingface.co/griko).
## Model Details
- Input: Audio file (converted to 16 kHz mono before processing)
- Output: Predicted age in years (continuous value)
- Features:
  - SpeechBrain ECAPA-TDNN embedding [192 features]
  - Additional Librosa features [31 features]
- Regressor: Support Vector Regression, with hyperparameters tuned using Optuna
- Performance:
  - VoxCeleb2 test set: 7.88 years Mean Absolute Error (MAE)
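For reference, MAE is the average absolute difference between predicted and true ages. A minimal sketch in plain Python follows; the function name and the sample ages are illustrative and do not come from the model's evaluation code.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages, in years."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("true_ages and predicted_ages must have the same length")
    if not true_ages:
        raise ValueError("at least one sample is required")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Made-up ages for illustration only (not VoxCeleb2 data)
true_ages = [30.0, 45.0, 60.0]
predicted_ages = [36.0, 41.0, 62.0]
print(mean_absolute_error(true_ages, predicted_ages))  # (6 + 4 + 2) / 3 = 4.0
```

An MAE of 7.88 years therefore means that predictions on the test set are off by just under eight years on average, in either direction.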
## Features
1. SpeechBrain ECAPA-TDNN embeddings (192 dimensions)
2. Librosa acoustic features (31 dimensions):
   - 13 MFCCs
   - 13 Delta MFCCs
   - Zero crossing rate
   - Spectral centroid
   - Spectral bandwidth
   - Spectral contrast
   - Spectral flatness
## Training Data
The model was trained on the VoxCeleb2 dataset:
- Audio preprocessing:
  - Converted to WAV format, single channel, 16kHz sampling rate
  - Applied SileroVAD for voice activity detection, taking the first voiced segment
- Age data was collected from Wikidata and public sources
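Taking the first voiced segment can be sketched as below. The example assumes the VAD output is a list of dicts with `start` and `end` sample indices, which is the format Silero VAD's `get_speech_timestamps` returns by default. The helper name `first_voiced_segment` and the timestamps in the demo are made up for illustration, and the actual training code may have handled this step differently.

```python
import numpy as np


def first_voiced_segment(waveform: np.ndarray, speech_timestamps: list) -> np.ndarray:
    """Return the samples of the earliest detected speech segment.

    speech_timestamps: list of {"start": int, "end": int} sample indices
    (assumed VAD output format).
    """
    if not speech_timestamps:
        raise ValueError("no voiced segment detected")
    first = min(speech_timestamps, key=lambda seg: seg["start"])
    return waveform[first["start"]:first["end"]]


# Two seconds of placeholder audio at 16 kHz and hand-written timestamps
waveform = np.arange(32000, dtype=np.float32)
timestamps = [{"start": 4000, "end": 12000}, {"start": 20000, "end": 30000}]
segment = first_voiced_segment(waveform, timestamps)
print(len(segment) / 16000)  # 0.5 (seconds)
```

Raising an error when no speech is found is a design choice of this sketch; a batch job might prefer to skip such files instead.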
## Installation
```bash
pip install git+https://github.com/griko/voice-age-regression.git#egg=voice-age-regressor[svr-ecapa-librosa-voxceleb2]
```
## Usage
```python
from age_regressor import AgeRegressionPipeline

# Load the pipeline
regressor = AgeRegressionPipeline.from_pretrained(
    "griko/age_reg_svr_ecapa_librosa_voxceleb2"
)

# Single file prediction
result = regressor("path/to/audio.wav")
print(f"Predicted age: {result[0]:.1f} years")

# Batch prediction
results = regressor(["audio1.wav", "audio2.wav"])
print(f"Predicted ages: {[f'{age:.1f}' for age in results]} years")
```
## Limitations
- The model was trained on celebrity voices from YouTube interview recordings
- Performance may vary on different audio qualities or recording conditions
- Age predictions are estimates and should not be used for medical or legal purposes
- Age estimations should be treated as approximate values, not exact measurements
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{koushnir2025vanpyvoiceanalysisframework,
      title={VANPY: Voice Analysis Framework},
      author={Gregory Koushnir and Michael Fire and Galit Fuhrmann Alpert and Dima Kagan},
      year={2025},
      eprint={2502.17579},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2502.17579},
}
```