dg845/univnet-dev

The UnivNet model is a state-of-the-art neural vocoder that synthesizes audio waveforms from full-band mel spectrograms. It was introduced in "UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation" by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. UnivNet is a generative adversarial network (GAN): the generator is trained to convert log-mel spectrograms (real or, during training, fake) into waveforms, and the discriminator is trained to classify input waveforms as real or fake. From the original paper abstract:

Most neural vocoders employ band-limited mel-spectrograms to generate waveforms. If full-band spectral features are used as the input, the vocoder can be provided with as much acoustic information as possible. However, in some models employing full-band mel-spectrograms, an over-smoothing problem occurs as part of which non-sharp spectrograms are generated. To address this problem, we propose UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time. Inspired by works in the field of voice activity detection, we added a multi-resolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets. Using full-band mel-spectrograms as input, we expect to generate high-resolution signals by adding a discriminator that employs spectrograms of multiple resolutions as the input. In an evaluation on a dataset containing information on hundreds of speakers, UnivNet obtained the best objective and subjective results among competing models for both seen and unseen speakers. These results, including the best subjective score for text-to-speech, demonstrate the potential for fast adaptation to new speakers without a need for training from scratch.
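
To make the "multiple linear spectrogram magnitudes computed using various parameter sets" in the abstract concrete, here is a minimal NumPy sketch that computes magnitude spectrograms of one signal at several STFT resolutions. It is only an illustration of the idea, not the discriminator itself and not code from the paper or from transformers. The (n_fft, hop_length, win_length) triples, the 24 kHz sampling rate, and the function names are assumptions made for this example.

```python
import numpy as np

# Assumed (n_fft, hop_length, win_length) triples; check the training config you use.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]


def stft_magnitude(signal, n_fft, hop_length, win_length):
    """Linear magnitude spectrogram of shape (num_frames, n_fft // 2 + 1)."""
    window = np.hanning(win_length)
    num_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack(
        [signal[i * hop_length : i * hop_length + win_length] * window for i in range(num_frames)]
    )
    # rfft zero-pads each windowed frame from win_length up to n_fft samples.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


def multi_resolution_spectrograms(signal, resolutions=RESOLUTIONS):
    """One magnitude spectrogram per (n_fft, hop_length, win_length) triple."""
    return [stft_magnitude(signal, *res) for res in resolutions]


# One second of a 1 kHz tone at an assumed 24 kHz sampling rate.
sampling_rate = 24000
t = np.arange(sampling_rate) / sampling_rate
tone = np.sin(2 * np.pi * 1000 * t)
for spec in multi_resolution_spectrograms(tone):
    print(spec.shape)
```

Each resolution trades time detail against frequency detail: a short window with a small hop gives many frames with few frequency bins, and a long window gives the opposite. These spectrograms are used only on the discriminator side during GAN training.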

Currently, only the generator/vocoder part of the model is implemented.

This checkpoint was released as part of an unofficial implementation by maum-ai (on which the transformers implementation is also based). As far as I know, there is no official model or code release by the original authors from Kakao Enterprise.

Download

The original PyTorch model checkpoints from the maum-ai/univnet implementation can be downloaded from their GitHub repo. Note that this checkpoint corresponds to their c32 checkpoint.

The transformers model and feature extractor (to prepare inputs for the model) can be downloaded as follows:

from transformers import UnivNetFeatureExtractor, UnivNetModel

model_id_or_path = "dg845/univnet-dev"
feature_extractor = UnivNetFeatureExtractor.from_pretrained(model_id_or_path)
model = UnivNetModel.from_pretrained(model_id_or_path)

Usage

The original model checkpoints can be used with the maum-ai/univnet codebase.

An example of using the UnivNet model with transformers is as follows:

import torch
from scipy.io.wavfile import write
from datasets import Audio, load_dataset

from transformers import UnivNetFeatureExtractor, UnivNetModel

model_id_or_path = "dg845/univnet-dev"
model = UnivNetModel.from_pretrained(model_id_or_path)
feature_extractor = UnivNetFeatureExtractor.from_pretrained(model_id_or_path)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# Resample the audio to the model and feature extractor's sampling rate.
ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
# Pad the end of the converted waveforms to reduce artifacts at the end of the output audio samples.
inputs = feature_extractor(
    ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], pad_end=True, return_tensors="pt"
)

with torch.no_grad():
    audio = model(**inputs)

# Remove the extra padding at the end of the output.
audio = feature_extractor.batch_decode(**audio)[0]
# Write the decoded waveform to a WAV file.
write("sample_audio.wav", feature_extractor.sampling_rate, audio)
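
The decoding step in the example above removes padding from the end of each generated waveform. The following NumPy sketch illustrates that idea on a toy batch. It is not the library's implementation: `trim_padded_waveforms` is a hypothetical helper, and it assumes the valid length of each waveform is already known.

```python
import numpy as np


def trim_padded_waveforms(waveforms, waveform_lengths):
    """Cut each row of a padded (batch, max_length) array down to its valid length."""
    return [
        np.asarray(waveform)[: int(length)]
        for waveform, length in zip(waveforms, waveform_lengths)
    ]


# Toy batch: two waveforms padded with zeros to a common length of 8 samples.
batch = np.zeros((2, 8))
batch[0, :5] = 0.1  # first waveform has 5 valid samples
batch[1, :8] = 0.2  # second waveform fills the whole row
trimmed = trim_padded_waveforms(batch, [5, 8])
print([len(waveform) for waveform in trimmed])
```

The result is a list rather than a single array, because the trimmed waveforms generally have different lengths.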

Model Details

Model type: Vocoder (spectrogram-to-waveform) model, trained as the generator of a GAN
Dataset: LibriTTS
License: BSD-3-Clause
Model Description: This model is a vocoder: it maps log-mel spectrograms to audio waveforms. Its main component, which parameterizes the vocoder, is a ResNet based on location-variable convolutions. The model was trained as the generator of a generative adversarial network (GAN).
Resources for more information: Paper, unofficial implementation
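
The location-variable convolution mentioned in the model description applies a different convolution kernel to each time segment of its input, where an ordinary convolution reuses one kernel everywhere. Below is a simplified NumPy sketch of that core operation. It is an illustration under stated assumptions, not the transformers or maum-ai implementation: in the real model the per-segment kernels are predicted from the spectrogram by a separate network, and details such as biases and gated activations are omitted here. The function name and argument layout are hypothetical.

```python
import numpy as np


def location_variable_conv(x, kernels, hop):
    """Apply a separate 1D kernel to each time segment of x.

    x:       (in_channels, total_length), with total_length == num_segments * hop
    kernels: (num_segments, out_channels, in_channels, kernel_size), odd kernel_size
    returns: (out_channels, total_length)
    """
    in_channels, total_length = x.shape
    num_segments, out_channels, kernel_in, kernel_size = kernels.shape
    assert kernel_in == in_channels
    assert total_length == num_segments * hop
    assert kernel_size % 2 == 1

    # Zero-pad so the output has the same length as the input.
    pad = (kernel_size - 1) // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((out_channels, total_length))
    for s in range(num_segments):
        # Input samples for this segment plus the context the kernel needs.
        window = x_pad[:, s * hop : s * hop + hop + kernel_size - 1]
        # Sliding patches of shape (in_channels, hop, kernel_size).
        patches = np.lib.stride_tricks.sliding_window_view(window, kernel_size, axis=1)
        # Cross-correlate the patches with this segment's own kernel.
        out[:, s * hop : (s + 1) * hop] = np.einsum("oik,itk->ot", kernels[s], patches)
    return out


# Toy example: segment 0 uses an identity kernel, segment 1 uses an all-zero kernel,
# so the first four samples are copied and the last four are zeroed.
x = np.arange(1.0, 9.0).reshape(1, 8)
kernels = np.zeros((2, 1, 1, 3))
kernels[0, 0, 0, 1] = 1.0
print(location_variable_conv(x, kernels, hop=4))
```

When every segment receives the same kernel, the operation reduces to an ordinary same-padded 1D convolution, which is a convenient sanity check for a sketch like this.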