---
license: mit
tags:
- neural-vocoder
- audio
---
# DisCoder: High-Fidelity Music Vocoder Using Neural Audio Codecs
[Paper](https://arxiv.org/abs/2502.12759) | [Samples](https://lucala.github.io/discoder/) | [Code](https://github.com/ETH-DISCO/discoder) | [Model](https://huggingface.co/disco-eth/discoder)
DisCoder is a neural vocoder that reconstructs high-fidelity 44.1 kHz audio from mel spectrograms. It uses a generative
adversarial encoder-decoder architecture informed by a neural audio codec. Our approach first transforms the mel
spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space, then
decodes that representation into an audio signal using a fine-tuned DAC decoder.
## Installation
The codebase has been tested with Python 3.11. To get started, clone the repository and set up the environment using Conda:
```shell
git clone https://github.com/ETH-DISCO/discoder
cd discoder
conda create -n discoder python=3.11
conda activate discoder
python -m pip install -r requirements.txt
```
## Inference with 🤗 Hugging Face
Use the following script to perform inference with the pretrained DisCoder model from Hugging Face.
The model uses the z prediction target and was trained using 128 mel bins.
```python
import torch
from discoder.models import DisCoder
from discoder import meldataset, utils
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
sr_target = 44100
# load pretrained DisCoder model
discoder = DisCoder.from_pretrained("disco-eth/discoder")
discoder = discoder.eval().to(device)
# load an audio file (resampled to 44.1 kHz if needed) and create mel spectrogram
audio, _ = meldataset.load_wav(full_path="path/to/audio.wav", sr_target=sr_target, resample=True, normalize=True)
audio = torch.tensor(audio).unsqueeze(dim=0).to(device)
mel = utils.get_mel_spectrogram_from_config(audio, discoder.config) # [B, 128, frames]
# reconstruct audio
with torch.no_grad():
    wav_recon = discoder(mel)  # [B, 1, time]
```
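The script above leaves the reconstruction in memory as `wav_recon`. As a minimal sketch of one way to save it, the helper below writes a mono signal to a 16-bit PCM WAV file using only the Python standard library. The name `write_wav_mono16` is hypothetical and not part of the DisCoder codebase, and the sketch assumes the samples lie roughly in [-1, 1]; values outside that range are clipped.

```python
import struct
import wave


def write_wav_mono16(path, samples, sample_rate=44100):
    """Write floats in [-1.0, 1.0] as a mono 16-bit PCM WAV file.

    Hypothetical helper for illustration; not part of DisCoder.
    Returns the number of samples written.
    """
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    with wave.open(str(path), "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16 bit
        f.setframerate(sample_rate)
        # Little-endian signed shorts, as required by the WAV format.
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)
```

With the variables from the script above, the first item of the batch could be saved with `write_wav_mono16("recon.wav", wav_recon[0, 0].cpu().tolist(), sr_target)`.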
## Training
To calculate [ViSQOL](https://github.com/google/visqol) during validation, install the required library by following the steps below:
```shell
cd discoder
git clone https://github.com/google/visqol
# build from inside the ViSQOL checkout (requires Bazel)
cd visqol
bazel build :visqol -c opt
pip install .
```
To start training, use the following command:
```shell
python -u train.py --config configs/config_z.json
```
## Inference
The inference script allows batch processing of audio files. It converts all WAV files in the specified `input_dir` to
mel spectrograms, then reconstructs them into audio files in the `output_dir`.
```shell
python -u inference.py --input_dir input_dir --output_dir output_dir --checkpoint_file model.pt --config configs/config_z.json
```
You can also pass the `normalize_volume` flag to standardize the output volume.
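To illustrate the directory-to-directory mapping described above, here is a minimal sketch of how input WAV files could be paired with output paths. It is not the code from `inference.py`: the function name `plan_reconstructions` is hypothetical, and the sketch assumes a flat (non-recursive) scan that keeps each file's name unchanged.

```python
from pathlib import Path


def plan_reconstructions(input_dir, output_dir):
    """Pair every WAV file in input_dir with a same-named path in output_dir.

    Hypothetical helper for illustration; the real script may differ,
    for example in how it names outputs or whether it recurses.
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    # Create the output directory if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep regular files with a .wav extension (case-insensitive), in sorted order.
    wavs = sorted(
        p for p in in_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )
    return [(src, out_dir / src.name) for src in wavs]
```

Each returned pair gives the file to convert to a mel spectrogram and the location where its reconstruction would be written.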