Spaces:
Running
FAcodec
Pytorch implementation for the training of FAcodec, which was proposed in paper NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
A dedicated repository for the FAcodec model can also be find here.
This implementation made some key improvements to the training pipeline, so that the requirements of any form of annotations, including
transcripts, phoneme alignments, and speaker labels, are eliminated. All you need are simply raw speech files.
With the new training pipeline, it is possible to train the model on more languages with more diverse timbre distributions.
We release the code for training and inference, including a pretrained checkpoint on 50k hours speech data with over 1 million speakers.
Model storage
We provide pretrained checkpoints on 50k hours speech data.
Demo
Training
Prepare your data and put them under one folder, internal file structure does not matter.
Then, change the dataset
in ./egs/codec/FAcodec/exp_custom_data.json
to the path of your data folder.
Finally, run the following command:
sh ./egs/codec/FAcodec/train.sh
Inference
To reconstruct a speech file, run:
python ./bins/codec/inference.py --source <source_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
To use zero-shot voice conversion, run:
python ./bins/codec/inference.py --source <source_wav> --reference <reference_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
Feature extraction
When running ./bins/codec/inference.py
, check the returned results of the FAcodecInference
class: a tuple of (quantized, codes)
quantized
is the quantized representation of the input speech file.quantized[0]
is the quantized representation of prosodyquantized[1]
is the quantized representation of contentcodes
is the discrete code representation of the input speech file.codes[0]
is the discrete code representation of prosodycodes[1]
is the discrete code representation of content
For the most clean content representation without any timbre, we suggest to use codes[1][:, 0, :]
, which is the first layer of content codebooks.