
Textless speech emotion conversion using decomposed and discrete representations

Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

abstract: Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples and code will be publicly available under the following link: https://speechbot.github.io/emotion.

Installation

First, create a conda virtual environment and activate it:

conda create -n emotion python=3.8 -y
conda activate emotion

Then, clone this repository:

git clone https://github.com/facebookresearch/fairseq.git
cd fairseq/examples/emotion_conversion
git clone https://github.com/felixkreuk/speech-resynthesis

Next, download the EmoV discrete tokens:

wget https://dl.fbaipublicfiles.com/textless_nlp/emotion_conversion/data.tar.gz  # (still in fairseq/examples/emotion_conversion)
tar -xzvf data.tar.gz

Your fairseq/examples/emotion_conversion directory should look like this:

drwxrwxr-x 3 felixkreuk felixkreuk   0 Feb  6  2022 data
drwxrwxr-x 3 felixkreuk felixkreuk   0 Sep 28 10:41 emotion_models
drwxr-xr-x 3 felixkreuk felixkreuk   0 Jun 29 05:43 fairseq_models
drwxr-xr-x 3 felixkreuk felixkreuk   0 Sep 28 10:41 preprocess
-rw-rw-r-- 1 felixkreuk felixkreuk 11K Dec  5 09:00 README.md
-rw-rw-r-- 1 felixkreuk felixkreuk  88 Mar  6  2022 requirements.txt
-rw-rw-r-- 1 felixkreuk felixkreuk 13K Jun 29 06:26 synthesize.py

Lastly, install fairseq and the other packages:

pip install --editable ./
pip install -r examples/emotion_conversion/requirements.txt

Data preprocessing

Convert your audio to discrete representations

Please follow the steps described here. To generate the same discrete representations, please use the following (a sketch of this extraction step is shown after the list):

  1. HuBERT checkpoint
  2. k-means model at data/hubert_base_ls960_layer9_clusters200/data_hubert_base_ls960_layer9_clusters200.bin
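
For intuition only, here is a minimal sketch of what this extraction step does. It is not the fairseq script itself: it uses torchaudio's HUBERT_BASE as a stand-in for the checkpoint above, and it assumes the provided .bin file is a joblib-serialized scikit-learn k-means model (as in fairseq's GSLM speech2unit tooling).

import joblib
import torch
import torchaudio

# Stand-in HuBERT-Base; use the checkpoint above for exact reproduction.
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

# Assumption: the .bin file is a joblib-serialized scikit-learn k-means model.
kmeans = joblib.load("data/hubert_base_ls960_layer9_clusters200/data_hubert_base_ls960_layer9_clusters200.bin")

wav, sr = torchaudio.load("example.wav")                      # hypothetical input file
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.no_grad():
    feats, _ = model.extract_features(wav, num_layers=9)      # outputs of the first 9 transformer layers
    layer9 = feats[-1].squeeze(0)                             # (frames, 768) layer-9 features

units = kmeans.predict(layer9.numpy())                        # one cluster id (0..199) per frame
print(" ".join(map(str, units)))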

Construct data splits

This step will use the discrete representations from the previous step and split them into train/valid/test sets for 3 tasks:

  1. Translation model pre-training (BART language denoising)
  2. Translation model training (content units emotion translation mechanism)
  3. HiFiGAN model training (for synthesizing audio from discrete representations)

Your processed data should be at data/:

  1. hubert_base_ls960_layer9_clusters200 - discrete representations extracted using HuBERT layer 9, clustered into 200 clusters.
  2. data.tsv - a tsv manifest pointing to the EmoV dataset in your environment (edit the first line of this file so it points to your local dataset path; see the sketch after this list).
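
For reference, data.tsv follows the standard fairseq manifest layout: the first line is the audio root directory, followed by one "<relative/path.wav>\t<num_samples>" line per file. A small sketch for repointing it, assuming that layout:

from pathlib import Path

# Assumption: line 1 of the manifest is the audio root directory.
manifest = Path("data/data.tsv")
lines = manifest.read_text().splitlines()
lines[0] = "/path/to/EmoV"            # replace with your local EmoV location
manifest.write_text("\n".join(lines) + "\n")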

The following command will create the above splits:

python examples/emotion_conversion/preprocess/create_core_manifest.py \
    --tsv data/data.tsv \
    --emov-km data/hubert_base_ls960_layer9_clusters200/data.km \
    --km data/hubert_base_ls960_layer9_clusters200/vctk.km \
    --dict data/hubert_base_ls960_layer9_clusters200/dict.txt \
    --manifests-dir $DATA
  • Set $DATA as the directory that will contain the processed data.

Extract F0

To train the HiFiGAN vocoder we need to first extract the F0 curves:

python examples/emotion_conversion/preprocess/extract_f0.py \
    --tsv data/data.tsv \
    --extractor pyaapt
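
For intuition, a minimal sketch of what a pYAAPT-based extractor does per file, using the amfm_decompy package. This is an assumption about extract_f0.py's internals; the script's exact parameters may differ.

import amfm_decompy.basic_tools as basic
import amfm_decompy.pYAAPT as pYAAPT

signal = basic.SignalObj("example.wav")                          # hypothetical input file
# Frame settings are illustrative; the script's actual values may differ.
pitch = pYAAPT.yaapt(signal, frame_length=20.0, frame_space=5.0)
f0 = pitch.samp_values                                           # per-frame F0 in Hz (0 = unvoiced)
print(f0.shape)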

HiFiGAN training

Now we are all set to train the HiFiGAN vocoder:

python examples/emotion_conversion/speech-resynthesis/train.py \
    --checkpoint_path <hifigan-checkpoint-dir> \
    --config examples/emotion_conversion/speech-resynthesis/configs/EmoV/emov_hubert-layer9-cluster200_fixed-spkr-embedder_f0-raw_gst.json

Translation Pre-training

Before translating emotions, we first need to pre-train the translation model as a denoising autoencoder (similarly to BART).

python train.py \
    $DATA/fairseq-data/emov_multilingual_denoising_cross-speaker_dedup_nonzeroshot/tokenized \
    --save-dir <your-save-dir> \
    --tensorboard-logdir <your-tb-dir> \
    --langs neutral,amused,angry,sleepy,disgusted,vctk.km \
    --dataset-impl mmap \
    --task multilingual_denoising \
    --arch transformer_small --criterion cross_entropy \
    --multilang-sampling-alpha 1.0 --sample-break-mode eos --max-tokens 16384 \
    --update-freq 1 --max-update 3000000 \
    --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.0 \
    --optimizer adam --weight-decay 0.01 --adam-eps 1e-06 \
    --clip-norm 0.1 --lr-scheduler polynomial_decay --lr 0.0003 \
    --total-num-update 3000000 --warmup-updates 10000 --fp16 \
    --poisson-lambda 3.5 --mask 0.3 --mask-length span-poisson --replace-length 1 --rotate 0 --mask-random 0.1 --insert 0 --permute-sentences 1.0 \
    --skip-invalid-size-inputs-valid-test \
    --user-dir examples/emotion_conversion/fairseq_models
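
To make the noising flags above concrete (--mask 0.3, --poisson-lambda 3.5, --replace-length 1), here is a simplified stand-in for the BART-style span masking they configure; fairseq's actual denoising dataset code is more involved.

import numpy as np

def span_mask(tokens, mask_ratio=0.3, poisson_lambda=3.5, mask_token="<mask>", seed=0):
    """Mask roughly mask_ratio of the tokens in Poisson-length spans,
    collapsing each span to a single mask token."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    n_to_mask = int(round(len(tokens) * mask_ratio))
    out, i, masked = [], 0, 0
    while i < len(tokens):
        if masked < n_to_mask and rng.random() < mask_ratio:
            span = min(max(1, rng.poisson(poisson_lambda)), len(tokens) - i)
            out.append(mask_token)      # whole span replaced by one mask token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

print(span_mask("12 54 54 7 199 3 88 12 45 6".split()))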

Translation Training

Now we are ready to train our emotion translation model:

python train.py \
    --distributed-world-size 1 \
    $DATA/fairseq-data/emov_multilingual_translation_cross-speaker_dedup/tokenized/ \
    --save-dir <your-save-dir> \
    --tensorboard-logdir <your-tb-dir> \
    --arch multilingual_small --task multilingual_translation \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
    --lang-pairs neutral-amused,neutral-sleepy,neutral-disgusted,neutral-angry,amused-sleepy,amused-disgusted,amused-neutral,amused-angry,angry-amused,angry-sleepy,angry-disgusted,angry-neutral,disgusted-amused,disgusted-sleepy,disgusted-neutral,disgusted-angry,sleepy-amused,sleepy-neutral,sleepy-disgusted,sleepy-angry \
    --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --lr 1e-05 --clip-norm 0 --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --warmup-updates 2000 --lr-scheduler inverse_sqrt \
    --max-tokens 4096 --update-freq 1 --max-update 100000 \
    --required-batch-size-multiple 8 --fp16 --num-workers 4 \
    --seed 2 --log-format json --log-interval 25 --save-interval-updates 1000 \
    --no-epoch-checkpoints --keep-best-checkpoints 1 --keep-interval-updates 1 \
    --finetune-from-model <path-to-model-from-previous-step> \
    --user-dir examples/emotion_conversion/fairseq_models
  • To share encoders/decoders use the --share-encoders and --share-decoders flags.
  • To add source/target emotion tokens use the --encoder-langtok {'src'|'tgt'} and --decoder-langtok flags.
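
As a side note, a minimal sketch of the label-smoothed cross entropy selected above (--label-smoothing 0.2): the target distribution keeps most of the mass on the gold unit and spreads the rest over the vocabulary. This is simplified relative to fairseq's criterion, which also handles padding and per-token reduction.

import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, target, eps=0.2):
    lprobs = F.log_softmax(logits, dim=-1)
    nll = -lprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)   # gold-token loss
    smooth = -lprobs.mean(dim=-1)                                # uniform-smoothing loss
    return ((1.0 - eps) * nll + eps * smooth).mean()

logits = torch.randn(4, 200)            # 4 positions, 200-unit vocabulary
target = torch.randint(0, 200, (4,))
print(label_smoothed_nll(logits, target).item())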

F0-predictor Training

The following command trains the F0 prediction module:

cd examples/emotion_conversion
python -m emotion_models.pitch_predictor n_tokens=200 \
    train_tsv="$DATA/denoising/emov/train.tsv" \
    train_km="$DATA/denoising/emov/train.km" \
    valid_tsv="$DATA/denoising/emov/valid.tsv" \
    valid_km="$DATA/denoising/emov/valid.km"
  • See hydra.run.dir to configure the directory for saving models.

Duration-predictor Training

The following command trains the duration prediction modules:

cd examples/emotion_conversion
for emotion in "neutral" "amused" "angry" "disgusted" "sleepy"; do
    python -m emotion_models.duration_predictor n_tokens=200 substring=$emotion \
        train_tsv="$DATA/denoising/emov/train.tsv" \
        train_km="$DATA/denoising/emov/train.km" \
        valid_tsv="$DATA/denoising/emov/valid.tsv" \
        valid_km="$DATA/denoising/emov/valid.km"
done
  • See hydra.run.dir to configure the directory for saving models.
  • After the above command you should have 5 duration models in your checkpoint directory:
❯ ll duration_predictor/
total 21M
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 amused.ckpt
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 angry.ckpt
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 disgusted.ckpt
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 neutral.ckpt
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 sleepy.ckpt
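
For context, duration targets for a unit sequence are typically obtained by run-length encoding the frame-level units; this is an assumption about the preprocessing behind emotion_models.duration_predictor, shown here only as an illustration.

from itertools import groupby

def dedup_with_durations(units):
    """Collapse repeated units and return (unit, duration) pairs."""
    return [(u, len(list(g))) for u, g in groupby(units)]

frames = [17, 17, 17, 4, 4, 101, 101, 101, 101, 4]
print(dedup_with_durations(frames))   # [(17, 3), (4, 2), (101, 4), (4, 1)]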

Token Generation

The following command uses fairseq-generate to generate the token sequences based on the source and target emotions.

fairseq-generate \
    $DATA/fairseq-data/emov_multilingual_translation_cross-speaker_dedup/tokenized/ \
    --task multilingual_translation \
    --gen-subset test \
    --path <your-saved-translation-checkpoint> \
    --beam 5 \
    --batch-size 4 --max-len-a 1.8 --max-len-b 10 --lenpen 1 --min-len 1 \
    --skip-invalid-size-inputs-valid-test --distributed-world-size 1 \
    --source-lang neutral --target-lang amused \
    --lang-pairs neutral-amused,neutral-sleepy,neutral-disgusted,neutral-angry,amused-sleepy,amused-disgusted,amused-neutral,amused-angry,angry-amused,angry-sleepy,angry-disgusted,angry-neutral,disgusted-amused,disgusted-sleepy,disgusted-neutral,disgusted-angry,sleepy-amused,sleepy-neutral,sleepy-disgusted,sleepy-angry \
    --results-path <token-output-path> \
    --user-dir examples/emotion_conversion/fairseq_models
  • Modify --source-lang and --target-lang to control the source and target emotions.
  • See fairseq documentation for a full overview of generation parameters (e.g., top-k/top-p sampling).
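
If you want to inspect the generated sequences before synthesis, here is a small sketch for reading the hypotheses out of generate-test.txt (fairseq-generate prefixes them with "H-<id>"); synthesize.py below consumes this file directly, so this is for inspection only.

hyps = {}
with open("generate-test.txt") as f:
    for line in f:
        if line.startswith("H-"):
            idx, _score, tokens = line.rstrip("\n").split("\t")
            hyps[int(idx[2:])] = tokens.split()

print(len(hyps), "hypotheses; first:", hyps.get(0, [])[:10])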

Waveform Synthesis

Using the output of the above command, the HiFiGAN vocoder, and the prosody prediction modules (F0 and duration), we can now generate the output waveforms:

python examples/emotion_conversion/synthesize.py \
    --result-path <token-output-path>/generate-test.txt \
    --data $DATA/fairseq-data/emov_multilingual_translation_cross-speaker_dedup/neutral-amused \
    --orig-tsv examples/emotion_conversion/data/data.tsv \
    --orig-km examples/emotion_conversion/data/hubert_base_ls960_layer9_clusters200/data.km \
    --checkpoint-file <hifigan-checkpoint-dir>/g_00400000 \
    --dur-model duration_predictor/ \
    --f0-model pitch_predictor/pitch_predictor.ckpt \
    -s neutral -t amused \
    --outdir ~/tmp/emotion_results/wavs/neutral-amused
  • Please make sure the source and target emotions here match those of the previous command.

Citation

If you find this useful in your research, please use the following BibTeX entry for citation.

@article{kreuk2021textless,
  title={Textless speech emotion conversion using decomposed and discrete representations},
  author={Kreuk, Felix and Polyak, Adam and Copet, Jade and Kharitonov, Eugene and Nguyen, Tu-Anh and Rivi{\`e}re, Morgane and Hsu, Wei-Ning and Mohamed, Abdelrahman and Dupoux, Emmanuel and Adi, Yossi},
  journal={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2022}
}