### Zero-shot voice conversion🎙🔁

We have performed a series of objective evaluations of Seed-VC's voice conversion capabilities.
For ease of reproduction, the source audios are 100 random utterances from LibriTTS-test-clean, and the reference audios are 12 randomly picked in-the-wild voices with unique characteristics. <br>

Source audios can be found under `./examples/libritts-test-clean` <br>
Reference audios can be found under `./examples/reference` <br>

We evaluate the conversion results in terms of speaker embedding cosine similarity (SECS), word error rate (WER) and character error rate (CER), and compare our results with two strong open-source baselines, namely [OpenVoice](https://github.com/myshell-ai/OpenVoice) and [CosyVoice](https://github.com/FunAudioLLM/CosyVoice).
The table below shows that our Seed-VC model significantly outperforms both baseline models in intelligibility and speaker similarity.<br>

| Models\Metrics | SECS↑      | WER↓      | CER↓     | SIG↑     | BAK↑     | OVRL↑    |
|----------------|------------|-----------|----------|----------|----------|----------|
| Ground Truth   | 1.0000     | 8.02      | 1.57     | ~        | ~        | ~        |
| OpenVoice      | 0.7547     | 15.46     | 4.73     | **3.56** | **4.02** | **3.27** |
| CosyVoice      | 0.8440     | 18.98     | 7.29     | 3.51     | **4.02** | 3.21     |
| Seed-VC(Ours)  | **0.8676** | **11.99** | **2.92** | 3.42     | 3.97     | 3.11     |

We have also compared Seed-VC with non-zero-shot voice conversion models for several characters, based on model availability:

| Characters          | Models\Metrics | SECS↑      | WER↓      | CER↓     | SIG↑     | BAK↑     | OVRL↑    |
|---------------------|----------------|------------|-----------|----------|----------|----------|----------|
| ~                   | Ground Truth   | 1.0000     | 6.43      | 1.00     | ~        | ~        | ~        |
| Tokai Teio          | So-VITS-4.0    | 0.8637     | 21.46     | 9.63     | 3.06     | 3.66     | 2.68     |
|                     | Seed-VC(Ours)  | **0.8899** | **15.32** | **4.66** | **3.12** | **3.71** | **2.72** |
| Milky Green         | So-VITS-4.0    | 0.6850     | 48.43     | 32.50    | 3.34     | 3.51     | 2.82     |
|                     | Seed-VC(Ours)  | **0.8072** | **7.26**  | **1.32** | **3.48** | **4.07** | **3.20** |
| Matikane Tannhuaser | So-VITS-4.0    | 0.8594     | 16.25     | 8.64     | **3.25** | 3.71     | 2.84     |
|                     | Seed-VC(Ours)  | **0.8768** | **12.62** | **5.86** | 3.18     | **3.83** | **2.85** |

The results show that, despite not being trained on the target speakers, Seed-VC achieves significantly better results than the non-zero-shot models.
However, this may vary considerably depending on the quality of the So-VITS model. A PR or issue is welcome if you find this comparison unfair or inaccurate.
(Tokai Teio model from [zomehwh/sovits-tannhauser](https://huggingface.co/spaces/zomehwh/sovits-tannhauser))
(Matikane Tannhuaser model from [zomehwh/sovits-tannhauser](https://huggingface.co/spaces/zomehwh/sovits-tannhauser))
(Milky Green model from [sparanoid/milky-green-sovits-4](https://huggingface.co/spaces/sparanoid/milky-green-sovits-4))

*English ASR results computed by the [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model* <br>
*Speaker embeddings computed by the [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model* <br>
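
For reference, the sketch below illustrates how these metrics can be computed with the tools listed above. It is a minimal, hypothetical example: the use of the `transformers` ASR pipeline and `jiwer` for WER/CER, as well as the file-handling details, are our own choices for illustration and not necessarily what `eval.py` does.

```python
# Minimal sketch of SECS / WER / CER computation (assumptions noted above).
import numpy as np
import librosa
import jiwer                                            # pip install jiwer
from resemblyzer import VoiceEncoder, preprocess_wav    # pip install resemblyzer
from transformers import pipeline                       # pip install transformers

encoder = VoiceEncoder()
asr = pipeline("automatic-speech-recognition", model="facebook/hubert-large-ls960-ft")

def secs(converted_path: str, reference_path: str) -> float:
    """Cosine similarity between Resemblyzer speaker embeddings."""
    emb_c = encoder.embed_utterance(preprocess_wav(converted_path))
    emb_r = encoder.embed_utterance(preprocess_wav(reference_path))
    return float(np.dot(emb_c, emb_r) / (np.linalg.norm(emb_c) * np.linalg.norm(emb_r)))

def wer_cer(converted_path: str, ground_truth_text: str) -> tuple[float, float]:
    """WER / CER of the ASR transcript of the converted audio vs. the source text.
    In practice both texts should be normalized (case, punctuation) before scoring."""
    wav, _ = librosa.load(converted_path, sr=16000)
    hyp = asr(wav)["text"]
    return jiwer.wer(ground_truth_text, hyp), jiwer.cer(ground_truth_text, hyp)
```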

You can reproduce the evaluation by running the `eval.py` script:
```bash
# Set --baseline to "openvoice" or "cosyvoice" to also compute baseline results
# (leave it empty to evaluate Seed-VC only); --max-samples limits the number of
# source utterances to go through.
python eval.py \
    --source ./examples/libritts-test-clean \
    --target ./examples/reference \
    --output ./examples/eval/converted \
    --diffusion-steps 25 \
    --length-adjust 1.0 \
    --inference-cfg-rate 0.7 \
    --xvector-extractor "resemblyzer" \
    --baseline "" \
    --max-samples 100
```

Before running it, make sure you have the OpenVoice and CosyVoice repositories correctly installed under `../OpenVoice/` and `../CosyVoice/` if you would like to run the baseline evaluation.

### Zero-shot singing voice conversion🎤🎶

An additional singing voice conversion evaluation was done on the [M4Singer](https://github.com/M4Singer/M4Singer) dataset, with 4 target speakers whose audio data is available [here](https://huggingface.co/datasets/XzJosh/audiodataset).
Speaker similarity is calculated by averaging the cosine similarities between each conversion result and all available samples in the respective character dataset.
For each character, one random utterance is chosen as the prompt for zero-shot inference. For comparison, we trained a [RVCv2-f0-48k](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) model for each character as a baseline.
For each singer type, 100 random utterances are used as source audio.

| Models\Metrics | F0CORR↑ | F0RMSE↓ | SECS↑      | CER↓      | SIG↑     | BAK↑     | OVRL↑    |
|----------------|---------|---------|------------|-----------|----------|----------|----------|
| RVCv2          | 0.9404  | 30.43   | 0.7264     | 28.46     | **3.41** | **4.05** | **3.12** |
| Seed-VC(Ours)  | 0.9375  | 33.35   | **0.7405** | **19.70** | 3.39     | 3.96     | 3.06     |

<details>
<summary>Click to expand detailed evaluation results</summary>

| Source Singer Type | Characters | Models\Metrics | F0CORR↑ | F0RMSE↓ | SECS↑ | CER↓ | SIG↑ | BAK↑ | OVRL↑ |
|--------------------|--------------------|----------------|---------|---------|------------|-----------|------|------|----------|
| Alto (Female) | ~ | Ground Truth | 1.0000 | 0.00 | ~ | 8.16 | ~ | ~ | ~ |
| | Azuma (Female) | RVCv2 | 0.9617 | 33.03 | **0.7352** | 24.70 | 3.36 | 4.07 | 3.07 |
| | | Seed-VC(Ours) | 0.9658 | 31.64 | 0.7341 | **15.23** | 3.37 | 4.02 | 3.07 |
| | Diana (Female) | RVCv2 | 0.9626 | 32.56 | 0.7212 | 19.67 | 3.45 | 4.08 | **3.17** |
| | | Seed-VC(Ours) | 0.9648 | 31.94 | **0.7457** | **16.81** | 3.49 | 3.99 | 3.15 |
| | Ding Zhen (Male) | RVCv2 | 0.9013 | 26.72 | 0.7221 | 18.53 | 3.37 | 4.03 | 3.06 |
| | | Seed-VC(Ours) | 0.9356 | 21.87 | **0.7513** | **15.63** | 3.44 | 3.94 | **3.09** |
| | Kobe Bryant (Male) | RVCv2 | 0.9215 | 23.90 | 0.7495 | 37.23 | 3.49 | 4.06 | **3.21** |
| | | Seed-VC(Ours) | 0.9248 | 23.40 | **0.7602** | **26.98** | 3.43 | 4.02 | 3.13 |
| Bass (Male) | ~ | Ground Truth | 1.0000 | 0.00 | ~ | 8.62 | ~ | ~ | ~ |
| | Azuma | RVCv2 | 0.9288 | 32.62 | **0.7148** | 24.88 | 3.45 | 4.10 | **3.18** |
| | | Seed-VC(Ours) | 0.9383 | 31.57 | 0.6960 | **10.31** | 3.45 | 4.03 | 3.15 |
| | Diana | RVCv2 | 0.9403 | 30.00 | 0.7010 | 14.54 | 3.53 | 4.15 | **3.27** |
| | | Seed-VC(Ours) | 0.9428 | 30.06 | **0.7299** | **9.66** | 3.53 | 4.11 | 3.25 |
| | Ding Zhen | RVCv2 | 0.9061 | 19.53 | 0.6922 | 25.99 | 3.36 | 4.09 | **3.08** |
| | | Seed-VC(Ours) | 0.9169 | 18.15 | **0.7260** | **14.13** | 3.38 | 3.98 | 3.07 |
| | Kobe Bryant | RVCv2 | 0.9302 | 16.37 | 0.7717 | 41.04 | 3.51 | 4.13 | **3.25** |
| | | Seed-VC(Ours) | 0.9176 | 17.93 | **0.7798** | **24.23** | 3.42 | 4.08 | 3.17 |
| Soprano (Female) | ~ | Ground Truth | 1.0000 | 0.00 | ~ | 27.92 | ~ | ~ | ~ |
| | Azuma | RVCv2 | 0.9742 | 47.80 | 0.7104 | 38.70 | 3.14 | 3.85 | **2.83** |
| | | Seed-VC(Ours) | 0.9521 | 64.00 | **0.7177** | **33.10** | 3.15 | 3.86 | 2.81 |
| | Diana | RVCv2 | 0.9754 | 46.59 | **0.7319** | 32.36 | 3.14 | 3.85 | **2.83** |
| | | Seed-VC(Ours) | 0.9573 | 59.70 | 0.7317 | **30.57** | 3.11 | 3.78 | 2.74 |
| | Ding Zhen | RVCv2 | 0.9543 | 31.45 | 0.6792 | 40.80 | 3.41 | 4.08 | **3.14** |
| | | Seed-VC(Ours) | 0.9486 | 33.37 | **0.6979** | **34.45** | 3.41 | 3.97 | 3.10 |
| | Kobe Bryant | RVCv2 | 0.9691 | 25.50 | 0.6276 | 61.59 | 3.43 | 4.04 | **3.15** |
| | | Seed-VC(Ours) | 0.9496 | 32.76 | **0.6683** | **39.82** | 3.32 | 3.98 | 3.04 |
| Tenor (Male) | ~ | Ground Truth | 1.0000 | 0.00 | ~ | 5.94 | ~ | ~ | ~ |
| | Azuma | RVCv2 | 0.9333 | 42.09 | **0.7832** | 16.66 | 3.46 | 4.07 | **3.18** |
| | | Seed-VC(Ours) | 0.9162 | 48.06 | 0.7697 | **8.48** | 3.38 | 3.89 | 3.01 |
| | Diana | RVCv2 | 0.9467 | 36.65 | 0.7729 | 15.28 | 3.53 | 4.08 | **3.24** |
| | | Seed-VC(Ours) | 0.9360 | 41.49 | **0.7920** | **8.55** | 3.49 | 3.93 | 3.13 |
| | Ding Zhen | RVCv2 | 0.9197 | 22.82 | 0.7591 | 12.92 | 3.40 | 4.02 | **3.09** |
| | | Seed-VC(Ours) | 0.9247 | 22.77 | **0.7721** | **13.95** | 3.45 | 3.82 | 3.05 |
| | Kobe Bryant | RVCv2 | 0.9415 | 19.33 | 0.7507 | 30.52 | 3.48 | 4.02 | **3.19** |
| | | Seed-VC(Ours) | 0.9082 | 24.86 | **0.7764** | **13.35** | 3.39 | 3.93 | 3.07 |

</details>

Although Seed-VC is not trained on the target speakers and only one random utterance is used as the prompt, it still consistently outperforms the speaker-specific RVCv2 models
in terms of speaker similarity (SECS) and intelligibility (CER), which demonstrates the superior voice cloning capability and robustness of Seed-VC.

However, we observe that Seed-VC's audio quality (DNSMOS) is slightly lower than that of RVCv2. We take this drawback seriously and
will give high priority to improving the audio quality in the future.
A PR or issue is welcome if you find this comparison unfair or inaccurate.

*Chinese ASR results computed by [SenseVoiceSmall](https://github.com/FunAudioLLM/SenseVoice)* <br>
*Speaker embeddings computed by the [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model* <br>
*We apply a +12 semitone pitch shift for male-to-female conversion and a -12 semitone shift for female-to-male conversion (i.e. one octave up or down), and no pitch shift otherwise*
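
For reference, the sketch below shows how the singing-specific metrics can be computed: the averaged speaker similarity described above, plus F0CORR and F0RMSE, which we take to be the Pearson correlation and root-mean-square error of frame-level F0 between source and converted audio. This is a minimal sketch under those assumptions; the pitch extractor (`librosa.pyin` here), the frame alignment, and the handling of the octave shift are our own choices and not necessarily what the evaluation script uses.

```python
# Minimal sketch of the singing-VC metrics, under the assumptions stated above.
import numpy as np
import librosa
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def averaged_secs(converted_path: str, character_sample_paths: list[str]) -> float:
    """Average cosine similarity between the conversion result and all
    available samples of the target character."""
    emb_c = encoder.embed_utterance(preprocess_wav(converted_path))
    sims = [cosine(emb_c, encoder.embed_utterance(preprocess_wav(p)))
            for p in character_sample_paths]
    return float(np.mean(sims))

def f0_metrics(source_path: str, converted_path: str, semitone_shift: int = 0):
    """F0CORR (Pearson correlation) and F0RMSE (Hz) over commonly voiced frames.
    The source F0 is scaled by 2 ** (semitone_shift / 12), so a +/-12 semitone
    shift corresponds to doubling / halving the expected pitch (one octave)."""
    def f0(path):
        y, sr = librosa.load(path, sr=16000)
        f0_track, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"), sr=sr)
        return f0_track                      # NaN in unvoiced frames
    f0_src, f0_cvt = f0(source_path), f0(converted_path)
    n = min(len(f0_src), len(f0_cvt))
    f0_src = f0_src[:n] * 2 ** (semitone_shift / 12)   # expected target pitch
    f0_cvt = f0_cvt[:n]
    voiced = ~np.isnan(f0_src) & ~np.isnan(f0_cvt)
    corr = np.corrcoef(f0_src[voiced], f0_cvt[voiced])[0, 1]
    rmse = np.sqrt(np.mean((f0_src[voiced] - f0_cvt[voiced]) ** 2))
    return float(corr), float(rmse)
```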