### Zero-shot voice conversion🎙🔁

We have performed a series of objective evaluations of Seed-VC's voice conversion capabilities.
For ease of reproduction, the source audios are 100 random utterances from LibriTTS-test-clean, and the reference audios are 12 randomly picked in-the-wild voices with unique characteristics. <br>

Source audios can be found under `./examples/libritts-test-clean` <br>
Reference audios can be found under `./examples/reference` <br>

We evaluate the conversion results in terms of speaker embedding cosine similarity (SECS), word error rate (WER) and character error rate (CER), and compare our results with two strong open-source baselines, namely [OpenVoice](https://github.com/myshell-ai/OpenVoice) and [CosyVoice](https://github.com/FunAudioLLM/CosyVoice).
The table below shows that our Seed-VC model significantly outperforms both baseline models in intelligibility and speaker similarity.<br>

| Models\Metrics | SECS↑      | WER↓      | CER↓     | SIG↑     | BAK↑     | OVRL↑    |
|----------------|------------|-----------|----------|----------|----------|----------|
| Ground Truth   | 1.0000     | 8.02      | 1.57     | ~        | ~        | ~        |
| OpenVoice      | 0.7547     | 15.46     | 4.73     | **3.56** | **4.02** | **3.27** |
| CosyVoice      | 0.8440     | 18.98     | 7.29     | 3.51     | **4.02** | 3.21     |
| Seed-VC(Ours)  | **0.8676** | **11.99** | **2.92** | 3.42     | 3.97     | 3.11     |

We have also compared Seed-VC with non-zero-shot voice conversion models for several characters, based on model availability:

| Characters          | Models\Metrics | SECS↑      | WER↓      | CER↓     | SIG↑     | BAK↑     | OVRL↑    |
|---------------------|----------------|------------|-----------|----------|----------|----------|----------|
| ~                   | Ground Truth   | 1.0000     | 6.43      | 1.00     | ~        | ~        | ~        |
| Tokai Teio          | So-VITS-4.0    | 0.8637     | 21.46     | 9.63     | 3.06     | 3.66     | 2.68     |
|                     | Seed-VC(Ours)  | **0.8899** | **15.32** | **4.66** | **3.12** | **3.71** | **2.72** |
| Milky Green         | So-VITS-4.0    | 0.6850     | 48.43     | 32.50    | 3.34     | 3.51     | 2.82     |
|                     | Seed-VC(Ours)  | **0.8072** | **7.26**  | **1.32** | **3.48** | **4.07** | **3.20** |
| Matikane Tannhuaser | So-VITS-4.0    | 0.8594     | 16.25     | 8.64     | **3.25** | 3.71     | 2.84     |
|                     | Seed-VC(Ours)  | **0.8768** | **12.62** | **5.86** | 3.18     | **3.83** | **2.85** |

The results show that, despite not being trained on the target speakers, Seed-VC achieves significantly better results than the non-zero-shot models.
However, this may vary considerably depending on the quality of the So-VITS model. A PR or issue is welcome if you find this comparison unfair or inaccurate.
(Tokai Teio model from [zomehwh/sovits-tannhauser](https://huggingface.co/spaces/zomehwh/sovits-tannhauser))
(Matikane Tannhuaser model from [zomehwh/sovits-tannhauser](https://huggingface.co/spaces/zomehwh/sovits-tannhauser))
(Milky Green model from [sparanoid/milky-green-sovits-4](https://huggingface.co/spaces/sparanoid/milky-green-sovits-4))

*English ASR results computed by the [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model* <br>
*Speaker embeddings computed by the [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model* <br>
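
For reference, the sketch below illustrates how these metrics can be computed with the tools listed above. It is a minimal, hypothetical example: the use of the `transformers` ASR pipeline and `jiwer` for WER/CER, as well as the file-handling details, are our own choices for illustration and not necessarily what `eval.py` does.

```python
# Minimal sketch of SECS / WER / CER computation (assumptions noted above).
import numpy as np
import librosa
import jiwer                                            # pip install jiwer
from resemblyzer import VoiceEncoder, preprocess_wav    # pip install resemblyzer
from transformers import pipeline                       # pip install transformers

encoder = VoiceEncoder()
asr = pipeline("automatic-speech-recognition", model="facebook/hubert-large-ls960-ft")

def secs(converted_path: str, reference_path: str) -> float:
    """Cosine similarity between Resemblyzer speaker embeddings."""
    emb_c = encoder.embed_utterance(preprocess_wav(converted_path))
    emb_r = encoder.embed_utterance(preprocess_wav(reference_path))
    return float(np.dot(emb_c, emb_r) / (np.linalg.norm(emb_c) * np.linalg.norm(emb_r)))

def wer_cer(converted_path: str, ground_truth_text: str) -> tuple[float, float]:
    """WER / CER of the ASR transcript of the converted audio vs. the source text.
    In practice both texts should be normalized (case, punctuation) before scoring."""
    wav, _ = librosa.load(converted_path, sr=16000)
    hyp = asr(wav)["text"]
    return jiwer.wer(ground_truth_text, hyp), jiwer.cer(ground_truth_text, hyp)
```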

You can reproduce the evaluation by running the `eval.py` script:
```bash
# Set --baseline to "openvoice" or "cosyvoice" to also compute baseline results
# (leave it empty to evaluate Seed-VC only); --max-samples limits the number of
# source utterances to go through.
python eval.py \
    --source ./examples/libritts-test-clean \
    --target ./examples/reference \
    --output ./examples/eval/converted \
    --diffusion-steps 25 \
    --length-adjust 1.0 \
    --inference-cfg-rate 0.7 \
    --xvector-extractor "resemblyzer" \
    --baseline "" \
    --max-samples 100
```

Before running it, make sure you have the OpenVoice and CosyVoice repositories correctly installed under `../OpenVoice/` and `../CosyVoice/` if you would like to run the baseline evaluation.

### Zero-shot singing voice conversion🎤🎶

An additional singing voice conversion evaluation was done on the [M4Singer](https://github.com/M4Singer/M4Singer) dataset, with 4 target speakers whose audio data is available [here](https://huggingface.co/datasets/XzJosh/audiodataset).
Speaker similarity is calculated by averaging the cosine similarities between each conversion result and all available samples in the respective character dataset.
For each character, one random utterance is chosen as the prompt for zero-shot inference. For comparison, we trained a [RVCv2-f0-48k](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) model for each character as a baseline.
For each singer type, 100 random utterances are used as source audio.

| Models\Metrics | F0CORR↑ | F0RMSE↓ | SECS↑      | CER↓      | SIG↑     | BAK↑     | OVRL↑    |
|----------------|---------|---------|------------|-----------|----------|----------|----------|
| RVCv2          | 0.9404  | 30.43   | 0.7264     | 28.46     | **3.41** | **4.05** | **3.12** |
| Seed-VC(Ours)  | 0.9375  | 33.35   | **0.7405** | **19.70** | 3.39     | 3.96     | 3.06     |

<details>
<summary>Click to expand detailed evaluation results</summary>

| Source Singer Type | Characters | Models\Metrics | F0CORR↑ | F0RMSE↓ | SECS↑ | CER↓ | SIG↑ | BAK↑ | OVRL↑ |
|--------------------|--------------------|----------------|---------|---------|------------|-----------|------|------|----------|
| Alto (Female) | ~ | Ground Truth | 1.0000 | 0.00 | ~ | 8.16 | ~ | ~ | ~ |
| | Azuma (Female) | RVCv2 | 0.9617 | 33.03 | **0.7352** | 24.70 | 3.36 | 4.07 | 3.07 |
| | | Seed-VC(Ours) | 0.9658 | 31.64 | 0.7341 | **15.23** | 3.37 | 4.02 | 3.07 |
| | Diana (Female) | RVCv2 | 0.9626 | 32.56 | 0.7212 | 19.67 | 3.45 | 4.08 | **3.17** |
| | | Seed-VC(Ours) | 0.9648 | 31.94 | **0.7457** | **16.81** | 3.49 | 3.99 | 3.15 |
| | Ding Zhen (Male) | RVCv2 | 0.9013 | 26.72 | 0.7221 | 18.53 | 3.37 | 4.03 | 3.06 |
| | | Seed-VC(Ours) | 0.9356 | 21.87 | **0.7513** | **15.63** | 3.44 | 3.94 | **3.09** |
| | Kobe Bryant (Male) | RVCv2 | 0.9215 | 23.90 | 0.7495 | 37.23 | 3.49 | 4.06 | **3.21** |
| | | Seed-VC(Ours) | 0.9248 | 23.40 | **0.7602** | **26.98** | 3.43 | 4.02 | 3.13 |
| Bass (Male) | ~ | Ground Truth | 1.0000 | 0.00 | ~ | 8.62 | ~ | ~ | ~ |
| | Azuma | RVCv2 | 0.9288 | 32.62 | **0.7148** | 24.88 | 3.45 | 4.10 | **3.18** |
| | | Seed-VC(Ours) | 0.9383 | 31.57 | 0.6960 | **10.31** | 3.45 | 4.03 | 3.15 |
| | Diana | RVCv2 | 0.9403 | 30.00 | 0.7010 | 14.54 | 3.53 | 4.15 | **3.27** |
| | | Seed-VC(Ours) | 0.9428 | 30.06 | **0.7299** | **9.66** | 3.53 | 4.11 | 3.25 |
| | Ding Zhen | RVCv2 | 0.9061 | 19.53 | 0.6922 | 25.99 | 3.36 | 4.09 | **3.08** |
| | | Seed-VC(Ours) | 0.9169 | 18.15 | **0.7260** | **14.13** | 3.38 | 3.98 | 3.07 |
| | Kobe Bryant | RVCv2 | 0.9302 | 16.37 | 0.7717 | 41.04 | 3.51 | 4.13 | **3.25** |
| | | Seed-VC(Ours) | 0.9176 | 17.93 | **0.7798** | **24.23** | 3.42 | 4.08 | 3.17 |
| Soprano (Female) | ~ | Ground Truth | 1.0000 | 0.00 | ~ | 27.92 | ~ | ~ | ~ |
| | Azuma | RVCv2 | 0.9742 | 47.80 | 0.7104 | 38.70 | 3.14 | 3.85 | **2.83** |
| | | Seed-VC(Ours) | 0.9521 | 64.00 | **0.7177** | **33.10** | 3.15 | 3.86 | 2.81 |
| | Diana | RVCv2 | 0.9754 | 46.59 | **0.7319** | 32.36 | 3.14 | 3.85 | **2.83** |
| | | Seed-VC(Ours) | 0.9573 | 59.70 | 0.7317 | **30.57** | 3.11 | 3.78 | 2.74 |
| | Ding Zhen | RVCv2 | 0.9543 | 31.45 | 0.6792 | 40.80 | 3.41 | 4.08 | **3.14** |
| | | Seed-VC(Ours) | 0.9486 | 33.37 | **0.6979** | **34.45** | 3.41 | 3.97 | 3.10 |
| | Kobe Bryant | RVCv2 | 0.9691 | 25.50 | 0.6276 | 61.59 | 3.43 | 4.04 | **3.15** |
| | | Seed-VC(Ours) | 0.9496 | 32.76 | **0.6683** | **39.82** | 3.32 | 3.98 | 3.04 |
| Tenor (Male) | ~ | Ground Truth | 1.0000 | 0.00 | ~ | 5.94 | ~ | ~ | ~ |
| | Azuma | RVCv2 | 0.9333 | 42.09 | **0.7832** | 16.66 | 3.46 | 4.07 | **3.18** |
| | | Seed-VC(Ours) | 0.9162 | 48.06 | 0.7697 | **8.48** | 3.38 | 3.89 | 3.01 |
| | Diana | RVCv2 | 0.9467 | 36.65 | 0.7729 | 15.28 | 3.53 | 4.08 | **3.24** |
| | | Seed-VC(Ours) | 0.9360 | 41.49 | **0.7920** | **8.55** | 3.49 | 3.93 | 3.13 |
| | Ding Zhen | RVCv2 | 0.9197 | 22.82 | 0.7591 | 12.92 | 3.40 | 4.02 | **3.09** |
| | | Seed-VC(Ours) | 0.9247 | 22.77 | **0.7721** | **13.95** | 3.45 | 3.82 | 3.05 |
| | Kobe Bryant | RVCv2 | 0.9415 | 19.33 | 0.7507 | 30.52 | 3.48 | 4.02 | **3.19** |
| | | Seed-VC(Ours) | 0.9082 | 24.86 | **0.7764** | **13.35** | 3.39 | 3.93 | 3.07 |

</details>

Although Seed-VC is not trained on the target speakers and only one random utterance is used as the prompt, it still consistently outperforms the speaker-specific RVCv2 models
in terms of speaker similarity (SECS) and intelligibility (CER), which demonstrates the superior voice cloning capability and robustness of Seed-VC.

However, we observe that Seed-VC's audio quality (DNSMOS) is slightly lower than that of RVCv2. We take this drawback seriously and
will give high priority to improving the audio quality in the future.
A PR or issue is welcome if you find this comparison unfair or inaccurate.

*Chinese ASR results computed by [SenseVoiceSmall](https://github.com/FunAudioLLM/SenseVoice)* <br>
*Speaker embeddings computed by the [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model* <br>
*We apply a +12 semitone pitch shift for male-to-female conversion and a -12 semitone shift for female-to-male conversion (i.e. one octave up or down), and no pitch shift otherwise*
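
For reference, the sketch below shows how the singing-specific metrics can be computed: the averaged speaker similarity described above, plus F0CORR and F0RMSE, which we take to be the Pearson correlation and root-mean-square error of frame-level F0 between source and converted audio. This is a minimal sketch under those assumptions; the pitch extractor (`librosa.pyin` here), the frame alignment, and the handling of the octave shift are our own choices and not necessarily what the evaluation script uses.

```python
# Minimal sketch of the singing-VC metrics, under the assumptions stated above.
import numpy as np
import librosa
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def averaged_secs(converted_path: str, character_sample_paths: list[str]) -> float:
    """Average cosine similarity between the conversion result and all
    available samples of the target character."""
    emb_c = encoder.embed_utterance(preprocess_wav(converted_path))
    sims = [cosine(emb_c, encoder.embed_utterance(preprocess_wav(p)))
            for p in character_sample_paths]
    return float(np.mean(sims))

def f0_metrics(source_path: str, converted_path: str, semitone_shift: int = 0):
    """F0CORR (Pearson correlation) and F0RMSE (Hz) over commonly voiced frames.
    The source F0 is scaled by 2 ** (semitone_shift / 12), so a +/-12 semitone
    shift corresponds to doubling / halving the expected pitch (one octave)."""
    def f0(path):
        y, sr = librosa.load(path, sr=16000)
        f0_track, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"), sr=sr)
        return f0_track                      # NaN in unvoiced frames
    f0_src, f0_cvt = f0(source_path), f0(converted_path)
    n = min(len(f0_src), len(f0_cvt))
    f0_src = f0_src[:n] * 2 ** (semitone_shift / 12)   # expected target pitch
    f0_cvt = f0_cvt[:n]
    voiced = ~np.isnan(f0_src) & ~np.isnan(f0_cvt)
    corr = np.corrcoef(f0_src[voiced], f0_cvt[voiced])[0, 1]
    rmse = np.sqrt(np.mean((f0_src[voiced] - f0_cvt[voiced]) ** 2))
    return float(corr), float(rmse)
```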