---
license: mit
tags:
- lyodos/classic-vc
- lyodos/mmcxli
- voice
- speaker
- voice conversion
---

# ClassicVC

[ClassicVC](https://github.com/lyodos/classic-vc) is an any-to-any voice conversion model that lets users design original speaker styles by choosing coordinates in continuous latent spaces. The model components are implemented in PyTorch and are fully compatible with ONNX.
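
As a minimal sketch of what this ONNX compatibility means in practice, a component can be exported with `torch.onnx.export`; the module, shapes, and file name below are placeholders for illustration, not the actual ClassicVC architecture.

```python
import torch

# Placeholder module standing in for a ClassicVC component;
# the real architectures are defined in the ClassicVC repository.
model = torch.nn.Sequential(
    torch.nn.Linear(80, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 80),
)
model.eval()

# Dummy input; the (1, 80) shape is an assumption for illustration only.
dummy = torch.randn(1, 80)

# Export the module so it can later run under ONNX Runtime without PyTorch.
torch.onnx.export(
    model,
    dummy,
    "component.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```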

[MMCXLI](https://github.com/lyodos/mmcxli) provides a dedicated graphical user interface (GUI) for ClassicVC. It runs on wxPython and ONNX Runtime, so users can download the ONNX files and try out voice conversion without installing PyTorch or training a model on their own voice data.
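
For instance, a downloaded checkpoint can be run with nothing but ONNX Runtime and NumPy; a minimal sketch, where the file name and input shape are assumptions rather than the published interface:

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder; substitute one of the published ONNX files.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Inspect the graph to learn the declared input name and shape.
inp = session.get_inputs()[0]
print(inp.name, inp.shape)

# Run inference on dummy data; the (1, 80) shape is an assumption.
x = np.random.randn(1, 80).astype(np.float32)
outputs = session.run(None, {inp.name: x})
print(outputs[0].shape)
```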

## Model Details

### Model Description

- **Developed by:** Lyodos (Lyodos the City of the Museum)

### Model Sources

- **Repository:** [GitHub](https://github.com/lyodos/classic-vc)

----

## Uses

Under the MIT License, users may use the model code and checkpoints for research purposes. The model is provided with no guarantees.

### Direct Use

Direct use is supported through [MMCXLI](https://github.com/lyodos/mmcxli), the dedicated GUI for ClassicVC.

### Out-of-Scope Use

This model was prototyped as a hobbyist's research into any-to-any voice conversion, and we make no guarantees, particularly regarding its reliability or real-time operation.

We do not encourage use in situations involving an unspecified number of people, such as web broadcasting, or in mission-critical applications, including medical, transportation, infrastructure, and weapon systems. As the developer, we cannot prohibit such use, since the MIT License is the only stated license, but we do not endorse it.

## Bias, Risks, and Limitations

We used three large-scale speech corpora (LibriSpeech, Samrómur Children 21.09, and VoxCeleb 1 and 2) so that the latent space of speakers embedded by the ClassicVC style encoder covers natural human voices as inclusively as possible.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

The [Notebook 01 of the ClassicVC repository](https://github.com/lyodos/classic-vc) provides the procedure for offline (non-real-time) voice conversion.
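
The notebook is the authoritative reference; purely as an illustration of what such an offline pipeline tends to look like, the sketch below chains hypothetical content-encoder, style-encoder, and decoder ONNX graphs with ONNX Runtime. Every file name, input name, and shape here is an assumption, not ClassicVC's actual interface.

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf

# All file names and graph I/O below are hypothetical placeholders;
# see Notebook 01 for the real pipeline.
content_enc = ort.InferenceSession("content_encoder.onnx")
style_enc = ort.InferenceSession("style_encoder.onnx")
decoder = ort.InferenceSession("decoder.onnx")

source, sr = sf.read("source.wav", dtype="float32")       # utterance to convert
reference, _ = sf.read("reference.wav", dtype="float32")  # target speaker sample

# Assumed single-input graphs taking (batch, samples) mono waveforms.
content = content_enc.run(None, {content_enc.get_inputs()[0].name: source[None, :]})[0]
style = style_enc.run(None, {style_enc.get_inputs()[0].name: reference[None, :]})[0]

# Assumed two-input decoder: content features plus a style embedding.
converted = decoder.run(None, {
    decoder.get_inputs()[0].name: content,
    decoder.get_inputs()[1].name: style,
})[0]

sf.write("converted.wav", converted.squeeze(), sr)
```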

[The MMCXLI repository](https://github.com/lyodos/mmcxli) provides the GUI, which runs in a local Python environment.

----

## Training Details

### Training Data

The model checkpoints provided here were trained on the following three datasets.

1. LibriSpeech ASR corpus
   * V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964.
   * https://ieeexplore.ieee.org/document/7178964
   * https://openslr.org/12/

2. Samrómur Children 21.09
   * Mena, Carlos; et al., 2021, Samrómur Children 21.09, CLARIN-IS, http://hdl.handle.net/20.500.12537/185.
   * https://repository.clarin.is/repository/xmlui/handle/20.500.12537/185
   * https://openslr.org/117/

3. VoxCeleb 1 and 2
   * A. Nagrani*, J. S. Chung*, A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset", Interspeech 2017
   * J. S. Chung*, A. Nagrani*, A. Zisserman, "VoxCeleb2: Deep Speaker Recognition", Interspeech 2018
   * A. Nagrani*, J. S. Chung*, W. Xie, A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild", Computer Speech and Language, 2019
   * https://huggingface.co/datasets/ProgramComputer/voxceleb/tree/main/vox2

### Training Procedure

The [Notebook 02 of the ClassicVC repository](https://github.com/lyodos/classic-vc) provides the procedure for data preparation.
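
The notebook defines the actual preparation steps; as a generic sketch of the kind of preprocessing such corpora need (the 16 kHz mono target and the flat output layout are assumptions, not values taken from Notebook 02), audio can be normalized to one format with torchaudio:

```python
from pathlib import Path

import torchaudio

TARGET_SR = 16_000  # assumed sampling rate; Notebook 02 defines the real target


def prepare(src_dir: str, dst_dir: str) -> None:
    """Resample every FLAC/WAV file under src_dir to TARGET_SR mono WAV."""
    for path in Path(src_dir).rglob("*"):
        if path.suffix.lower() not in {".flac", ".wav"}:
            continue
        wav, sr = torchaudio.load(str(path))
        wav = wav.mean(dim=0, keepdim=True)  # downmix to mono
        if sr != TARGET_SR:
            wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
        # Flattens the tree into dst_dir; assumes file names are unique.
        out = Path(dst_dir) / path.with_suffix(".wav").name
        out.parent.mkdir(parents=True, exist_ok=True)
        torchaudio.save(str(out), wav, TARGET_SR)
```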

The [Notebook 03 of the ClassicVC repository](https://github.com/lyodos/classic-vc) provides the training code.