|
# CosyVoice |
|
|
|
## Install |
|
|
|
**Clone and install** |
|
|
|
- Clone the repo |
|
``` sh |
|
git clone https://github.com/modelscope/cosyvoice.git |
|
``` |
|
|
|
- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html |
|
- Create Conda env: |
|
|
|
``` sh |
|
conda create -n cosyvoice python=3.8 |
|
conda activate cosyvoice |
|
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com |
|
``` |
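
Before moving on, a quick sanity check (not part of the official steps, just a convenience) confirms that PyTorch and torchaudio were installed correctly:

``` python
# Quick environment check: verify that torch and torchaudio import cleanly
# and report whether a GPU is visible. Not part of the official setup.
import torch
import torchaudio

print('torch:', torch.__version__)
print('torchaudio:', torchaudio.__version__)
print('CUDA available:', torch.cuda.is_available())
```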
|
|
|
**Model download** |
|
|
|
We strongly recommend that you download our pretrained multi_lingual and multi_emotion models.
|
|
|
If you are an expert in this field and are only interested in training your own CosyVoice model from scratch, you can skip this step.
|
|
|
``` sh |
|
mkdir -p pretrained_models |
|
git clone https://www.modelscope.cn/CosyVoice/multi_lingual_cosytts.git pretrained_models/multi_lingual_cosytts |
|
git clone https://www.modelscope.cn/CosyVoice/multi_emotion_cosytts.git pretrained_models/multi_emotion_cosytts |
|
``` |
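
If git cloning from modelscope is slow, the modelscope Python SDK can download the same repos; this is a sketch that assumes the model IDs mirror the git URLs above:

``` python
# Alternative download via the modelscope SDK. The model IDs below are
# assumed to match the git repos above; adjust them if they differ.
from modelscope import snapshot_download

snapshot_download('CosyVoice/multi_lingual_cosytts', local_dir='pretrained_models/multi_lingual_cosytts')
snapshot_download('CosyVoice/multi_emotion_cosytts', local_dir='pretrained_models/multi_emotion_cosytts')
```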
|
|
|
**Basic Usage** |
|
|
|
For zero_shot and sft inference, please use the models in `pretrained_models/multi_lingual_cosytts`.
|
|
|
``` python
|
from cosyvoice.cli.cosyvoice import CosyVoice |
|
from cosyvoice.utils.file_utils import load_wav |
|
import torchaudio |
|
|
|
cosyvoice = CosyVoice('pretrained_models/multi_lingual_cosytts') |
|
|
|
# sft usage: synthesize with one of the built-in speaker ids
|
print(cosyvoice.list_avaliable_spks()) |
|
output = cosyvoice.inference_sft('hello, my name is Jack. What is your name?', 'aishuo') |
|
torchaudio.save('sft.wav', output['tts_speech'], 22050) |
|
|
|
# zero_shot usage: the second argument is the transcript of the prompt audio
|
prompt_speech_22050 = load_wav('1089_134686_000002_000000.wav', 22050) |
|
output = cosyvoice.inference_zero_shot('hello, my name is Jack. What is your name?', 'It would be a gloomy secret night.', prompt_speech_22050) |
|
torchaudio.save('zero_shot.wav', output['tts_speech'], 22050) |
|
``` |
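
To synthesize several sentences in one session, you can reuse the loaded model in a loop. A minimal sketch continuing the snippet above (speaker 'aishuo' is just the example id shown there):

``` python
# Batch several sentences through sft inference, one wav file per sentence.
texts = [
    'hello, my name is Jack. What is your name?',
    'Nice to meet you, I am doing well today.',
]
for i, text in enumerate(texts):
    output = cosyvoice.inference_sft(text, 'aishuo')
    torchaudio.save('sft_{}.wav'.format(i), output['tts_speech'], 22050)
```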
|
|
|
For instruct inference, please use the models in `pretrained_models/multi_emotion_cosytts`.
|
|
|
``` python
|
from cosyvoice.cli.cosyvoice import CosyVoice |
|
from cosyvoice.utils.file_utils import load_wav |
|
import torchaudio |
|
|
|
cosyvoice = CosyVoice('pretrained_models/multi_emotion_cosytts') |
|
|
|
# instruct usage: the last argument is a natural-language description of the desired speaking style
|
prompt_speech_22050 = load_wav('1089_134686_000002_000000.wav', 22050) |
|
output = cosyvoice.inference_instruct('hello, my name is Jack. What is your name?', 'It would be a gloomy secret night.', prompt_speech_22050, 'A serene woman articulates thoughtfully in a high pitch and slow tempo, exuding a peaceful and joyful aura.') |
|
torchaudio.save('instruct.wav', output['tts_speech'], 22050) |
|
``` |
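
Since the style description appears to be free-form natural language, you can reuse one prompt recording while varying the description. A sketch continuing the snippet above:

``` python
# Reuse the same prompt speech while varying the natural-language style
# description; writes one wav per style.
styles = [
    'A cheerful man speaks quickly in a low pitch, sounding excited.',
    'A calm woman speaks slowly and softly, with a melancholic aura.',
]
for i, style in enumerate(styles):
    output = cosyvoice.inference_instruct('hello, my name is Jack. What is your name?', 'It would be a gloomy secret night.', prompt_speech_22050, style)
    torchaudio.save('instruct_{}.wav'.format(i), output['tts_speech'], 22050)
```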
|
|
|
**Advanced Usage** |
|
|
|
For advanced users, we provide training and inference scripts in `examples/libritts/cosyvoice/run.sh`.

You can get familiar with CosyVoice by following this recipe; a minimal invocation is sketched below.
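
For example, from the repo root (the stages that actually run are defined inside run.sh itself):

``` sh
# Launch the libritts recipe; see run.sh for the individual train and
# inference stages it executes.
cd examples/libritts/cosyvoice
bash run.sh
```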
|
|
|
**Start web demo** |
|
|
|
You can use our web demo page to get familiar with CosyVoice quickly.

Only zero_shot and sft inference are supported in the web demo.
|
|
|
Please see the demo website for details. |
|
|
|
``` |
|
python3 webui.py --port 50000 --model_dir pretrained_models/multi_lingual_cosytts |
|
``` |
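
Once the server starts, the demo should be reachable at http://localhost:50000 in a browser (assuming the default host binding).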
|
|
|
**Build for deployment** |
|
|
|
Optionally, if you want to use grpc for service deployment, you can run the following steps. Otherwise, you can skip this step.
|
|
|
``` sh |
|
cd runtime/python |
|
docker build -t cosyvoice:v1.0 . |
|
# change multi_lingual_cosytts to multi_emotion_cosytts if you want to use instruct inference |
|
docker run -d --runtime=nvidia -v `pwd`/../../pretrained_models/multi_lingual_cosytts:/opt/cosyvoice/cosyvoice/runtime/pretrained_models -p 50000:50000 cosyvoice:v1.0 |
|
python3 client.py --port 50000 --mode <sft|zero_shot|instruct> |
|
``` |
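
Before running the client, you can confirm the container is serving with standard Docker commands (generic Docker usage, not CosyVoice-specific):

``` sh
# Generic Docker checks: confirm the container is running and watch its
# logs for startup errors before launching the client.
docker ps --filter ancestor=cosyvoice:v1.0
docker logs -f $(docker ps -q --filter ancestor=cosyvoice:v1.0)
```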