CosyVoice committed
Commit 475b5db · Parent: 7e0c26f
Files changed (9)
  1. .gitattributes +5 -0
  2. README.md +85 -39
  3. campplus.onnx +3 -0
  4. cosyvoice.yaml +197 -0
  5. flow.pt +3 -0
  6. hift.pt +3 -0
  7. llm.pt +3 -0
  8. speech_tokenizer_v1.onnx +3 -0
  9. spk2info.pt +3 -0
.gitattributes CHANGED
@@ -35,3 +35,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.gguf* filter=lfs diff=lfs merge=lfs -text
  *.ggml filter=lfs diff=lfs merge=lfs -text
  *.llamafile* filter=lfs diff=lfs merge=lfs -text
+ flow.pt filter=lfs diff=lfs merge=lfs -text
+ hift.pt filter=lfs diff=lfs merge=lfs -text
+ llm.pt filter=lfs diff=lfs merge=lfs -text
+ speech_tokenizer_v1.onnx filter=lfs diff=lfs merge=lfs -text
+ campplus.onnx filter=lfs diff=lfs merge=lfs -text
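
For illustration, here is a minimal sketch of how these rules map files to LFS, assuming plain glob semantics (which holds for the simple patterns in this hunk; full gitattributes matching has extra rules). Earlier rules not shown in this hunk may cover other files, such as spk2info.pt.

``` python
# Sketch: check which files the LFS patterns from this hunk would match.
# Assumes simple glob semantics; real gitattributes matching is richer.
from fnmatch import fnmatch

lfs_patterns = [
    "*.gguf*", "*.ggml", "*.llamafile*",          # pre-existing context rules
    "flow.pt", "hift.pt", "llm.pt",               # added in this commit
    "speech_tokenizer_v1.onnx", "campplus.onnx",  # added in this commit
]

for name in ["flow.pt", "campplus.onnx", "cosyvoice.yaml", "README.md"]:
    in_lfs = any(fnmatch(name, p) for p in lfs_patterns)
    print(f"{name}: {'LFS pointer' if in_lfs else 'stored directly'}")
```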
README.md CHANGED
@@ -1,51 +1,97 @@
- ---
- frameworks:
- - Pytorch
- license: Apache License 2.0
- tasks:
- - text-to-speech
- 
- #model-type:
- ##e.g. gpt, phi, llama, chatglm, baichuan, etc.
- #- gpt
- 
- #domain:
- ##e.g. nlp, cv, audio, multi-modal
- #- nlp
- 
- #language:
- ##list of language codes: https://help.aliyun.com/document_detail/215387.html?spm=a2c4g.11186623.0.0.9f8d7467kni6Aa
- #- cn
- 
- #metrics:
- ##e.g. CIDEr, BLEU, ROUGE, etc.
- #- CIDEr
- 
- #tags:
- ##various custom tags, including training methods such as pretrained, fine-tuned, instruction-tuned, RL-tuned, and others
- #- pretrained
- 
- #tools:
- ##e.g. vllm, fastchat, llamacpp, AdaSeq, etc.
- #- vllm
- ---
- ### The contributors of this model have not provided a more detailed model introduction. Model files and weights are available on the "Model files" page.
- #### You can download the model via the git clone command below, or with the ModelScope SDK
- 
- SDK download
- ```bash
- #Install ModelScope
- pip install modelscope
  ```
- ```python
- #Download the model via the SDK
- from modelscope import snapshot_download
- model_dir = snapshot_download('speech_tts/CosyVoice-300M-SFT')
  ```
- Git download
  ```
- #Download the model via Git
- git clone https://www.modelscope.cn/speech_tts/CosyVoice-300M-SFT.git
  ```
- 
- <p style="color: lightgrey;">If you are a contributor to this model, we invite you to promptly complete the model card in accordance with the <a href="https://modelscope.cn/docs/ModelScope%E6%A8%A1%E5%9E%8B%E6%8E%A5%E5%85%A5%E6%B5%81%E7%A8%8B%E6%A6%82%E8%A7%88" style="color: lightgrey; text-decoration: underline;">model contribution documentation</a>.</p>
+ # CosyVoice
+ 
+ ## Install
+ 
+ **Clone and install**
+ 
+ - Clone the repo
+ ``` sh
+ git clone https://github.com/modelscope/cosyvoice.git
+ ```
+ 
+ - Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
+ - Create a Conda env:
+ 
+ ``` sh
+ conda create -n cosyvoice python=3.8
+ conda activate cosyvoice
+ pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
+ ```
+ 
+ **Model download**
+ 
+ We strongly recommend that you download our pretrained multi_lingual and multi_emotion models.
+ 
+ If you are an expert in this field and are only interested in training your own CosyVoice model from scratch, you can skip this step.
+ 
+ ``` sh
+ mkdir -p pretrained_models
+ git clone https://www.modelscope.cn/CosyVoice/multi_lingual_cosytts.git pretrained_models/multi_lingual_cosytts
+ git clone https://www.modelscope.cn/CosyVoice/multi_emotion_cosytts.git pretrained_models/multi_emotion_cosytts
+ ```
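
Alternatively, the ModelScope SDK shown in the previous README should also work for these repos; a hedged sketch, assuming the model ids `CosyVoice/multi_lingual_cosytts` and `CosyVoice/multi_emotion_cosytts` match the clone URLs above:

``` python
# Sketch: fetch the pretrained models via the ModelScope SDK instead of git.
# The model ids are inferred from the clone URLs above (an assumption).
from modelscope import snapshot_download

model_dir = snapshot_download('CosyVoice/multi_lingual_cosytts')
print('downloaded to', model_dir)
```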
+ 
+ **Basic Usage**
+ 
+ For zero_shot and sft inference, please use the models in `pretrained_models/multi_lingual_cosytts`.
+ 
+ ``` python
+ from cosyvoice.cli.cosyvoice import CosyVoice
+ from cosyvoice.utils.file_utils import load_wav
+ import torchaudio
+ 
+ cosyvoice = CosyVoice('pretrained_models/multi_lingual_cosytts')
+ 
+ # sft usage
+ print(cosyvoice.list_avaliable_spks())
+ output = cosyvoice.inference_sft('hello, my name is Jack. What is your name?', 'aishuo')
+ torchaudio.save('sft.wav', output['tts_speech'], 22050)
+ 
+ # zero_shot usage
+ prompt_speech_22050 = load_wav('1089_134686_000002_000000.wav', 22050)
+ output = cosyvoice.inference_zero_shot('hello, my name is Jack. What is your name?', 'It would be a gloomy secret night.', prompt_speech_22050)
+ torchaudio.save('zero_shot.wav', output['tts_speech'], 22050)
+ ```
+ 
+ For instruct inference, please use the models in `pretrained_models/multi_emotion_cosytts`.
+ 
+ ``` python
+ from cosyvoice.cli.cosyvoice import CosyVoice
+ from cosyvoice.utils.file_utils import load_wav
+ import torchaudio
+ 
+ cosyvoice = CosyVoice('pretrained_models/multi_emotion_cosytts')
+ 
+ # instruct usage
+ prompt_speech_22050 = load_wav('1089_134686_000002_000000.wav', 22050)
+ output = cosyvoice.inference_instruct('hello, my name is Jack. What is your name?', 'It would be a gloomy secret night.', prompt_speech_22050, 'A serene woman articulates thoughtfully in a high pitch and slow tempo, exuding a peaceful and joyful aura.')
+ torchaudio.save('instruct.wav', output['tts_speech'], 22050)
+ ```
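
In every mode above, the synthesized waveform comes back as `output['tts_speech']` and is saved at 22050 Hz (the `sample_rate` fixed in cosyvoice.yaml). Here is a small sketch for sanity-checking a result before saving; the `[channels, num_samples]` tensor layout is an assumption implied by the `torchaudio.save` calls:

``` python
# Sketch: sanity-check a synthesis result before writing it to disk.
# Assumes output['tts_speech'] is a [channels, num_samples] float tensor,
# as implied by the torchaudio.save(..., 22050) calls above.
speech = output['tts_speech']
sample_rate = 22050  # matches `sample_rate` in cosyvoice.yaml

duration_s = speech.shape[-1] / sample_rate
peak = speech.abs().max().item()
print(f'shape={tuple(speech.shape)}, duration={duration_s:.2f}s, peak={peak:.3f}')
```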
+ 
+ **Advanced Usage**
+ 
+ For advanced users, we have provided train and inference scripts in `examples/libritts/cosyvoice/run.sh`.
+ You can get familiar with CosyVoice by following this recipe.
+ 
+ **Start web demo**
+ 
+ You can use our web demo page to get familiar with CosyVoice quickly.
+ We only support zero_shot and sft inference in the web demo.
+ 
+ Please see the demo website for details.
+ 
+ ``` sh
+ python3 webui.py --port 50000 --model_dir pretrained_models/multi_lingual_cosytts
+ ```
+ 
+ **Build for deployment**
+ 
+ Optionally, if you want to use grpc for service deployment,
+ you can run the following steps. Otherwise, you can just ignore this step.
+ 
+ ``` sh
+ cd runtime/python
+ docker build -t cosyvoice:v1.0 .
+ # change multi_lingual_cosytts to multi_emotion_cosytts if you want to use instruct inference
+ docker run -d --runtime=nvidia -v `pwd`/../../pretrained_models/multi_lingual_cosytts:/opt/cosyvoice/cosyvoice/runtime/pretrained_models -p 50000:50000 cosyvoice:v1.0
+ python3 client.py --port 50000 --mode <sft|zero_shot|instruct>
+ ```
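
Before invoking `client.py`, it may be worth confirming that the container is actually listening on port 50000; a minimal reachability check (plain TCP connect only; it does not validate the grpc service itself):

``` python
# Sketch: check that the deployed container accepts TCP connections on
# port 50000 before running client.py. Reachability only, not grpc health.
import socket

try:
    with socket.create_connection(('localhost', 50000), timeout=5):
        print('port 50000 is reachable')
except OSError as err:
    print('service not reachable yet:', err)
```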
campplus.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a6ac6a63997761ae2997373e2ee1c47040854b4b759ea41ec48e4e42df0f4d73
+ size 28303423
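
The three `+` lines above are a Git LFS pointer, not the model itself: `oid` is the SHA-256 of the real blob and `size` its byte count, per the spec URL in the `version` line. A minimal sketch of parsing such a pointer file:

``` python
# Sketch: parse a Git LFS pointer file (the small "key value" text stub
# that stands in for a binary until `git lfs pull` fetches the real blob).
def parse_lfs_pointer(path: str) -> dict:
    fields = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.strip().partition(' ')
            fields[key] = value
    return fields

# Usage (assumes the LFS blobs have not been pulled yet, so the file on
# disk is still the pointer stub):
ptr = parse_lfs_pointer('campplus.onnx')
print(ptr['oid'], int(ptr['size']))  # sha256:a6ac... 28303423
```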
cosyvoice.yaml ADDED
@@ -0,0 +1,197 @@
+ # set random seed, so that you may reproduce your result.
+ __set_seed1: !apply:random.seed [1986]
+ __set_seed2: !apply:numpy.random.seed [1986]
+ __set_seed3: !apply:torch.manual_seed [1986]
+ __set_seed4: !apply:torch.cuda.manual_seed_all [1986]
+ 
+ # fixed params
+ sample_rate: 22050
+ text_encoder_input_size: 512
+ llm_input_size: 1024
+ llm_output_size: 1024
+ spk_embed_dim: 192
+ 
+ # model params
+ # for all class/function included in this repo, we use !<name> or !<new> for initialization, so that user may find all corresponding class/function according to one single yaml.
+ # for system/third_party class/function, we do not require this.
+ llm: !new:cosyvoice.llm.llm.TransformerLM
+     text_encoder_input_size: !ref <text_encoder_input_size>
+     llm_input_size: !ref <llm_input_size>
+     llm_output_size: !ref <llm_output_size>
+     text_token_size: 51866
+     speech_token_size: 4096
+     length_normalized_loss: True
+     lsm_weight: 0
+     spk_embed_dim: !ref <spk_embed_dim>
+     text_encoder: !new:cosyvoice.transformer.encoder.ConformerEncoder
+         input_size: !ref <text_encoder_input_size>
+         output_size: 1024
+         attention_heads: 16
+         linear_units: 4096
+         num_blocks: 6
+         dropout_rate: 0.1
+         positional_dropout_rate: 0.1
+         attention_dropout_rate: 0
+         normalize_before: True
+         input_layer: 'linear'
+         pos_enc_layer_type: 'rel_pos_espnet'
+         selfattention_layer_type: 'rel_selfattn'
+         use_cnn_module: False
+         macaron_style: False
+         use_dynamic_chunk: False
+         use_dynamic_left_chunk: False
+         static_chunk_size: 1
+     llm: !new:cosyvoice.transformer.encoder.TransformerEncoder
+         input_size: !ref <llm_input_size>
+         output_size: !ref <llm_output_size>
+         attention_heads: 16
+         linear_units: 4096
+         num_blocks: 14
+         dropout_rate: 0.1
+         positional_dropout_rate: 0.1
+         attention_dropout_rate: 0
+         input_layer: 'linear_legacy'
+         pos_enc_layer_type: 'rel_pos_espnet'
+         selfattention_layer_type: 'rel_selfattn'
+         static_chunk_size: 1
+ 
+ flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
+     input_size: 512
+     output_size: 80
+     spk_embed_dim: !ref <spk_embed_dim>
+     output_type: 'mel'
+     vocab_size: 4096
+     input_frame_rate: 50
+     only_mask_loss: True
+     encoder: !new:cosyvoice.transformer.encoder.ConformerEncoder
+         output_size: 512
+         attention_heads: 8
+         linear_units: 2048
+         num_blocks: 6
+         dropout_rate: 0.1
+         positional_dropout_rate: 0.1
+         attention_dropout_rate: 0.1
+         normalize_before: True
+         input_layer: 'linear'
+         pos_enc_layer_type: 'rel_pos_espnet'
+         selfattention_layer_type: 'rel_selfattn'
+         input_size: 512
+         use_cnn_module: False
+         macaron_style: False
+     length_regulator: !new:cosyvoice.flow.length_regulator.InterpolateRegulator
+         channels: 80
+         sampling_ratios: [1, 1, 1, 1]
+     decoder: !new:cosyvoice.flow.flow_matching.ConditionalCFM
+         in_channels: 240
+         n_spks: 1
+         spk_emb_dim: 80
+         cfm_params: !new:omegaconf.DictConfig
+             content:
+                 sigma_min: 1e-06
+                 solver: 'euler'
+                 t_scheduler: 'cosine'
+                 training_cfg_rate: 0.2
+                 inference_cfg_rate: 0.7
+                 reg_loss_type: 'l1'
+         estimator: !new:cosyvoice.flow.decoder.ConditionalDecoder
+             in_channels: 320
+             out_channels: 80
+             channels: [256, 256]
+             dropout: 0
+             attention_head_dim: 64
+             n_blocks: 4
+             num_mid_blocks: 12
+             num_heads: 8
+             act_fn: 'gelu'
+ 
+ hift: !new:cosyvoice.hifigan.generator.HiFTGenerator
+     in_channels: 80
+     base_channels: 512
+     nb_harmonics: 8
+     sampling_rate: !ref <sample_rate>
+     nsf_alpha: 0.1
+     nsf_sigma: 0.003
+     nsf_voiced_threshold: 10
+     upsample_rates: [8, 8]
+     upsample_kernel_sizes: [16, 16]
+     istft_params:
+         n_fft: 16
+         hop_len: 4
+     resblock_kernel_sizes: [3, 7, 11]
+     resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+     source_resblock_kernel_sizes: [7, 11]
+     source_resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5]]
+     lrelu_slope: 0.1
+     audio_limit: 0.99
+     f0_predictor: !new:cosyvoice.hifigan.f0_predictor.ConvRNNF0Predictor
+         num_class: 1
+         in_channels: 80
+         cond_channels: 512
+ 
+ # processor functions
+ parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
+ get_tokenizer: !name:whisper.tokenizer.get_tokenizer
+     multilingual: True
+     num_languages: 100
+     language: 'en'
+     task: 'transcribe'
+ allowed_special: 'all'
+ tokenize: !name:cosyvoice.dataset.processor.tokenize
+     get_tokenizer: !ref <get_tokenizer>
+     allowed_special: !ref <allowed_special>
+ filter: !name:cosyvoice.dataset.processor.filter
+     max_length: 40960
+     min_length: 0
+     token_max_length: 200
+     token_min_length: 1
+ resample: !name:cosyvoice.dataset.processor.resample
+     resample_rate: !ref <sample_rate>
+ feat_extractor: !name:matcha.utils.audio.mel_spectrogram
+     n_fft: 1024
+     num_mels: 80
+     sampling_rate: !ref <sample_rate>
+     hop_size: 256
+     win_size: 1024
+     fmin: 0
+     fmax: 8000
+     center: False
+ compute_fbank: !name:cosyvoice.dataset.processor.compute_fbank
+     feat_extractor: !ref <feat_extractor>
+ parse_embedding: !name:cosyvoice.dataset.processor.parse_embedding
+     normalize: True
+ shuffle: !name:cosyvoice.dataset.processor.shuffle
+     shuffle_size: 1000
+ sort: !name:cosyvoice.dataset.processor.sort
+     sort_size: 500  # sort_size should be less than shuffle_size
+ batch: !name:cosyvoice.dataset.processor.batch
+     batch_type: 'dynamic'
+     max_frames_in_batch: 2000
+ padding: !name:cosyvoice.dataset.processor.padding
+ 
+ # dataset processor pipeline
+ data_pipeline: [
+     !ref <parquet_opener>,
+     !ref <tokenize>,
+     !ref <filter>,
+     !ref <resample>,
+     !ref <compute_fbank>,
+     !ref <parse_embedding>,
+     !ref <shuffle>,
+     !ref <sort>,
+     !ref <batch>,
+     !ref <padding>,
+ ]
+ 
+ # train conf
+ train_conf:
+     optim: adam
+     optim_conf:
+         lr: 0.001
+     scheduler: warmuplr
+     scheduler_conf:
+         warmup_steps: 2500
+     max_epoch: 200
+     grad_clip: 5
+     accum_grad: 2
+     log_interval: 100
+     save_per_step: -1
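
Two reading notes on this config: the `!new:`, `!name:`, `!ref`, and `!apply:` tags are HyperPyYAML syntax, so loading the file instantiates the model classes directly; and the mel extractor's `hop_size: 256` at 22050 Hz yields roughly 86 mel frames per second, against the speech tokenizer's `input_frame_rate: 50`. A minimal loading sketch, assuming the `hyperpyyaml` package and an importable `cosyvoice` source tree (the exact code inside `CosyVoice(...)` may differ):

``` python
# Sketch: load cosyvoice.yaml with HyperPyYAML, which resolves !ref
# references and instantiates every !new: class on the spot.
# Assumes `pip install hyperpyyaml` and that the cosyvoice, matcha, and
# whisper packages are importable, since the tags reference their classes.
from hyperpyyaml import load_hyperpyyaml

with open('cosyvoice.yaml', 'r') as f:
    configs = load_hyperpyyaml(f)

llm, flow, hift = configs['llm'], configs['flow'], configs['hift']
print(type(llm).__name__, type(flow).__name__, type(hift).__name__)
```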
flow.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:21eae78c105b5e1c6c337b04f667843377651b4bcfb2d43247ed3ad7fd0a3470
+ size 419900943
hift.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:91e679b6ca1eff71187ffb4f3ab0444935594cdcc20a9bd12afad111ef8d6012
+ size 81896716
llm.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:15e4bc254f13f06502f0e93fa253340638538cf75827fd8b690f8b37b4e0d5c6
+ size 1242994835
speech_tokenizer_v1.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23b5a723ed9143aebfd9ffda14ac4c21231f31c35ef837b6a13bb9e5488abb1e
+ size 522624269
spk2info.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb491609da25a3514b76d76d55dbf26686394c378d596128fb3e15f2b74cdf44
+ size 1317821