gongting committed
Commit d5431db · verified · 1 parent: 9baaa01

Upload README.md

Files changed (1)
  1. README.md +172 -434
README.md CHANGED
@@ -1,489 +1,227 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - en
5
- pipeline_tag: image-text-to-text
6
- tags:
7
- - multimodal
8
- base_model: qwen/Qwen2-VL-2B-Instruct
9
- studios:
10
- - qwen/Qwen2-VL-2B-Instruct-demo
11
- ---
12
 
13
- # Qwen2-VL-2B-Instruct
 
14
 
15
- ## Introduction
16
 
17
- We're excited to unveil **Qwen2-VL**, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
18
 
19
- ### What’s New in Qwen2-VL?
20
 
21
- #### Key Enhancements:
22
 
23
- * **SoTA understanding of images of various resolution & ratio**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
24
 
25
- * **Understanding videos of 20min+**: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
26
 
27
- * **Agent that can operate your mobiles, robots, etc.**: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.
28
 
29
- * **Multilingual Support**: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
 
30
 
31
- #### Model Architecture Updates:
32
 
33
- * **Naive Dynamic Resolution**: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
 
34
 
35
- * **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
36
 
37
- ![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/qwen2_vl.jpg)
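-
- As a rough illustration of the Naive Dynamic Resolution point above, the sketch below maps an arbitrary image size to a visual-token count using the 28-pixel patch grid and the `min_pixels`/`max_pixels` budget described later in this README. It is a hypothetical helper of ours, not the model's actual preprocessing code.
-
- ```python
- import math
-
- def visual_token_count(height: int, width: int,
-                        min_pixels: int = 256 * 28 * 28,
-                        max_pixels: int = 1280 * 28 * 28) -> int:
-     """Hypothetical sketch: estimate how many 28x28 patches an image yields
-     once its pixel count is clamped to [min_pixels, max_pixels] while
-     keeping the aspect ratio."""
-     pixels = height * width
-     scale = 1.0
-     if pixels > max_pixels:
-         scale = math.sqrt(max_pixels / pixels)
-     elif pixels < min_pixels:
-         scale = math.sqrt(min_pixels / pixels)
-     # Each side is rounded to a multiple of 28, as noted in the resolution tips below.
-     new_h = max(28, round(height * scale / 28) * 28)
-     new_w = max(28, round(width * scale / 28) * 28)
-     return (new_h // 28) * (new_w // 28)
-
- print(visual_token_count(1080, 1920))  # larger images map to more visual tokens
- ```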
 
 
38
 
39
- We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 2B Qwen2-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub](https://github.com/QwenLM/Qwen2-VL).
40
 
 
41
 
 
42
 
43
- ## Evaluation
44
-
45
- ### Image Benchmarks
46
-
47
- | Benchmark | InternVL2-2B | MiniCPM-V 2.0 | **Qwen2-VL-2B** |
48
- | :--- | :---: | :---: | :---: |
49
- | DocVQA<sub>test</sub> | 86.9 | - | **90.1** |
50
- | InfoVQA<sub>test</sub> | 58.9 | - | **65.5** |
51
- | ChartQA<sub>test</sub> | **76.2** | - | 73.5 |
52
- | TextVQA<sub>val</sub> | 73.4 | - | **79.7** |
53
- | OCRBench | 781 | 605 | **794** |
54
- | MTVQA | - | - | **20.0** |
55
- | MMMU<sub>val</sub> | 36.3 | 38.2 | **41.1** |
56
- | RealWorldQA | 57.3 | 55.8 | **62.9** |
57
- | MME<sub>sum</sub> | **1876.8** | 1808.6 | 1872.0 |
58
- | MMBench-EN<sub>test</sub> | 73.2 | 69.1 | **74.9** |
59
- | MMBench-CN<sub>test</sub> | 70.9 | 66.5 | **73.5** |
60
- | MMBench-V1.1<sub>test</sub> | 69.6 | 65.8 | **72.2** |
61
- | MMT-Bench<sub>test</sub> | - | - | **54.5** |
62
- | MMStar | **49.8** | 39.1 | 48.0 |
63
- | MMVet<sub>GPT-4-Turbo</sub> | 39.7 | 41.0 | **49.5** |
64
- | HallBench<sub>avg</sub> | 38.0 | 36.1 | **41.7** |
65
- | MathVista<sub>testmini</sub> | **46.0** | 39.8 | 43.0 |
66
- | MathVision | - | - | **12.4** |
67
 
68
- ### Video Benchmarks
69
 
70
- | Benchmark | **Qwen2-VL-2B** |
71
- | :--- | :---: |
72
- | MVBench | **63.2** |
73
- | PerceptionTest<sub>test</sub> | **53.9** |
74
- | EgoSchema<sub>test</sub> | **54.9** |
75
- | Video-MME<sub>wo/w subs</sub> | **55.6**/**60.4** |
76
 
 
77
 
78
- ## Requirements
79
- The code for Qwen2-VL is included in the latest Hugging Face transformers. We advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`, or you might encounter the following error:
80
  ```
81
- KeyError: 'qwen2_vl'
82
  ```
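-
- For convenience, the from-source install mentioned above as a shell snippet (the version print is just an optional sanity check):
-
- ```bash
- pip install git+https://github.com/huggingface/transformers
- # optional: check which transformers version was installed
- python -c "import transformers; print(transformers.__version__)"
- ```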
83
 
84
- ## Quickstart
85
- We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
86
 
87
- ```bash
88
- pip install qwen-vl-utils
89
  ```
90
 
91
- Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:
92
 
93
- ```python
94
- from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
95
- from qwen_vl_utils import process_vision_info
96
- from modelscope import snapshot_download
97
- model_dir = snapshot_download("qwen/Qwen2-VL-2B-Instruct")
98
-
99
- # default: Load the model on the available device(s)
100
- model = Qwen2VLForConditionalGeneration.from_pretrained(
101
- model_dir, torch_dtype="auto", device_map="auto"
102
- )
103
-
104
- # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
105
- # model = Qwen2VLForConditionalGeneration.from_pretrained(
106
- # model_dir,
107
- # torch_dtype=torch.bfloat16,
108
- # attn_implementation="flash_attention_2",
109
- # device_map="auto",
110
- # )
111
-
112
- # default processor
113
- processor = AutoProcessor.from_pretrained(model_dir)
114
-
115
- # The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
116
- # min_pixels = 256*28*28
117
- # max_pixels = 1280*28*28
118
- # processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
119
-
120
- messages = [
121
- {
122
- "role": "user",
123
- "content": [
124
- {
125
- "type": "image",
126
- "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
127
- },
128
- {"type": "text", "text": "Describe this image."},
129
- ],
130
- }
131
- ]
132
-
133
- # Preparation for inference
134
- text = processor.apply_chat_template(
135
- messages, tokenize=False, add_generation_prompt=True
136
- )
137
- image_inputs, video_inputs = process_vision_info(messages)
138
- inputs = processor(
139
- text=[text],
140
- images=image_inputs,
141
- videos=video_inputs,
142
- padding=True,
143
- return_tensors="pt",
144
- )
145
- inputs = inputs.to("cuda")
146
-
147
- # Inference: Generation of the output
148
- generated_ids = model.generate(**inputs, max_new_tokens=128)
149
- generated_ids_trimmed = [
150
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
151
- ]
152
- output_text = processor.batch_decode(
153
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
154
- )
155
- print(output_text)
156
- ```
157
- <details>
158
- <summary>Without qwen_vl_utils</summary>
159
 
160
- ```python
161
- from PIL import Image
162
- import requests
163
- import torch
164
- from torchvision import io
165
- from typing import Dict
166
- from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
167
- from modelscope import snapshot_download
168
- model_dir = snapshot_download("qwen/Qwen2-VL-2B-Instruct")
169
- # Load the model in half-precision on the available device(s)
170
- model = Qwen2VLForConditionalGeneration.from_pretrained(
171
- model_dir, torch_dtype="auto", device_map="auto"
172
- )
173
- processor = AutoProcessor.from_pretrained(model_dir)
174
-
175
- # Image
176
- url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
177
- image = Image.open(requests.get(url, stream=True).raw)
178
-
179
- conversation = [
180
- {
181
- "role": "user",
182
- "content": [
183
- {
184
- "type": "image",
185
- },
186
- {"type": "text", "text": "Describe this image."},
187
- ],
188
- }
189
- ]
190
-
191
-
192
- # Preprocess the inputs
193
- text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
194
- # Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
195
-
196
- inputs = processor(
197
- text=[text_prompt], images=[image], padding=True, return_tensors="pt"
198
- )
199
- inputs = inputs.to("cuda")
200
-
201
- # Inference: Generation of the output
202
- output_ids = model.generate(**inputs, max_new_tokens=128)
203
- generated_ids = [
204
- output_ids[len(input_ids) :]
205
- for input_ids, output_ids in zip(inputs.input_ids, output_ids)
206
- ]
207
- output_text = processor.batch_decode(
208
- generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
209
- )
210
- print(output_text)
211
  ```
212
- </details>
213
-
214
- <details>
215
- <summary>Multi image inference</summary>
216
 
 
217
  ```python
218
- # Messages containing multiple images and a text query
219
- messages = [
220
- {
221
- "role": "user",
222
- "content": [
223
- {"type": "image", "image": "file:///path/to/image1.jpg"},
224
- {"type": "image", "image": "file:///path/to/image2.jpg"},
225
- {"type": "text", "text": "Identify the similarities between these images."},
226
- ],
227
- }
228
- ]
229
-
230
- # Preparation for inference
231
- text = processor.apply_chat_template(
232
- messages, tokenize=False, add_generation_prompt=True
233
- )
234
- image_inputs, video_inputs = process_vision_info(messages)
235
- inputs = processor(
236
- text=[text],
237
- images=image_inputs,
238
- videos=video_inputs,
239
- padding=True,
240
- return_tensors="pt",
241
- )
242
- inputs = inputs.to("cuda")
243
-
244
- # Inference
245
- generated_ids = model.generate(**inputs, max_new_tokens=128)
246
- generated_ids_trimmed = [
247
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
248
- ]
249
- output_text = processor.batch_decode(
250
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
251
- )
252
- print(output_text)
253
- ```
254
- </details>
255
 
256
- <details>
257
- <summary>Video inference</summary>
 
 
 
258
 
259
- ```python
260
- # Messages containing a list of images treated as a video, plus a text query
261
- messages = [
262
- {
263
- "role": "user",
264
- "content": [
265
- {
266
- "type": "video",
267
- "video": [
268
- "file:///path/to/frame1.jpg",
269
- "file:///path/to/frame2.jpg",
270
- "file:///path/to/frame3.jpg",
271
- "file:///path/to/frame4.jpg",
272
- ],
273
- "fps": 1.0,
274
- },
275
- {"type": "text", "text": "Describe this video."},
276
- ],
277
- }
278
- ]
279
- # Messages containing a video and a text query
280
- messages = [
281
- {
282
- "role": "user",
283
- "content": [
284
- {
285
- "type": "video",
286
- "video": "file:///path/to/video1.mp4",
287
- "max_pixels": 360 * 420,
288
- "fps": 1.0,
289
- },
290
- {"type": "text", "text": "Describe this video."},
291
- ],
292
- }
293
- ]
294
-
295
- # Preparation for inference
296
- text = processor.apply_chat_template(
297
- messages, tokenize=False, add_generation_prompt=True
298
- )
299
- image_inputs, video_inputs = process_vision_info(messages)
300
- inputs = processor(
301
- text=[text],
302
- images=image_inputs,
303
- videos=video_inputs,
304
- padding=True,
305
- return_tensors="pt",
306
- )
307
- inputs = inputs.to("cuda")
308
-
309
- # Inference
310
- generated_ids = model.generate(**inputs, max_new_tokens=128)
311
- generated_ids_trimmed = [
312
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
313
- ]
314
- output_text = processor.batch_decode(
315
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
316
- )
317
- print(output_text)
318
- ```
319
- </details>
320
 
321
- <details>
322
- <summary>Batch inference</summary>
 
 
323
 
 
324
  ```python
325
- # Sample messages for batch inference
326
- messages1 = [
327
- {
328
- "role": "user",
329
- "content": [
330
- {"type": "image", "image": "file:///path/to/image1.jpg"},
331
- {"type": "image", "image": "file:///path/to/image2.jpg"},
332
- {"type": "text", "text": "What are the common elements in these pictures?"},
333
- ],
334
- }
335
- ]
336
- messages2 = [
337
- {"role": "system", "content": "You are a helpful assistant."},
338
- {"role": "user", "content": "Who are you?"},
339
- ]
340
- # Combine messages for batch processing
341
- messages = [messages1, messages2]
342
-
343
- # Preparation for batch inference
344
- texts = [
345
- processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
346
- for msg in messages
347
- ]
348
- image_inputs, video_inputs = process_vision_info(messages)
349
- inputs = processor(
350
- text=texts,
351
- images=image_inputs,
352
- videos=video_inputs,
353
- padding=True,
354
- return_tensors="pt",
355
- )
356
- inputs = inputs.to("cuda")
357
-
358
- # Batch Inference
359
- generated_ids = model.generate(**inputs, max_new_tokens=128)
360
- generated_ids_trimmed = [
361
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
362
- ]
363
- output_texts = processor.batch_decode(
364
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
365
- )
366
- print(output_texts)
367
  ```
368
- </details>
369
 
370
- ### More Usage Tips
371
 
372
- For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
373
 
374
- ```python
375
- # You can directly insert a local file path, a URL, or a base64-encoded image at the desired position in the text.
376
- ## Local file path
377
- messages = [
378
- {
379
- "role": "user",
380
- "content": [
381
- {"type": "image", "image": "file:///path/to/your/image.jpg"},
382
- {"type": "text", "text": "Describe this image."},
383
- ],
384
- }
385
- ]
386
- ## Image URL
387
- messages = [
388
- {
389
- "role": "user",
390
- "content": [
391
- {"type": "image", "image": "http://path/to/your/image.jpg"},
392
- {"type": "text", "text": "Describe this image."},
393
- ],
394
- }
395
- ]
396
- ## Base64 encoded image
397
- messages = [
398
- {
399
- "role": "user",
400
- "content": [
401
- {"type": "image", "image": "data:image;base64,/9j/..."},
402
- {"type": "text", "text": "Describe this image."},
403
- ],
404
- }
405
- ]
406
  ```
407
- #### Image Resolution for performance boost
408
 
409
- The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
410
 
411
- ```python
412
- min_pixels = 256 * 28 * 28
413
- max_pixels = 1280 * 28 * 28
414
- processor = AutoProcessor.from_pretrained(
415
- model_dir, min_pixels=min_pixels, max_pixels=max_pixels
416
- )
417
- ```
418
 
419
- In addition, we provide two methods for fine-grained control over the image size input to the model:
420
 
421
- 1. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
422
-
423
- 2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
424
 
425
- ```python
426
- # resized_height and resized_width
427
- messages = [
428
- {
429
- "role": "user",
430
- "content": [
431
- {
432
- "type": "image",
433
- "image": "file:///path/to/your/image.jpg",
434
- "resized_height": 280,
435
- "resized_width": 420,
436
- },
437
- {"type": "text", "text": "Describe this image."},
438
- ],
439
- }
440
- ]
441
- # min_pixels and max_pixels
442
- messages = [
443
- {
444
- "role": "user",
445
- "content": [
446
- {
447
- "type": "image",
448
- "image": "file:///path/to/your/image.jpg",
449
- "min_pixels": 50176,
450
- "max_pixels": 50176,
451
- },
452
- {"type": "text", "text": "Describe this image."},
453
- ],
454
- }
455
- ]
456
  ```
457
 
458
- ## Limitations
459
-
460
- While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
461
 
462
- 1. Lack of Audio Support: The current model does **not comprehend audio information** within videos.
463
- 2. Data timeliness: Our image dataset is **updated until June 2023**, and information subsequent to this date may not be covered.
464
- 3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
465
- 4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
466
- 5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
467
- 6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
468
 
469
- These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
470
 
 
471
 
472
- ## Citation
473
 
474
- If you find our work helpful, feel free to cite us.
475
 
476
- ```
477
- @article{Qwen2-VL,
478
- title={Qwen2-VL},
479
- author={Qwen team},
480
- year={2024}
481
- }
482
-
483
- @article{Qwen-VL,
484
- title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
485
- author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
486
- journal={arXiv preprint arXiv:2308.12966},
487
- year={2023}
488
- }
489
- ```
 
1
+ [![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)](https://github.com/Akshay090/svg-banners)
2
 
3
+ ## 👉🏻 CosyVoice 👈🏻
4
+ **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/spaces/FunAudioLLM/CosyVoice2-0.5B)
5
 
6
+ **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice-300M)
7
 
8
+ ## Highlight🔥
9
 
10
+ **CosyVoice 2.0** has been released! Compared to version 1.0, the new version offers more accurate, more stable, faster, and better speech generation capabilities.
11
+ ### Multilingual
12
+ - **Supported Languages**: Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
13
+ - **Crosslingual & Mixlingual**: Supports zero-shot voice cloning for cross-lingual and code-switching scenarios.
14
+ ### Ultra-Low Latency
15
+ - **Bidirectional Streaming Support**: CosyVoice 2.0 integrates offline and streaming modeling technologies.
16
+ - **Rapid First Packet Synthesis**: Achieves latency as low as 150ms while maintaining high-quality audio output.
17
+ ### High Accuracy
18
+ - **Improved Pronunciation**: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
19
+ - **Benchmark Achievements**: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
20
+ ### Strong Stability
21
+ - **Consistency in Timbre**: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
22
+ - **Cross-language Synthesis**: Marked improvements compared to version 1.0.
23
+ ### Natural Experience
24
+ - **Enhanced Prosody and Sound Quality**: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
25
+ - **Emotional and Dialectal Flexibility**: Now supports more granular emotional controls and accent adjustments.
26
 
27
+ ## Roadmap
28
 
29
+ - [x] 2024/12
30
 
31
+ - [x] 25hz cosyvoice 2.0 released
32
 
33
+ - [x] 2024/09
34
 
35
+ - [x] 25hz cosyvoice base model
36
+ - [x] 25hz cosyvoice voice conversion model
37
 
38
+ - [x] 2024/08
39
 
40
+ - [x] Repetition Aware Sampling (RAS) inference for LLM stability
41
+ - [x] Streaming inference mode support, including KV cache and SDPA for RTF optimization
42
 
43
+ - [x] 2024/07
44
 
45
+ - [x] Flow matching training support
46
+ - [x] WeTextProcessing support when ttsfrd is not available
47
+ - [x] Fastapi server and client
48
 
 
49
 
50
+ ## Install
51
 
52
+ **Clone and install**
53
 
54
+ - Clone the repo
55
+ ``` sh
56
+ git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
57
+ # If you failed to clone the submodule due to network failures, please run the following command until it succeeds
58
+ cd CosyVoice
59
+ git submodule update --init --recursive
60
+ ```
61
 
62
+ - Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
63
+ - Create Conda env:
64
+
65
+ ``` sh
66
+ conda create -n cosyvoice python=3.10
67
+ conda activate cosyvoice
68
+ # pynini is required by WeTextProcessing; use conda to install it, as the conda package works on all platforms.
69
+ conda install -y -c conda-forge pynini==2.1.5
70
+ pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
71
+
72
+ # If you encounter sox compatibility issues
73
+ # ubuntu
74
+ sudo apt-get install sox libsox-dev
75
+ # centos
76
+ sudo yum install sox sox-devel
77
+ ```
78
 
79
+ **Model download**
80
 
81
+ We strongly recommend that you download our pretrained `CosyVoice2-0.5B`, `CosyVoice-300M`, `CosyVoice-300M-SFT`, and `CosyVoice-300M-Instruct` models and the `CosyVoice-ttsfrd` resource.
82
 
83
+ ``` python
84
+ # Download the models via the ModelScope SDK
85
+ from modelscope import snapshot_download
86
+ snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
87
+ snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
88
+ snapshot_download('iic/CosyVoice-300M-25Hz', local_dir='pretrained_models/CosyVoice-300M-25Hz')
89
+ snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
90
+ snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
91
+ snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
92
  ```
93
+
94
+ ``` sh
95
+ # Download the models via git; make sure git lfs is installed
96
+ mkdir -p pretrained_models
97
+ git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
98
+ git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
99
+ git clone https://www.modelscope.cn/iic/CosyVoice-300M-25Hz.git pretrained_models/CosyVoice-300M-25Hz
100
+ git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
101
+ git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
102
+ git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
103
  ```
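+
+ If you prefer Hugging Face, a minimal sketch of the same download with `huggingface_hub` is below. It assumes the models are mirrored under the `FunAudioLLM` organization with the same names; adjust the repo IDs if they differ.
+
+ ``` python
+ # Hypothetical Hugging Face mirror download; the repo IDs are assumptions.
+ from huggingface_hub import snapshot_download
+ snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
+ snapshot_download('FunAudioLLM/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
+ ```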
104
 
105
+ Optionally, you can unzip the `ttsfrd` resource and install the `ttsfrd` package for better text normalization performance.
 
106
 
107
+ Note that this step is not necessary. If you do not install the `ttsfrd` package, WeTextProcessing is used by default.
108
+
109
+ ``` sh
110
+ cd pretrained_models/CosyVoice-ttsfrd/
111
+ unzip resource.zip -d .
112
+ pip install ttsfrd_dependency-0.1-py3-none-any.whl
113
+ pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
114
  ```
115
 
116
+ **Basic Usage**
117
 
118
+ We strongly recommend using `CosyVoice2-0.5B` for better performance.
119
+ Follow the code below for detailed usage of each model.
120
 
121
+ ``` python
122
+ import sys
123
+ sys.path.append('third_party/Matcha-TTS')
124
+ from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
125
+ from cosyvoice.utils.file_utils import load_wav
126
+ import torchaudio
127
  ```
128
 
129
+ **CosyVoice2 Usage**
130
  ```python
131
+ cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)
132
 
133
+ # NOTE if you want to reproduce the results on https://funaudiollm.github.io/cosyvoice2, please add text_frontend=False during inference
134
+ # zero_shot usage
135
+ prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
136
+ for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
137
+ torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
138
 
139
+ # fine-grained control; for supported control tokens, check cosyvoice/tokenizer/tokenizer.py#L248
140
+ for i, j in enumerate(cosyvoice.inference_cross_lingual('在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。', prompt_speech_16k, stream=False)):
141
+ torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
142
 
143
+ # instruct usage
144
+ for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
145
+ torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
146
+ ```
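+
+ The inference calls above return a generator, so streaming synthesis only needs `stream=True`. A minimal sketch under that assumption (our own variable names, concatenating the yielded chunks before saving):
+
+ ``` python
+ import torch
+
+ # Streaming sketch: collect the audio chunks as they are yielded.
+ chunks = []
+ for chunk in cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=True):
+     chunks.append(chunk['tts_speech'])  # each item is a dict holding a 'tts_speech' tensor
+ torchaudio.save('zero_shot_stream.wav', torch.cat(chunks, dim=1), cosyvoice.sample_rate)
+ ```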
147
 
148
+ **CosyVoice Usage**
149
  ```python
150
+ cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT', load_jit=False, load_trt=False, fp16=False)
151
+ # sft usage
152
+ print(cosyvoice.list_available_spks())
153
+ # change stream=True for chunk stream inference
154
+ for i, j in enumerate(cosyvoice.inference_sft('你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?', '中文女', stream=False)):
155
+ torchaudio.save('sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
156
+
157
+ cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M') # or change to pretrained_models/CosyVoice-300M-25Hz for 25Hz inference
158
+ # zero_shot usage, <|zh|><|en|><|jp|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean
159
+ prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
160
+ for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
161
+ torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
162
+ # cross_lingual usage
163
+ prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
164
+ for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
165
+ torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
166
+ # vc usage
167
+ prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
168
+ source_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
169
+ for i, j in enumerate(cosyvoice.inference_vc(source_speech_16k, prompt_speech_16k, stream=False)):
170
+ torchaudio.save('vc_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
171
+
172
+ cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
173
+ # instruct usage, supports <laughter></laughter><strong></strong>[laughter][breath]
174
+ for i, j in enumerate(cosyvoice.inference_instruct('在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.', stream=False)):
175
+ torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
176
  ```
 
177
 
178
+ **Start web demo**
179
 
180
+ You can use our web demo page to get familiar with CosyVoice quickly.
181
 
182
+ Please see the demo website for details.
183
+
184
+ ``` python
185
+ # change iic/CosyVoice-300M-SFT for sft inference, or iic/CosyVoice-300M-Instruct for instruct inference
186
+ python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
187
  ```
 
188
 
189
+ **Advanced Usage**
190
 
191
+ For advanced users, we provide training and inference scripts in `examples/libritts/cosyvoice/run.sh`.
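+
+ As a hedged example (the recipe's stage layout is an assumption borrowed from WeNet-style scripts; check the header of `run.sh` for the actual stages), a typical invocation looks like:
+
+ ``` sh
+ cd examples/libritts/cosyvoice
+ # edit the stage/stop_stage variables at the top of run.sh to select data prep, training, or inference, then:
+ bash run.sh
+ ```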
192
 
193
+ **Build for deployment**
194
 
195
+ Optionally, if you want to deploy CosyVoice as a service,
196
+ you can run the following steps.
 
197
 
198
+ ``` sh
199
+ cd runtime/python
200
+ docker build -t cosyvoice:v1.0 .
201
+ # change iic/CosyVoice-300M to iic/CosyVoice-300M-Instruct if you want to use instruct inference
202
+ # for grpc usage
203
+ docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity"
204
+ cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
205
+ # for fastapi usage
206
+ docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python3 server.py --port 50000 --model_dir iic/CosyVoice-300M && sleep infinity"
207
+ cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
208
  ```
209
 
210
+ ## Discussion & Communication
 
 
211
 
212
+ You can discuss directly on [GitHub Issues](https://github.com/FunAudioLLM/CosyVoice/issues).
213
 
214
+ You can also scan the QR code to join our official Dingding chat group.
215
 
216
+ <img src="./asset/dingding.png" width="250px">
217
 
218
+ ## Acknowledgements
219
 
220
+ 1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
221
+ 2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
222
+ 3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
223
+ 4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
224
+ 5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).
225
 
226
+ ## Disclaimer
227
+ The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.