## Test Results

- Test code: [speed_test.ipynb](speed_test.ipynb)
- Test hardware: Intel i5-12400 CPU, 48GB RAM, 1x NVIDIA GeForce RTX 4070
- Software environment: Ubuntu 24.04.1 LTS, CUDA 12.4, Python 3.10.16
- Test notes: all figures come from single-task runs (not a concurrency test)


## Default usage

In [None]:
import time
import asyncio
import torchaudio

import sys
sys.path.append('third_party/Matcha-TTS')

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

prompt_text = '希望你以后能够做得比我还好哟'
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# cosyvoice = CosyVoice2('./pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=True)
cosyvoice = CosyVoice2('./pretrained_models/CosyVoice2-0.5B', load_jit=True, load_trt=True, fp16=True)

## Accelerating LLM inference with vllm

#### 1. **Install dependencies**

(The original cosyvoice2 code also runs in this dependency environment.)
```bash
pip install -r requirements_vllm.txt
```

#### 2. **Copy files**
Copy some of the files from the pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN folder into the downloaded CosyVoice2-0.5B model folder, then replace Qwen2ForCausalLM with CosyVoice2Model in config.json.
```bash
cp pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN/{config.json,tokenizer_config.json,vocab.json,merges.txt} pretrained_models/CosyVoice2-0.5B/
sed -i 's/Qwen2ForCausalLM/CosyVoice2Model/' pretrained_models/CosyVoice2-0.5B/config.json
```
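
The same copy-and-patch step can be run from Python, for example inside the notebook itself. The following is a minimal sketch that mirrors the two shell commands above; `prepare_vllm_model_dir` and `FILES_TO_COPY` are hypothetical names introduced here, not part of CosyVoice.

```python
import shutil
from pathlib import Path

# File names taken from the cp command above.
FILES_TO_COPY = ("config.json", "tokenizer_config.json", "vocab.json", "merges.txt")


def prepare_vllm_model_dir(model_dir, blank_subdir="CosyVoice-BlankEN"):
    """Copy the config/tokenizer files up one level and patch the architecture name.

    Hypothetical helper: it only reproduces the cp + sed commands shown above.
    """
    model_dir = Path(model_dir)
    source_dir = model_dir / blank_subdir
    for name in FILES_TO_COPY:
        shutil.copy(source_dir / name, model_dir / name)

    # Like the sed command, but replaces every occurrence in the file.
    config_path = model_dir / "config.json"
    text = config_path.read_text(encoding="utf-8")
    config_path.write_text(
        text.replace("Qwen2ForCausalLM", "CosyVoice2Model"), encoding="utf-8"
    )
    return config_path
```

Calling `prepare_vllm_model_dir("pretrained_models/CosyVoice2-0.5B")` should leave the files in CosyVoice-BlankEN untouched and only rewrite the copied config.json.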

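#### **Notes:**

- After enabling load_trt, **warm up** the model with at least 10 inference runs; warming up with streaming inference works better.
- In a Jupyter notebook, running the code below with **vllm** requires vllm_use_cosyvoice2_model.py to be copied into the vllm package and registered in the _VLLM_MODELS dictionary. The following code does this.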

In [None]:
import os
import shutil

# Locate the installed vllm package
try:
    import vllm
except ImportError:
    raise ImportError("vllm package not installed")


vllm_path = os.path.dirname(vllm.__file__)
print(f"vllm package path: {vllm_path}")

# Destination path inside the vllm package
target_dir = os.path.join(vllm_path, "model_executor", "models")
target_file = os.path.join(target_dir, "cosyvoice2.py")

# Copy the model file
source_file = "./cosyvoice/llm/vllm_use_cosyvoice2_model.py"
if not os.path.exists(source_file):
    raise FileNotFoundError(f"Source file {source_file} not found")

shutil.copy(source_file, target_file)
print(f"Copied {source_file} to {target_file}")

# Patch registry.py
registry_path = os.path.join(target_dir, "registry.py")
new_entry = '    "CosyVoice2Model": ("cosyvoice2", "CosyVoice2Model"),  # noqa: E501\n'

# Read the current file contents
with open(registry_path, "r") as f:
    lines = f.readlines()

# Check whether the entry already exists
entry_exists = any("CosyVoice2Model" in line for line in lines)

if not entry_exists:
    # Find the insertion point
    insert_pos = None
    for i, line in enumerate(lines):
        if line.strip().startswith("**_FALLBACK_MODEL"):
            insert_pos = i + 1
            break
    
    if insert_pos is None:
        raise ValueError("Could not find insertion point in registry.py")
    
    # Insert the new entry
    lines.insert(insert_pos, new_entry)

    # Write the file back
    with open(registry_path, "w") as f:
        f.writelines(lines)
    print("Successfully updated registry.py")
else:
    print("Entry already exists in registry.py, skipping modification")

print("All operations completed successfully!")

In [1]:
import time
import asyncio
import torchaudio

import sys
sys.path.append('third_party/Matcha-TTS')

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

prompt_text = '希望你以后能够做得比我还好哟'
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# cosyvoice = CosyVoice2(
#     './pretrained_models/CosyVoice2-0.5B', 
#     load_jit=False, 
#     load_trt=False, 
#     fp16=True, 
#     use_vllm=True,
# )
cosyvoice = CosyVoice2(
    './pretrained_models/CosyVoice2-0.5B', 
    load_jit=True, 
    load_trt=True, 
    fp16=True, 
    use_vllm=True,
)

failed to import ttsfrd, use WeTextProcessing instead


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
  deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
2025-03-08 00:37:04,867 INFO input frame rate=25
2025-03-08 00:37:06,103 WETEXT INFO found existing fst: /opt/anaconda3/envs/cosyvoice/lib/python3.10/site-packages/tn/zh_tn_tagger.fst
2025-03-08 00:37:06,103 INFO found existing fst: /opt/anaconda3/envs/cosyvoice/lib/python3.10/site-packages/tn/zh_tn_tagger.fst
2025-03-08 00:37:06,104 WETEXT INFO                     /opt/anaconda3/envs/cosyvoice/lib/python3.10/site-packages/tn/zh_tn_verbalizer.fst
2025-03-08 00:37:06,104 INFO                     /opt/anaconda3/envs/cosyvoice/lib/python3.10/site-packages/tn/zh_tn_verbalizer.fst
2025-03-08 00:37:06,104 WETEXT INFO skip building fst for zh_normalizer ...
2025-03-08 00:37:06,104 INFO skip building fst for zh_normalizer ...
2025-03-08 00:37:06,313 WETEXT INFO found existing fst: /opt/anaconda3/envs/cosyvoice/lib/python3.1

INFO 03-08 00:37:07 __init__.py:207] Automatically detected platform cuda.
INFO 03-08 00:37:07 config.py:560] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
INFO 03-08 00:37:07 config.py:1624] Chunked prefill is enabled with max_num_batched_tokens=1024.
INFO 03-08 00:37:10 __init__.py:207] Automatically detected platform cuda.
INFO 03-08 00:37:11 core.py:50] Initializing a V1 LLM engine (v0.7.3.dev213+gede41bc7.d20250219) with config: model='./pretrained_models/CosyVoice2-0.5B', speculative_config=None, tokenizer='./pretrained_models/CosyVoice2-0.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_con

  return func(*args, **kwargs)
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.12it/s]



INFO 03-08 00:37:12 gpu_model_runner.py:1068] Loading model weights took 0.9532 GB and 1.023026 seconds
INFO 03-08 00:37:16 backends.py:408] Using cache directory: /home/qihua/.cache/vllm/torch_compile_cache/29f70599cb/rank_0 for vLLM's torch.compile
INFO 03-08 00:37:16 backends.py:418] Dynamo bytecode transform time: 3.62 s
INFO 03-08 00:37:16 backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 03-08 00:37:19 monitor.py:33] torch.compile takes 3.62 s in total
INFO 03-08 00:37:20 kv_cache_utils.py:524] GPU KV cache size: 216,560 tokens
INFO 03-08 00:37:20 kv_cache_utils.py:527] Maximum concurrency for 1,024 tokens per request: 211.48x


2025-03-08 00:37:30,767 DEBUG Using selector: EpollSelector


INFO 03-08 00:37:30 gpu_model_runner.py:1375] Graph capturing finished in 11 secs, took 0.37 GiB
INFO 03-08 00:37:30 core.py:116] init engine (profile, create kv cache, warmup model) took 17.82 seconds
inference_processor
[03/08/2025-00:37:31] [TRT] [I] Loaded engine size: 158 MiB
[03/08/2025-00:37:31] [TRT] [I] [MS] Running engine with multi stream info
[03/08/2025-00:37:31] [TRT] [I] [MS] Number of aux streams is 1
[03/08/2025-00:37:31] [TRT] [I] [MS] Number of total worker streams is 2
[03/08/2025-00:37:31] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[03/08/2025-00:37:32] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +4545, now: CPU 0, GPU 4681 (MiB)




In [16]:
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

  0%|          | 0/1 [00:00<?, ?it/s]2025-03-08 00:38:59,777 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-03-08 00:39:00,917 INFO yield speech len 11.68, rtf 0.09757431402598342
100%|██████████| 1/1 [00:01<00:00,  1.47s/it]


In [17]:
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=True)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

  0%|          | 0/1 [00:00<?, ?it/s]2025-03-08 00:39:01,208 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-03-08 00:39:01,587 INFO yield speech len 1.84, rtf 0.20591642545617145
2025-03-08 00:39:01,790 INFO yield speech len 2.0, rtf 0.10057318210601807
2025-03-08 00:39:02,116 INFO yield speech len 2.0, rtf 0.16271138191223145
2025-03-08 00:39:02,367 INFO yield speech len 2.0, rtf 0.1247786283493042
2025-03-08 00:39:02,640 INFO yield speech len 2.0, rtf 0.13561689853668213
2025-03-08 00:39:02,980 INFO yield speech len 1.88, rtf 0.1803158445561186
100%|██████████| 1/1 [00:02<00:00,  2.05s/it]
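
To compare streaming and non-streaming runs, the per-chunk log lines can be folded into one overall figure. The sketch below assumes that each logged `rtf` is the compute time for that chunk divided by the chunk's audio length, which is consistent with the timestamps in the log above (the first chunk, 1.84 s at rtf 0.206, arrives about 0.38 s after the `synthesis text` line). `summarize_rtf` is a hypothetical helper name.

```python
import re

# Matches lines such as: "... INFO yield speech len 1.84, rtf 0.2059..."
CHUNK_PATTERN = re.compile(r"yield speech len ([0-9.]+), rtf ([0-9.]+)")


def summarize_rtf(log_text):
    """Return (total_audio_seconds, total_compute_seconds, overall_rtf)."""
    chunks = [(float(a), float(b)) for a, b in CHUNK_PATTERN.findall(log_text)]
    if not chunks:
        raise ValueError("no 'yield speech len' lines found")
    total_audio = sum(length for length, _ in chunks)
    # Assumption: rtf = compute time / audio length, so compute time = length * rtf.
    total_compute = sum(length * rtf for length, rtf in chunks)
    return total_audio, total_compute, total_compute / total_audio


# Chunk lines copied from the streaming run above.
streaming_log = """\
2025-03-08 00:39:01,587 INFO yield speech len 1.84, rtf 0.20591642545617145
2025-03-08 00:39:01,790 INFO yield speech len 2.0, rtf 0.10057318210601807
2025-03-08 00:39:02,116 INFO yield speech len 2.0, rtf 0.16271138191223145
2025-03-08 00:39:02,367 INFO yield speech len 2.0, rtf 0.1247786283493042
2025-03-08 00:39:02,640 INFO yield speech len 2.0, rtf 0.13561689853668213
2025-03-08 00:39:02,980 INFO yield speech len 1.88, rtf 0.1803158445561186
"""

audio, compute, overall = summarize_rtf(streaming_log)
print(f"audio {audio:.2f}s, compute {compute:.2f}s, overall rtf {overall:.3f}")
```

Weighting each chunk by its length avoids the bias of a plain average of the rtf values, where the shorter first and last chunks would count as much as the 2-second ones.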


In [18]:
def text_generator():
    yield '收到好友从远方寄来的生日礼物，'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐，'
    yield '笑容如花儿般绽放。'

    
for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

2025-03-08 00:39:02,990 INFO get tts_text generator, will skip text_normalize!
  0%|          | 0/1 [00:00<?, ?it/s]2025-03-08 00:39:02,991 INFO get tts_text generator, will return _extract_text_token_generator!
2025-03-08 00:39:03,236 INFO synthesis text <generator object text_generator at 0x79c694dae340>
2025-03-08 00:39:03,237 INFO not enough text token to decode, wait for more
2025-03-08 00:39:03,252 INFO get fill token, need to append more text token
2025-03-08 00:39:03,253 INFO append 5 text token
2025-03-08 00:39:03,311 INFO get fill token, need to append more text token
2025-03-08 00:39:03,312 INFO append 5 text token
2025-03-08 00:39:03,456 INFO no more text token, decode until met eos
2025-03-08 00:39:04,861 INFO yield speech len 15.16, rtf 0.1072180145334128
100%|██████████| 1/1 [00:01<00:00,  1.88s/it]
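
The `text_generator` above yields hand-written fragments. When the text arrives as one string, a generator can be derived from it instead. This is a minimal sketch that splits on full-width Chinese punctuation; the punctuation set and the name `clause_generator` are assumptions for illustration, and the fragment boundaries it produces differ from the hand-written ones.

```python
import re

# Full-width clause-ending punctuation; this set is an assumption.
_PUNCT = "，。！？；"
# A run of text followed by punctuation, or a trailing run with no punctuation.
_CLAUSE = re.compile(rf"[^{_PUNCT}]*[{_PUNCT}]+|[^{_PUNCT}]+")


def clause_generator(text):
    """Yield `text` clause by clause, keeping trailing punctuation attached."""
    for match in _CLAUSE.finditer(text):
        clause = match.group()
        if clause.strip():  # skip whitespace-only fragments
            yield clause
```

`clause_generator(full_text)` can then be passed in place of `text_generator()`. As the log above notes, text normalization is skipped for generator input, so any normalization the text needs has to happen before it is split.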


In [19]:
def text_generator():
    yield '收到好友从远方寄来的生日礼物，'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐，'
    yield '笑容如花儿般绽放。'
for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), prompt_text, prompt_speech_16k, stream=True)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

2025-03-08 00:39:04,878 INFO get tts_text generator, will skip text_normalize!
  0%|          | 0/1 [00:00<?, ?it/s]2025-03-08 00:39:04,880 INFO get tts_text generator, will return _extract_text_token_generator!
2025-03-08 00:39:05,151 INFO synthesis text <generator object text_generator at 0x79c694dad690>
2025-03-08 00:39:05,152 INFO not enough text token to decode, wait for more
2025-03-08 00:39:05,169 INFO get fill token, need to append more text token
2025-03-08 00:39:05,169 INFO append 5 text token
2025-03-08 00:39:05,292 INFO get fill token, need to append more text token
2025-03-08 00:39:05,293 INFO append 5 text token
2025-03-08 00:39:05,438 INFO no more text token, decode until met eos
2025-03-08 00:39:05,638 INFO yield speech len 1.84, rtf 0.26492670826289966
2025-03-08 00:39:05,841 INFO yield speech len 2.0, rtf 0.10065567493438721
2025-03-08 00:39:06,164 INFO yield speech len 2.0, rtf 0.16065263748168945
2025-03-08 00:39:06,422 INFO yield speech len 2.0, rtf 0.1279166936874

In [20]:
# instruct usage
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('instruct2_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


  0%|          | 0/1 [00:00<?, ?it/s]2025-03-08 00:39:07,592 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-03-08 00:39:08,925 INFO yield speech len 11.24, rtf 0.11861237342671567
100%|██████████| 1/1 [00:01<00:00,  1.58s/it]
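
The note earlier in this document recommends at least 10 warm-up inference runs after enabling load_trt, preferably in streaming mode. A minimal sketch of such a loop follows; `warm_up` is a hypothetical helper, not part of CosyVoice. It accepts any zero-argument callable that returns an iterable of chunks, so it can be tried without a model loaded.

```python
import time


def warm_up(make_stream, runs=10):
    """Call `make_stream()` `runs` times, draining each result.

    Returns the wall-clock duration of every run in seconds.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        for _chunk in make_stream():
            pass  # generators are lazy, so each one must be consumed to run inference
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical usage with the objects defined in the cells above:
# warm_up(lambda: cosyvoice.inference_zero_shot(
#     '收到好友从远方寄来的生日礼物。', prompt_text, prompt_speech_16k, stream=True))
```

The returned durations give a rough indication of when run times have levelled off, which is a reasonable point to start recording measurements.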
