Text-to-Speech · Transformers · tts · voice · wip

VisoLearn sambal committed on
Commit 0ed9c44 · verified · 0 Parent(s):

Duplicate from MBZUAI/LLMVoX

Co-authored-by: shikhar <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,136 @@
+ ---
+ license: cc-by-nc-sa-4.0
+ pipeline_tag: text-to-speech
+ ---
+
+ This repository contains the model as described in [LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM](https://hf.co/papers/2503.04724).
+
+ For more information, check out the project page at https://mbzuai-oryx.github.io/LLMVoX/ and the code at https://github.com/mbzuai-oryx/LLMVoX.
+
+ # LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
+
+ <div>
+ <a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
+ <a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a>
+ <a href="https://github.com/mbzuai-oryx/LLMVoX/"><img src="https://img.shields.io/badge/GitHub-LLMVoX-black?logo=github" alt="GitHub Repository"></a>
+ <a href="https://github.com/mbzuai-oryx/LLMVoX/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
+ </div>
+
+ **Authors:**
+ **[Sambal Shikhar](https://github.com/mbzuai-oryx/LLMVoX?tab=readme-ov-file)**, **[Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en)**, **[Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en)**, **[Fahad Khan](https://sites.google.com/view/fahadkhans/home)**, **[Jean Lahoud](https://scholar.google.com/citations?user=LsivLPoAAAAJ&hl=en)**, **[Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ)**, **[Salman Khan](https://salman-h-khan.github.io/)**, **[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)**
+
+ **Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE**
+
+ <p align="center">
+ <img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px">
+ </p>
+
+ <video src="https://github.com/user-attachments/assets/6d305563-3c62-4f14-a8aa-acedf2143f76" width="500" controls></video>
+
+ ## Overview
+
+ LLMVoX is a lightweight, 30M-parameter, LLM-agnostic, autoregressive streaming text-to-speech (TTS) system that converts the text output of large language models into high-fidelity streaming speech at low latency.
+
+ Key features:
+ - 🚀 **Lightweight & Fast**: Only 30M parameters, with end-to-end latency as low as 300 ms
+ - 🔌 **LLM-Agnostic**: Works with any LLM or vision-language model without fine-tuning
+ - 🌊 **Multi-Queue Streaming**: Enables continuous, low-latency speech generation (see the sketch after this list)
+ - 🌐 **Multilingual Support**: Adaptable to new languages through dataset adaptation
+
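+ The multi-queue design decouples text generation from speech synthesis: the LLM streams text into one queue while the TTS model drains it and streams audio chunks into a second queue for playback, so audio can start before the LLM has finished. The sketch below is a hypothetical illustration of that producer-consumer pattern, not the LLMVoX implementation; every name in it is made up.
+
+ ```python
+ # Hypothetical sketch of multi-queue streaming (NOT the LLMVoX source code):
+ # one thread produces text chunks, a second turns them into audio chunks,
+ # and a third plays audio as soon as it arrives.
+ import queue
+ import threading
+
+ text_q: queue.Queue = queue.Queue()
+ audio_q: queue.Queue = queue.Queue()
+
+ def llm_producer() -> None:
+     for token in ["Hello", " from", " a", " streaming", " LLM."]:  # stand-in tokens
+         text_q.put(token)
+     text_q.put(None)  # end-of-stream sentinel
+
+ def tts_worker() -> None:
+     while (chunk := text_q.get()) is not None:
+         audio_q.put(chunk.encode())  # stand-in for synthesized audio bytes
+     audio_q.put(None)
+
+ def player() -> None:
+     while (samples := audio_q.get()) is not None:
+         print(f"playing {len(samples)} bytes")  # hand samples to an audio device here
+
+ threads = [threading.Thread(target=fn) for fn in (llm_producer, tts_worker, player)]
+ for t in threads:
+     t.start()
+ for t in threads:
+     t.join()
+ ```
+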
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ # Requirements: CUDA 11.7+, Flash Attention 2.0+ compatible GPU
+
+ git clone https://github.com/mbzuai-oryx/LLMVoX.git
+ cd LLMVoX
+
+ conda create -n llmvox python=3.9
+ conda activate llmvox
+
+ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+ pip install flash-attn --no-build-isolation
+ pip install -r requirements.txt
+
+ # Download checkpoints from Hugging Face
+ # https://huggingface.co/MBZUAI/LLMVoX/tree/main
+ mkdir -p CHECKPOINTS
+ # Download wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt
+ ```
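+
+ A quick import check can confirm that PyTorch sees the GPU and that the flash-attn build matches your CUDA setup (a minimal sketch, assuming the `llmvox` environment above is active):
+
+ ```python
+ # Post-install sanity check; run inside the llmvox conda environment.
+ import torch
+ import flash_attn  # raises ImportError if the flash-attn build failed
+
+ print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
+ ```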
+
+ ### Voice Chat
+
+ ```bash
+ # Basic usage
+ python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"
+
+ # With multiple GPUs
+ python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
+     --llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2
+
+ # Balance latency/quality
+ python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
+     --initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
+ ```
+
+ The dump-size flags trade latency against quality: smaller initial dump sizes start audio sooner, while a larger maximum dump size lets later chunks grow for smoother speech.
+
+ ### Text Chat & Visual Speech
+
+ ```bash
+ # Text-to-Speech
+ python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"
+
+ # Visual Speech (Speech + Image → Speech)
+ python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" \
+     --eos_token "<|im_end|>"
+
+ # Multimodal (support for models like Phi-4)
+ python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" \
+     --eos_token "<|end|>"
+ ```
+
+ ## API Reference
+
+ | Endpoint | Purpose | Required Parameters |
+ |----------|---------|---------------------|
+ | `/tts` | Text-to-speech | `text`: String to convert |
+ | `/voicechat` | Voice conversations | `audio_base64`, `source_language`, `target_language` |
+ | `/multimodalchat` | Voice + multiple images | `audio_base64`, `image_list` |
+ | `/vlmschat` | Voice + single image | `audio_base64`, `image_base64`, `source_language`, `target_language` |
+
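+ The exact request and response schemas are defined in `streaming_server.py`; the snippet below is a hedged client sketch that assumes the server runs on localhost with `--api_port 5000`, accepts a JSON body, and returns audio bytes in the response. All of those details are assumptions to verify against the server code.
+
+ ```python
+ # Hypothetical /tts client sketch; the port, payload shape, and response
+ # format are assumptions to confirm against streaming_server.py.
+ import requests
+
+ resp = requests.post(
+     "http://localhost:5000/tts",           # assumed host and --api_port value
+     json={"text": "Hello from LLMVoX!"},   # field name from the table above
+     timeout=120,
+ )
+ resp.raise_for_status()
+ with open("reply.wav", "wb") as f:         # assumes raw audio bytes in the body
+     f.write(resp.content)
+ ```
+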
+ ## Local UI Demo
+
+ <p align="center">
+ <img src="assets/ui.png" alt="Demo UI" width="800px">
+ </p>
+
+ ```bash
+ # Start server
+ python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port PORT
+
+ # Launch UI
+ python run_ui.py --ip STREAMING_SERVER_IP --port PORT
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @article{shikhar2025llmvox,
+   title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM},
+   author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham},
+   journal={arXiv preprint arXiv:2503.04724},
+   year={2025}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - [Andrej Karpathy's NanoGPT](https://github.com/karpathy/nanoGPT)
+ - [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
+ - [Whisper](https://github.com/openai/whisper)
+ - [Neural G2P](https://github.com/lingjzhu/CharsiuG2P)
+
+ ## License
+
+ This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
assets/arch_diagram.svg ADDED
assets/ui.png ADDED
ckpt_english_tiny.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:568346240c3d240f7b70ab32a7ad4d7247a8f0fb704d1b5cbc66a77370b0939a
+ size 453105258
config.json ADDED
File without changes
wavtokenizer_large_speech_320_24k.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7450020c154f6aba033cb8651466cb79cb1b1cdd10ea64eaba68e7871cabcc5a
+ size 1754880958