|
# llama.cpp/example/tts |
|
This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).
|
|
|
## Quickstart |
|
If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the |
|
following command and the required models will be downloaded automatically: |
|
```console |
|
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav |
|
``` |
|
For details about the models and how to convert them to the required format,
see the following sections.
|
|
|
### Model conversion |
|
Check out or download the repository that contains the LLM model:
|
```console |
|
$ pushd models |
|
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M |
|
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull |
|
$ popd |
|
``` |
|
Convert the model to .gguf format: |
|
```console |
|
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \ |
|
--outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16 |
|
``` |
|
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`. |
|
|
|
We can optionally quantize this to Q8_0 using the following command: |
|
```console |
|
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \ |
|
models/outetts-0.2-0.5B-q8_0.gguf q8_0 |
|
``` |
|
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`. |
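
To quickly verify that a converted or quantized file is readable, one option is
the `gguf` Python package that ships with llama.cpp (installable with
`pip install gguf`). This is an optional sanity check, not part of the
conversion itself:
```python
# Optional sanity check: open the quantized model and print a few details.
# Assumes the gguf Python package is installed (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("models/outetts-0.2-0.5B-q8_0.gguf")
print(f"{len(reader.tensors)} tensors")
for key in list(reader.fields)[:5]:  # show the first few metadata keys
    print(key)
```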
|
|
|
Next we do something similar for the audio decoder. First download or check out
the model for the voice decoder:
|
```console |
|
$ pushd models |
|
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token |
|
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull |
|
$ popd |
|
``` |
|
This model file is a PyTorch checkpoint (.ckpt) and we first need to convert it
to huggingface format:
|
```console |
|
(venv) python examples/tts/convert_pt_to_hf.py \ |
|
models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt |
|
... |
|
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors |
|
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json |
|
Config has been saved to models/WavTokenizer-large-speech-75token/config.json
|
``` |
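
Before converting further, it can be worth checking that the conversion
produced a loadable safetensors file. A minimal sketch, assuming the
`safetensors` package is available in the virtual environment (torch is
already required by the checkpoint conversion above):
```python
# Optional sanity check: open the converted file and list a few tensor names.
from safetensors import safe_open

path = "models/WavTokenizer-large-speech-75token/model.safetensors"
with safe_open(path, framework="pt") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors, e.g. {keys[:3]}")
```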
|
Then we can convert the huggingface format to gguf: |
|
```console |
|
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \ |
|
--outfile models/wavtokenizer-large-75-f16.gguf --outtype f16 |
|
... |
|
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf |
|
``` |
|
|
|
### Running the example |
|
|
|
With both of the models generated, the LLM model and the voice decoder model, |
|
we can run the example: |
|
```console |
|
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \ |
|
-mv ./models/wavtokenizer-large-75-f16.gguf \ |
|
-p "Hello world" |
|
... |
|
main: audio written to file 'output.wav' |
|
``` |
|
The output.wav file contains the audio of the spoken prompt and can be played
with any media player. On Linux the following command will play it:
|
```console |
|
$ aplay output.wav |
|
``` |
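
To inspect the generated file programmatically, for example to check the
sample rate and duration, Python's standard `wave` module is sufficient:
```python
# Print basic properties of the generated WAV file.
import wave

with wave.open("output.wav", "rb") as f:
    frames = f.getnframes()
    rate = f.getframerate()
    print(f"{frames} samples at {rate} Hz ({frames / rate:.2f} s)")
```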
|
|
|
### Running the example with llama-server |
|
Running this example with `llama-server` is also possible and requires two |
|
server instances to be started. One will serve the LLM model and the other |
|
will serve the voice decoder model. |
|
|
|
The LLM model server can be started with the following command: |
|
```console |
|
$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020 |
|
``` |
|
|
|
And the voice decoder model server can be started using: |
|
```console |
|
$ ./build/bin/llama-server -m ./models/wavtokenizer-large-75-f16.gguf --port 8021 --embeddings --pooling none
|
``` |
|
|
|
Then we can run [tts-outetts.py](tts-outetts.py) to generate the audio. |
|
|
|
First create a virtual environment for Python and install the required
dependencies (this is only required once):
|
```console |
|
$ python3 -m venv venv |
|
$ source venv/bin/activate |
|
(venv) pip install requests numpy |
|
``` |
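
Before running the script it can be useful to confirm that both servers are
up, for example via the `/health` endpoint that `llama-server` exposes. A
small sketch using the ports from the commands above:
```python
# Check that both llama-server instances respond before generating audio.
import requests

for name, url in [("LLM", "http://localhost:8020"),
                  ("voice decoder", "http://localhost:8021")]:
    status = requests.get(f"{url}/health").json().get("status")
    print(f"{name} server: {status}")
```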
|
|
|
And then run the python script using: |
|
```console
|
(venv) python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world" |
|
spectrogram generated: n_codes: 90, n_embd: 1282 |
|
converting to audio ... |
|
audio generated: 28800 samples |
|
audio written to file "output.wav" |
|
``` |
|
And to play the audio we can again use aplay or any other media player: |
|
```console |
|
$ aplay output.wav |
|
``` |
|
|