The SOTA text-to-speech and zero-shot voice cloning model that no one knows about...
Quick Links:
- Spaces DEMO : https://huggingface.co/spaces/srinivasbilla/llasa-3b-tts
- Model : https://huggingface.co/HKUST-Audio/Llasa-3B
- Github : https://github.com/zhenye234/LLaSA_training
Hello everyone, I've been having a lot of fun lately playing around with Llasa (https://huggingface.co/HKUST-Audio/Llasa-3B), an open-source Llama 3 3B finetune that acts as a text-to-speech model. Not only does it do incredibly realistic text-to-speech, it can also clone any voice with only a couple of seconds of sample audio.
It's so good that I had to sign up for Hugging Face Pro, get ZeroGPU access, and write a blog post to show it off to the community. The authors note that their paper is coming soon, but that didn't stop me from tinkering and figuring out how to use this model.
Voice Cloning
This is a Llama 3.2 3B finetune/continued pretrain that adapts the model to generate speech tokens without any change to the model architecture. The only addition is the audio tokenizer, xcodec2.
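Concretely, the text is wrapped in text-understanding tokens and the speech comes out as a stream of <|s_...|> tokens inside an ordinary Llama chat template. Here is a rough sketch of what one example looks like; the special-token names are taken from the inference code later in this post, while the speech ids themselves are made-up placeholders:

# Hypothetical illustration of the prompt layout (speech ids below are placeholders, not real codes)
text_part = "<|TEXT_UNDERSTANDING_START|>Hello there!<|TEXT_UNDERSTANDING_END|>"
speech_part = "<|SPEECH_GENERATION_START|>" + "<|s_1234|><|s_567|><|s_8910|>" + "<|SPEECH_GENERATION_END|>"

chat = [
    {"role": "user", "content": "Convert the text to speech:" + text_part},
    {"role": "assistant", "content": speech_part},
]
# Each <|s_...|> token is one xcodec2 code, so longer audio simply means more of these tokens.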
Before I ramble about all the cool things I discovered it can do, I set up a Space for people to try it here, and here are some in-the-wild voice clones I made (these are not real people; I used sample audio from ElevenLabs voices):
Alex
Reference: Let me know in the comment section below. This is the COD Archive, and I'll see you tomorrow. Take care.
Clone: Hey guys, what's up? Alex here, back at it again with another video. Today we will be learning how to clone voices with a state-of-the-art text-to-speech model. Exciting, right? Let's dive right in.
Amelia
Reference: Hi! I'm Amelia, a super high quality English voice. I love to read. Seriously, I'm a total bookworm. So what are you waiting for? Get me reading!
Clone: All you need is a short clean audio sample of just 5 to 10 seconds. Then the model can generate a high quality speech sample mimicking the voice, tone and style of speech and even accent.
Russel
Reference: it is not enough to have a good mind the main thing is to use it well
Clone: The model was trained on 250,000 hours of audio tokenized by xcodec2, which converts audio to tokens at a very efficient 50 tokens per second.
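As an aside, that 50 tokens per second rate tells you roughly how much speech fits into a single generation. A quick back-of-the-envelope check (2048 is just the max_tokens budget I use in the vLLM code further down):

TOKENS_PER_SECOND = 50       # xcodec2 emits 50 codes per second of audio
max_new_tokens = 2048        # generation budget used in the vLLM example below
print(max_new_tokens / TOKENS_PER_SECOND)  # ~41 seconds of speech per call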
Varying style of speech
Whisper
The given sample audio is very important: it dictates what the rest of the generated audio sounds like. So whispers in equals whispers out.
Emotions
Confusion: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. I'm just so confused.
Anger: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. I'm just so annoyed.
Laughing: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. It's actually quite funny.
Optimus Prime?
This is an example where the model struggles a lot. It can't quite capture how Peter Cullen voices Optimus Prime.
8B?
The authors have an 8B model space which is currently empty. It would be interesting to see how good that is, given that the 3B is already so good for most voices. Also, does LoRA finetuning work? Can we merge and mix voices? There is so much to tinker with, and I can't wait for the official paper to come out.
Hope you enjoyed my first blog post/ramble.
P.S. I love that it's basically just a Llama model in disguise.
As I mentioned earlier, the only addition is the xcodec2 audio tokenizer model; everything else is just Llama 3 inference with the correct prompt templating and tokenisation. See my ZeroGPU Space's app.py file for inference code using HF Transformers. But since it's just a Llama 3 model, there's nothing stopping us from using a more optimised inference library like vLLM, like this:
Note: I cloned the repos into my profile since they are gated, so it's easier to run as a demo...
from transformers import pipeline, AutoTokenizer
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model
from IPython import display
import torchaudio
from vllm import LLM, SamplingParams
llm = LLM(model="srinivasbilla/llasa-3b", gpu_memory_utilization=0.5, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained('srinivasbilla/llasa-3b')
model_path = "srinivasbilla/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()
whisper_turbo_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo", device='cuda')
sampling_params = SamplingParams(temperature=0.8, top_p=1, max_tokens=2048, stop=['<|SPEECH_GENERATION_END|>'], stop_token_ids=[128261])
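# The stop string and stop token id above both mark the end of speech generation,
# so vLLM halts as soon as the model emits that end-of-speech marker.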
def ids_to_speech_tokens(speech_ids):
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str
def extract_speech_ids(speech_tokens_str):
    # Strip the leading '<|' and trailing '|>', then split on '|><|' to recover the integer codes
    return [int(x.replace('s_', '')) for x in speech_tokens_str[2:-2].split('|><|')]
def text_to_speech(sample_audio_path, target_text, sampling_params, prompt_text=None):
    waveform, sample_rate = torchaudio.load(sample_audio_path)

    # Check if the audio is stereo (i.e., has more than one channel)
    if waveform.size(0) > 1:
        # Convert stereo to mono by averaging the channels
        waveform_mono = torch.mean(waveform, dim=0, keepdim=True)
    else:
        # If already mono, just use the original waveform
        waveform_mono = waveform

    # Only 16 kHz speech is supported!
    waveform_16k = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform_mono)
    torchaudio.save('/local_disk0/input.wav', waveform_16k, 16000)

    prompt_wav, sr = sf.read("/local_disk0/input.wav")  # English prompt
    prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

    # If no transcript is supplied, transcribe the prompt audio with Whisper
    if prompt_text is None:
        prompt_text = whisper_turbo_pipe('/local_disk0/input.wav')['text'].strip()
    print(prompt_text)

    input_text = prompt_text + ' ' + target_text

    # TTS start!
    with torch.no_grad():
        # Encode the prompt wav
        vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
        vq_code_prompt = vq_code_prompt[0, 0, :]

        # Convert int 12345 to token <|s_12345|>
        speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

        formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

        # Tokenize the text and the speech prefix
        chat = [
            {"role": "user", "content": "Convert the text to speech:" + formatted_text},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
        ]

        input_ids = tokenizer.apply_chat_template(
            chat,
            tokenize=False,
            continue_final_message=True
        )

        outputs = llm.generate([input_ids], sampling_params)
        generated_text = outputs[0].outputs[0].text

        # Map the generated <|s_...|> tokens back to integer codes and decode them to audio
        speech_tokens = extract_speech_ids(generated_text)
        speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
        gen_wav = Codec_model.decode_code(speech_tokens)

    return gen_wav[0, 0, :].cpu().numpy()
And then run inference:
target_text = """The model was trained on a 160 *thousand* hours of audio tokenized by X codec 2. Which converts audio to tokens at a very efficient 50 tokens per second."""
audio_out = text_to_speech(
"./sample_voices/voice_preview_neal.mp3",
target_text,
sampling_params=sampling_params,
prompt_text="it is not enough to have a good mind the main thing is to use it well",
)
display.Audio(audio_out, rate=16000)
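If you're not in a notebook, you can write the generated waveform straight to disk with soundfile instead (it's already imported above); the codec operates at 16 kHz, so keep that sample rate:

# Save the cloned speech to a wav file (16 kHz, mono)
sf.write("cloned_voice.wav", audio_out, samplerate=16000)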