
text-token synchronization with audio production

#55
by Tetsuo-tek - opened

Hello. I need to detect the phonemes that get converted to audio during inference with XTTS_v2. I receive the PCM audio, but I would also like the phonemes together with the start and stop times of each one, similar to what Espeak provides. Is it possible to get this information?
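For reference, this is the kind of output I mean. A minimal sketch with espeak-ng (not XTTS code; it assumes the --pho option and an installed mbrola voice such as mb-it3, so adjust the voice name to whatever is available locally):

import subprocess

# Sketch only: with an mbrola voice, --pho emits one line per phoneme,
# giving the phoneme name and its duration in milliseconds.
pho = subprocess.run(
    ["espeak-ng", "-q", "-v", "mb-it3", "--pho", "ciao mondo"],
    capture_output=True, text=True, check=True,
).stdout
print(pho)  # e.g. a line like "tS 98" (phoneme "tS", 98 ms)

I would like the equivalent information from XTTS_v2 itself.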

My code (from examples):

print("Loading config ..")

config = XttsConfig()
config.load_json("config.json")

print("Loading model ..")

model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/home/roko/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2/")
model.cuda()

print("Computing speaker latents ..")

gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
audio_path=["voices/output-Gitta Nikolina.wav", "voices/output-Rosemary Okafor.wav", "voices/output-Maja Ruoho.wav"],
max_ref_length=30,
gpt_cond_len=6,
gpt_cond_chunk_len=6,
librosa_trim_db=None,
sound_norm_refs=False,
load_sr=22050
)

txt = """
.....
"""

print("Inference...")

t0 = time.time()

chunks = model.inference_stream(
    text=txt,
    language="it",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20,
    overlap_wav_len=1024,
    enable_text_splitting=True,
    temperature=0.7,
    top_k=50,
    top_p=0.85,
    length_penalty=1.0,
    repetition_penalty=10.0,
    do_sample=True,
    speed=1.0
)

wav_chunks = []

for i, chunk in enumerate(chunks):
    if i == 0:
        print(f"Time to first chunk: {time.time() - t0}")
    print(f"Received chunk {i} of audio length {chunk.shape[-1]}")
    wav_chunks.append(chunk)

wav = torch.cat(wav_chunks, dim=0)
torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 22050)

Thanks.

Tetsuo-tek changed discussion status to closed
