text-token synchronization with audio production
Hello. During inference with XTTS_v2, I need to detect which phonemes were converted to audio. I get the PCM audio, but I would also like the phonemes together with the start and stop times of each, similar to what eSpeak provides. Is it possible to get this information?
My code (adapted from the examples):
import time

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading config...")
config = XttsConfig()
config.load_json("config.json")
print("Loading model...")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/home/roko/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2/")
model.cuda()
print("Computing speaker latents ..")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["voices/output-Gitta Nikolina.wav", "voices/output-Rosemary Okafor.wav", "voices/output-Maja Ruoho.wav"],
    max_ref_length=30,
    gpt_cond_len=6,
    gpt_cond_chunk_len=6,
    librosa_trim_db=None,
    sound_norm_refs=False,
    load_sr=22050,
)
txt = """
.....
"""
print("Inference...")
t0 = time.time()
chunks = model.inference_stream(
    text=txt,
    language="it",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20,
    overlap_wav_len=1024,
    enable_text_splitting=True,
    temperature=0.7,
    top_k=50,
    top_p=0.85,
    length_penalty=1.0,
    repetition_penalty=10.0,
    do_sample=True,
    speed=1.0,
)
wav_chunks = []
for i, chunk in enumerate(chunks):
    if i == 0:
        print(f"Time to first chunk: {time.time() - t0}")
    print(f"Received chunk {i} of audio length {chunk.shape[-1]}")
    wav_chunks.append(chunk)
wav = torch.cat(wav_chunks, dim=0)
torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 22050)
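For context on what I can already compute: from the per-chunk sample counts printed in the loop, I can derive chunk-level start/stop times, but not phoneme-level ones. A minimal sketch of that derivation (assuming the 22050 Hz rate passed to torchaudio.save, and ignoring the overlap_wav_len crossfade; `chunk_timestamps` is my own helper, not a TTS API):

```python
# Hypothetical helper: turn a list of per-chunk sample counts into
# (start, stop) times in seconds by accumulating a sample cursor.
def chunk_timestamps(chunk_lengths, sample_rate=22050):
    """Return (start, stop) seconds for each streamed chunk."""
    times = []
    cursor = 0
    for n in chunk_lengths:
        times.append((cursor / sample_rate, (cursor + n) / sample_rate))
        cursor += n
    return times

# Example: three chunks of 11025 samples each at 22050 Hz -> 0.5 s apiece.
print(chunk_timestamps([11025, 11025, 11025]))
```

This gives coarse timing only; what I am asking for is the equivalent mapping at the phoneme level, produced by the model itself.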
Thanks.