In the name of God, the Most Gracious, the Most Merciful - it is the key to the door of the Sage's treasure
Matcha-TTS for the Persian language
This recipe is for training Persian/English TTS models (the middle part of a TTS system, which converts IPA phonemes to mel-spectrograms).
The main repo is here.
To do this, you will probably need a PyTorch-compatible graphics card with 8 GB of VRAM for 12 hours or more (or use Google Colab, a Kaggle notebook, ...).
Remember that a classic TTS system consists of these parts:
- Vowelizer: Converts text to IPA (International Phonetic Alphabet). Most reading errors (WER) originate in this part. The eSpeak library, via espeak-ng or piper_phonemize, is usually used for this part.
- TTS Model: Converts IPA to mel-spectrograms.
- Vocoder: Converts mel-spectrograms to sound. HiFi-GAN is usually used and gives natural results.
Setup environment
sudo apt-get install python3.10-venv
python3.10 -m venv matcha-tts-env
source matcha-tts-env/bin/activate
Install requirements
git clone git@github.com:shivammehta25/Matcha-TTS.git --depth 1
cd Matcha-TTS
pip install -e .
Prepare the dataset
The dataset should be structured something like the example here.
Split metadata.csv (the dataset text in LJ Speech format) into train.txt, val.txt and test.txt using split_metadata_csv.py:
import random
# File paths
input_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/metadata.csv"
wav_folder = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/wav"
train_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/train.txt"
validation_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/val.txt"
test_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/test.txt"
# Read the file as raw text
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()
# Transform the format
transformed_lines = []
for line in lines:
    if not line.strip():  # Skip blank lines, which have no "|" to split on
        continue
    file_id, text = line.strip().split("|", 1)  # Split on the first "|"
    transformed_line = f"{wav_folder}/{file_id}.wav|{text}"
    transformed_lines.append(transformed_line)
# Shuffle the data
random.shuffle(transformed_lines)
# Calculate split sizes
total_lines = len(transformed_lines)
train_size = int(0.95 * total_lines)
validation_size = int(0.045 * total_lines)
test_size = total_lines - train_size - validation_size
# Split the data
train_data = transformed_lines[:train_size]
validation_data = transformed_lines[train_size:train_size + validation_size]
test_data = transformed_lines[train_size + validation_size:]
# Save to files
with open(train_file, "w", encoding="utf-8") as f:
    f.write("\n".join(train_data))
with open(validation_file, "w", encoding="utf-8") as f:
    f.write("\n".join(validation_data))
with open(test_file, "w", encoding="utf-8") as f:
    f.write("\n".join(test_data))
print("Data split and saved successfully!")
print(f"Train: {len(train_data)} lines")
print(f"Validation: {len(validation_data)} lines")
print(f"Test: {len(test_data)} lines")
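The script above shuffles without a seed, so every run produces a different split. Below is a minimal sketch of a repeatable variant that works on in-memory lines. The function name `split_filelist`, the `seed` parameter, and the sample data are illustrative assumptions and are not part of Matcha-TTS.

```python
import random


def split_filelist(lines, wav_folder, train_frac=0.95, val_frac=0.045, seed=1234):
    """Turn 'id|text' rows into 'wav_path|text' rows and split them.

    Hypothetical helper: the name, the seed, and the default fractions
    are assumptions chosen to mirror the script above.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines instead of failing on split
            continue
        file_id, text = line.split("|", 1)  # split on the first "|" only
        rows.append(f"{wav_folder}/{file_id}.wav|{text}")

    # A private, seeded generator makes the shuffle repeatable.
    random.Random(seed).shuffle(rows)

    n_train = int(train_frac * len(rows))
    n_val = int(val_frac * len(rows))
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # whatever is left over
    return train, val, test


# Made-up sample data: 200 utterances plus one blank line.
sample = [f"utt{i:03d}|text {i}" for i in range(200)] + ["\n"]
train, val, test = split_filelist(sample, "/data/wav")
print(len(train), len(val), len(test))
```

With a fixed seed, rerunning the split after a crash or on another machine gives the same train/validation/test membership, which keeps validation losses comparable between runs.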
Initiate configuration files
Copy configs/data/ljspeech.yaml to configs/data/custom.yaml and edit it.
Copy configs/experiment/ljspeech.yaml to configs/experiment/custom.yaml and edit it.
Inside configs/data/custom.yaml, change:
train_filelist_path: /home/oem/Basir/TTS/Datasets/Phone-Online/Male/train.txt
valid_filelist_path: /home/oem/Basir/TTS/Datasets/Phone-Online/Male/val.txt
Generate normalisation statistics using the dataset configuration YAML file:
./matcha-tts-env/bin/matcha-data-stats -i custom.yaml -f
Output:
{'mel_mean': -7.081411, 'mel_std': 3.500973}
Update these values in configs/data/custom.yaml under the data_statistics key.
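These two numbers are used to standardise mel values before training. Here is a minimal sketch of the idea, assuming the usual (x - mel_mean) / mel_std form and a population standard deviation. The helper names are hypothetical, and the tiny `mel` list only stands in for real log-mel values.

```python
import math


def mel_statistics(frames):
    """Global mean and population standard deviation over all mel values."""
    values = [v for frame in frames for v in frame]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def normalize(frames, mel_mean, mel_std):
    """Standardise every value: subtract the mean, divide by the std."""
    return [[(v - mel_mean) / mel_std for v in frame] for frame in frames]


# Made-up stand-in for a log-mel spectrogram: 2 frames x 2 mel bins.
mel = [[-11.5, -7.0], [-4.0, -6.5]]
mel_mean, mel_std = mel_statistics(mel)
normalized = normalize(mel, mel_mean, mel_std)
print(mel_mean, mel_std)
print(normalized)
```

Since the statistics come from your own dataset, values copied from another dataset (such as the LJ Speech defaults) would shift and scale the model inputs incorrectly.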
Note: With 12 kHz audio, you get a warning suggesting that n_mels be reduced. However, HiFi-GAN raises an error when you try to train a vocoder with a reduced n_mels, so leave n_mels unchanged.
Manage vram usage
For minimal (8 GB) memory, reduce batch_size in configs/data/custom.yaml: batch_size: 14
(Not needed for me with 8 GB of VRAM): to reduce memory use further, add the following to configs/experiment/custom.yaml:
model:
  out_size: 172
Set initial checkpoint
Set the initial checkpoint (or null) via ckpt_path in configs/train.yaml.
Changes needed in the code for Persian
In matcha/text/cleaners.py, in the phonemizer.backend.EspeakBackend call, set language="fa".
Run:
pip install piper-phonemize
- In cleaners.py:
Add the following below english_cleaners_piper:
import piper_phonemize
def persian_cleaners_piper(text):
    """Pipeline for Persian text, including abbreviation expansion. + punctuation + stress"""
    # text = convert_to_ascii(text)
    text = lowercase(text)
    text = expand_abbreviations(text)
    phonemes = "".join(piper_phonemize.phonemize_espeak(text=text, voice="fa")[0])
    phonemes = collapse_whitespace(phonemes)
    # Remove unwanted symbols (e.g., '1')
    unwanted_symbols = {'1', '-'}  # Add any other unwanted symbols here
    filtered_phonemes = "".join([char for char in phonemes if char not in unwanted_symbols])
    return filtered_phonemes
- Also set cleaner in configs/data/custom.yaml:
cleaners: [persian_cleaners_piper]
- Replace symbols.py by:
def read_tokens():
    tokens = []
    with open("/home/oem/Basir/TTS/Matcha/Matcha-TTS/configs/tokens/tokens_sherpa_with_fa.txt", "r", encoding="utf-8") as f:
        for line in f:
            # Remove the newline character at the end
            line = line.rstrip("\n")
            # Split into token and number, preserving whitespace
            if " " in line:
                token = line[:line.index(" ")]  # Extract everything before the first space
                if len(token) == 0:  # White-space
                    token = ' '
            else:
                token = line  # If there's no space, the entire line is the token
            tokens.append(token)
    return tokens

symbols = read_tokens()
Change the tokens_sherpa_with_fa.txt path to point to your own file.
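read_tokens expects one token per line, followed by a space and a number, with the space character itself stored as a line that begins with a space. Below is a small sketch of that parsing rule applied to an in-memory sample. The sample content and the names `parse_tokens` and `symbol_to_id` are made up for illustration, and your real tokens file will differ.

```python
import io


def parse_tokens(stream):
    """Same rule as read_tokens above, but reading from any text stream."""
    tokens = []
    for line in stream:
        line = line.rstrip("\n")
        if " " in line:
            token = line[:line.index(" ")]  # text before the first space
            if token == "":  # the line started with a space: the token is " "
                token = " "
        else:
            token = line  # no number on this line; the whole line is the token
        tokens.append(token)
    return tokens


# Hypothetical four-line tokens file; the second line encodes the space token.
sample_file = io.StringIO("_ 0\n  1\na 2\nɒ 3\n")
symbols = parse_tokens(sample_file)

# The number after each token is never read, so list position decides the id
# (assuming the ids are assigned by enumerating `symbols`).
symbol_to_id = {symbol: index for index, symbol in enumerate(symbols)}
print(symbols)
print(symbol_to_id)
```

Because the number is ignored, a tokens file whose lines are not in id order would silently give different ids than the ones written in the file, so it is worth checking the order before training.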
- In matcha/cli.py change this line to:
intersperse(text_to_sequence(text, ["persian_cleaners_piper"])[0], 0),
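The second argument of intersperse, 0, is the id that gets placed between (and around) the phoneme ids. The following is a stand-alone sketch of that behaviour, written from scratch for illustration. Matcha's own helper may differ in detail, and the ids in the example are arbitrary.

```python
def intersperse(sequence, item):
    """Return a new list with `item` before, between, and after every element."""
    result = [item] * (len(sequence) * 2 + 1)  # start with all separators
    result[1::2] = sequence  # drop the real ids into the odd positions
    return result


phoneme_ids = [12, 7, 31]  # arbitrary example ids
print(intersperse(phoneme_ids, 0))  # [0, 12, 0, 7, 0, 31, 0]
```

The model therefore sees a sequence of length 2n + 1 for n phonemes, which is why the cleaner used at inference time has to produce the same symbols as the one used in training.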
Other changes
- To avoid possible errors (due to Python updates), change save_figure_to_numpy in matcha/utils.py to:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import io
def save_figure_to_numpy(fig):
    buf = io.BytesIO()
    fig.savefig(buf, format='png', bbox_inches='tight', pad_inches=0)
    buf.seek(0)
    img = Image.open(buf)
    data = np.array(img)
    buf.close()
    return data
- To test custom vocoders from the command line, and to test models trained at sample rates other than 22050 Hz, I have made further changes to cli.py.
You can find the modified file in the attached files.
Train!
Run the training script:
python matcha/train.py experiment=custom
Monitor using TensorBoard
In another terminal window, run:
source matcha-tts-env/bin/activate
cd /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/tensorboard/version_0/
tensorboard --logdir=. --bind_all --port=6007
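Run folders are named with a timestamp such as 2025-02-07_10-13-16, which sorts chronologically as plain text. Below is a small sketch for locating the newest run so the path does not have to be typed by hand. `latest_run` is a hypothetical helper, and the demo uses a temporary folder with made-up run names rather than a real logs directory.

```python
import tempfile
from pathlib import Path


def latest_run(runs_dir):
    """Return the sub-folder whose timestamp name sorts last, or None."""
    runs = sorted(
        (p for p in Path(runs_dir).iterdir() if p.is_dir()),
        key=lambda p: p.name,  # YYYY-MM-DD_HH-MM-SS sorts in time order
    )
    return runs[-1] if runs else None


# Demo on a throwaway directory with made-up run names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2025-02-06_21-40-02", "2025-02-07_10-13-16", "2025-01-30_08-00-00"):
        (Path(tmp) / name).mkdir()
    newest_name = latest_run(tmp).name

print(newest_name)
```

On a real setup you would pass your own logs/train/custom/runs folder and append tensorboard/version_0 to the result before starting TensorBoard.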
These commands might also be useful; run each one in a separate window:
watch -n 1 nvidia-smi # to see VRAM usage
xset dpms force off # to turn off the monitor
Test
matcha-tts --text "INPUT TEXT" --checkpoint_path /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/matcha_ljspeech.ckpt --vocoder hifigan_univ_v1 [or hifigan_T2_v1]
matcha-tts --cpu --text "INPUT TEXT" --checkpoint_path /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/checkpoints/last.ckpt --sample_rate 24000 --vocoder hifigan_univ_v1
matcha-tts --file /home/oem/Basir/TTS/HiFi-GAN/MelDataset/metadata_raw.txt --checkpoint_path /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-24000-motahare.ckpt --vocoder /home/oem/Basir/TTS/HiFi-GAN/Trained/MOTAHARE_V1_24KHz/g_00050000 --denoiser_strength 0.00025000 --sample_rate 24000
matcha-tts --cpu --file /home/oem/Basir/TTS/HiFi-GAN/MelDataset/metadata_raw.txt --checkpoint_path /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/checkpoints/last.ckpt --vocoder /home/oem/Basir/TTS/HiFi-GAN/Trained/MOTAHARE_V1_24KHz/g_00050000 --denoiser_strength 0.00025000 --sample_rate 24000
Note: Remember that the default cleaner used by the commands above is set in matcha/cli.py.
Note: Even "--denoiser_strength 0.00025000" (the default) degrades quality; use "--denoiser_strength 0.000001" instead. If there is noise, don't try to suppress it; fix the cause! Noise may come from a mismatch between text and audio, a bad vocoder (for example, using hifigan_T2_v1 for a male voice, or an under-trained HiFi-GAN vocoder), noise in the dataset itself, or not training the Matcha model for long enough.
Convert to ONNX
pip install onnx
python3 -m matcha.onnx.export matcha.ckpt model-5.onnx --n-timesteps 5
python3 -m matcha.onnx.export /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-22050-khadijah.ckpt /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-22050-khadijah-2.onnx --n-timesteps 2
Remember that the more timesteps you use, the longer the processing time. Even a single timestep (--n-timesteps 1) gives good results.
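Processing time grows with the step count because, to my understanding, the decoder network is evaluated once per step of a fixed-step ODE solver. The following is a generic Euler sketch of that idea and not Matcha's code. The toy `velocity` function stands in for the neural network, and all names are illustrative.

```python
def euler_solve(velocity, x0, n_timesteps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with equal steps."""
    x, t = x0, 0.0
    dt = 1.0 / n_timesteps
    for _ in range(n_timesteps):
        x = x + dt * velocity(x, t)  # one (expensive) model call per step
        t += dt
    return x


def count_evaluations(n_timesteps):
    """Run the solver on a toy constant field and count the model calls."""
    calls = []

    def velocity(x, t):
        calls.append(t)  # record each evaluation
        return 2.0  # toy stand-in for the network output

    result = euler_solve(velocity, 0.0, n_timesteps)
    return len(calls), result


for steps in (1, 2, 5):
    n_calls, result = count_evaluations(steps)
    print(f"steps={steps} model_calls={n_calls} result={result:.3f}")
```

The number of model calls equals the number of timesteps, so exporting with 5 steps costs roughly five decoder passes per utterance where 1 step costs one.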
Add meta-data for sherpa
pip install tokenizer
Edit and run add_sherpa_metadata_to_matcha.py:
#!/usr/bin/env python3
import json
import os
from typing import Any, Dict
import onnx
def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    """Add meta data to an ONNX model. It is changed in-place.

    Args:
      filename:
        Filename of the ONNX model to be changed.
      meta_data:
        Key-value pairs.
    """
    model = onnx.load(filename)
    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)
    onnx.save(model, filename)


def main():
    # Caution: Please change the filename
    filename = "/home/oem/Basir/TTS/Matcha/Trained/onnx/matcha-fa_en-musa-12000-5.onnx"
    print("add model metadata")
    meta_data = {
        "model_type": "matcha-tts",
        "language": "Persian+English",
        "voice": "fa",
        "has_espeak": 1,
        "jieba": 0,
        "n_speakers": 1,
        "sample_rate": 12000,
        "version": 1,
        "pad_id": 0,
        "use_icefall": 0,
        "model_author": "Ali Mahmoudi (@mah92)",
        "maintainer": "k2-fsa",
        "use_eos_bos": 0,
        "num_ode_steps": 5,
        "dataset": "Musa-FA_EN-Public-Phone-Audio-Dataset",
        "dataset_url": "https://huggingface.co/datasets/mah92/Musa-FA_EN-Public-Phone-Audio-Dataset",
        "see_also": "https://github.com/k2-fsa/sherpa-onnx/issues/1779",
    }
    print(meta_data)
    add_meta_data(filename, meta_data)


if __name__ == "__main__":
    main()
Note: num_ode_steps in sherpa corresponds to the --n-timesteps value used when converting to ONNX.
Test ONNX
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vocoder hifigan.small.onnx
Contribute your model
Upload your model to Hugging Face and open an issue in the sherpa-onnx GitHub repo. They will add your model.
Attention: sherpa-onnx uses the T1 HiFi-GAN vocoder, which was trained on a single female voice. It gets noisy for male voices and high-pitched sounds. Use vocoders from here instead.
Credits
Special thanks to Masoud Azizi (@Mablue), Amirreza Ramezani (@brightening-eyes), and Dr. Hamid Jafari (Khaneh Noor Iranian Basir).
Special thanks to the people from the @ttsfarsi Telegram channel.
I would also like to thank @csukuangfj from Xiaomi Corporation for the help and care given in the icefall and sherpa-onnx repos.
And we are nothing except through the mercy of our Lord.