
In the name of God, the Compassionate, the Merciful - it is the key to the door of the treasure of the Wise.

Matcha-TTS for the Persian language

This recipe is for training Persian/English TTS models (the middle part of the TTS pipeline, converting IPA phonemes to mel-spectrograms).

The main repo is https://github.com/shivammehta25/Matcha-TTS.

To do this, you will probably need a PyTorch-capable graphics card with 8 GB of VRAM for 12 hours or more (or use Google Colab, a Kaggle notebook, etc.).

Remember that a classic TTS system consists of these parts (a quick phonemizer sketch follows the list):

  1. Vowelizer (phonemizer): converts text to IPA (the International Phonetic Alphabet). Most reading errors (WER) come from this part. The eSpeak library, via espeak-ng or piper_phonemize, is usually used here.
  2. TTS model: converts IPA phonemes to mel-spectrograms.
  3. Vocoder: converts mel-spectrograms to sound. HiFi-GAN is usually used and gives natural results.
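As an illustration of part 1, the phonemizer stage can be tried on its own. This is a minimal sketch using piper_phonemize (installed later in this recipe, and the same call used in the Persian cleaner below); the sample sentence is arbitrary:

import piper_phonemize

# phonemize_espeak returns one phoneme list per input sentence;
# join the first one to get the IPA string the TTS model consumes.
text = "سلام دنیا"  # arbitrary sample sentence ("Hello world")
phonemes = "".join(piper_phonemize.phonemize_espeak(text=text, voice="fa")[0])
print(phonemes)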

Setup environment

sudo apt-get install python3.10-venv
python3.10 -m venv matcha-tts-env
source matcha-tts-env/bin/activate

Install requirements

git clone [email protected]:shivammehta25/Matcha-TTS.git --depth 1
cd Matcha-TTS
pip install -e .

Prepare the dataset

The dataset should follow the LJ Speech layout: a folder of wav files plus a metadata.csv whose lines pair a file ID with its transcript (see the example below).
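For reference, each metadata.csv line pairs a file ID with its transcript, separated by "|" (a hypothetical two-line excerpt):

0001|First sentence of the dataset.
0002|Second sentence of the dataset.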

Split metadata.csv (the dataset transcripts, in LJ Speech format) into train.txt, val.txt and test.txt using split_metadata_csv.py:

import random

# File paths
input_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/metadata.csv"
wav_folder = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/wav"
train_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/train.txt"
validation_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/val.txt"
test_file = "/home/oem/Basir/TTS/Datasets/Phone-Online/Male/test.txt"

# Read the file as raw text
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Transform the format
transformed_lines = []
for line in lines:
    line = line.strip()
    if not line:
        continue  # skip blank lines (e.g., a trailing newline)
    file_id, text = line.split("|", 1)  # split on the first "|"
    transformed_lines.append(f"{wav_folder}/{file_id}.wav|{text}")

# Shuffle the data
random.shuffle(transformed_lines)

# Calculate split sizes
total_lines = len(transformed_lines)
train_size = int(0.95 * total_lines)
validation_size = int(0.045 * total_lines)
test_size = total_lines - train_size - validation_size

# Split the data
train_data = transformed_lines[:train_size]
validation_data = transformed_lines[train_size:train_size + validation_size]
test_data = transformed_lines[train_size + validation_size:]

# Save to files
with open(train_file, "w", encoding="utf-8") as f:
    f.write("\n".join(train_data))

with open(validation_file, "w", encoding="utf-8") as f:
    f.write("\n".join(validation_data))

with open(test_file, "w", encoding="utf-8") as f:
    f.write("\n".join(test_data))

print(f"Data split and saved successfully!")
print(f"Train: {len(train_data)} lines")
print(f"Validation: {len(validation_data)} lines")
print(f"Test: {len(test_data)} lines")

Initialize configuration files

Copy configs/data/ljspeech.yaml to configs/data/custom.yaml and edit it.

Copy configs/experiment/ljspeech.yaml to configs/experiment/custom.yaml and edit it.

Inside configs/data/custom.yaml, change:

train_filelist_path: /home/oem/Basir/TTS/Datasets/Phone-Online/Male/train.txt

valid_filelist_path: /home/oem/Basir/TTS/Datasets/Phone-Online/Male/val.txt

Generate normalisation statistics with the dataset configuration yaml file:

./matcha-tts-env/bin/matcha-data-stats -i custom.yaml -f

Output: {'mel_mean': -7.081411, 'mel_std': 3.500973}
Update these values in configs/data/custom.yaml under the data_statistics key, as shown below.
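With the output above, the block in configs/data/custom.yaml would look like this (nested keys as in the stock ljspeech.yaml):

data_statistics:
  mel_mean: -7.081411
  mel_std: 3.500973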

** If the sample rate is 12 kHz, the tool warns you to reduce n_mels, but HiFi-GAN fails with an error when training such a vocoder, so don't touch n_mels.

Manage VRAM usage

With minimal (8 GB) VRAM, reduce batch_size in configs/data/custom.yaml, e.g. batch_size: 14.

(NOT NEEDED for me with 8 GB of VRAM): to reduce memory further, add the following to configs/experiment/custom.yaml:

model:
  out_size: 172

Set initial checkpoint

Set the initial checkpoint (or null) via ckpt_path in configs/train.yaml, for example:
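The path below is hypothetical; use null to train from scratch:

ckpt_path: /path/to/initial_checkpoint.ckpt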

Code changes needed for Persian

  1. In matcha/text/cleaners.py, in the phonemizer.backend.EspeakBackend(...) call, set language="fa" (a sketch follows).
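Roughly like this; the surrounding arguments follow the stock cleaners.py, so verify against your copy:

# in matcha/text/cleaners.py
global_phonemizer = phonemizer.backend.EspeakBackend(
    language="fa",  # was "en-us"
    preserve_punctuation=True,
    with_stress=True,
)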

  2. Run:

pip install piper-phonemize
  3. In cleaners.py, add the following below english_cleaners_piper:

import piper_phonemize
def persian_cleaners_piper(text):
    """Pipeline for Persian text, including abbreviation expansion. + punctuation + stress"""
    #text = convert_to_ascii(text)
    text = lowercase(text)
    text = expand_abbreviations(text)
    phonemes = "".join(piper_phonemize.phonemize_espeak(text=text, voice="fa")[0])
    phonemes = collapse_whitespace(phonemes)
    
    # Remove unwanted symbols (e.g., '1')
    unwanted_symbols = {'1', '-'}  # Add any other unwanted symbols here
    filtered_phonemes = "".join([char for char in phonemes if char not in unwanted_symbols])
    
    return filtered_phonemes
  4. Also set the cleaner in configs/data/custom.yaml:

cleaners: [persian_cleaners_piper]

  5. Replace symbols.py with:
def read_tokens():
    tokens = []
    with open("/home/oem/Basir/TTS/Matcha/Matcha-TTS/configs/tokens/tokens_sherpa_with_fa.txt", "r", encoding="utf-8") as f:
        for line in f:
            # Remove the newline character at the end
            line = line.rstrip("\n")
            # Split into token and number, preserving whitespace
            if " " in line:
                token = line[:line.index(" ")]  # Extract everything before the first space
                if len(token) == 0: # White-space
                    token = ' '
            else:
                token = line  # If there's no space, the entire line is the token
            tokens.append(token)
    return tokens

symbols = read_tokens()

Change the tokens_sherpa_with_fa path to your own; the expected file format is sketched below.
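The format read by read_tokens is one token per line, followed by a space and its integer ID (a hypothetical excerpt; a line that begins with a space encodes the whitespace token itself):

_ 0
 1
a 2
b 3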

  6. In matcha/cli.py, change the intersperse(text_to_sequence(...)) line to:
    intersperse(text_to_sequence(text, ["persian_cleaners_piper"])[0], 0),

Other changes

  1. For possible errors (due to Python updates), change save_figure_to_numpy in matcha/utils.py to:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import io

def save_figure_to_numpy(fig):
    buf = io.BytesIO()
    fig.savefig(buf, format='png', bbox_inches='tight', pad_inches=0)
    buf.seek(0)
    img = Image.open(buf)
    data = np.array(img)
    buf.close()
    
    return data
  2. To test custom vocoders from the command line, and to test models trained at sample rates other than 22050 Hz, I have made further changes to cli.py. You can find them in the attached files.

Train!

Run the training script:

python matcha/train.py experiment=custom

Monitor using TensorBoard

Go to another bash window and run:

source matcha-tts-env/bin/activate
cd /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/tensorboard/version_0/
tensorboard --logdir=. --bind_all --port=6007

These commands might also be useful; run them from separate windows:

watch -n 1 nvidia-smi # to see vram usage
xset dpms force off # to turn off the monitor

Test

matcha-tts --text "INPUT TEXT" --checkpoint_path /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/matcha_ljspeech.ckpt --vocoder hifigan_univ_v1 [or hifigan_T2_v1]
matcha-tts --cpu --text "INPUT TEXT" --checkpoint_path /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/checkpoints/last.ckpt --sample_rate 24000 --vocoder hifigan_univ_v1
matcha-tts --file /home/oem/Basir/TTS/HiFi-GAN/MelDataset/metadata_raw.txt --checkpoint_path /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-24000-motahare.ckpt --vocoder /home/oem/Basir/TTS/HiFi-GAN/Trained/MOTAHARE_V1_24KHz/g_00050000 --denoiser_strength 0.00025000 --sample_rate 24000

matcha-tts --cpu --file /home/oem/Basir/TTS/HiFi-GAN/MelDataset/metadata_raw.txt  --checkpoint_path /home/oem/Basir/TTS/Matcha/Matcha-TTS/logs/train/custom/runs/2025-02-07_10-13-16/checkpoints/last.ckpt --vocoder /home/oem/Basir/TTS/HiFi-GAN/Trained/MOTAHARE_V1_24KHz/g_00050000 --denoiser_strength 0.00025000 --sample_rate 24000

Note: Remember that the default cleaner used by the above commands is set in matcha/cli.py.

Note: Even "--denoiser_strength 0.00025000" (default) has bad effects on quality, use "--denoiser_strength 0.000001". If there is noise, don't try to suppress it, solve the problem! Noise might be of problems in the text-voice mismatch, bad vocoder (for example using hifigan_T2_v1 for male voice or an under-trained hifigan vocoder), noise in dataset itself or not training matcha model for enough time.

Convert to ONNX

pip install onnx
python3 -m matcha.onnx.export matcha.ckpt model-5.onnx --n-timesteps 5
python3 -m matcha.onnx.export /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-22050-khadijah.ckpt /home/oem/Basir/TTS/Matcha/Trained/inital_checkpoints/phone-22050-khadijah-2.onnx --n-timesteps 2

Remember that the higher the number of timesteps, the longer the processing time. Even timesteps == 1 gives good results.

Add metadata for sherpa

pip install tokenizer

Edit and run add_sherpa_metadata_to_matcha.py:

#!/usr/bin/env python3

import json
import os
from typing import Any, Dict
import onnx


def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    """Add meta data to an ONNX model. It is changed in-place.

    Args:
      filename:
        Filename of the ONNX model to be changed.
      meta_data:
        Key-value pairs.
    """
    model = onnx.load(filename)
    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)

    onnx.save(model, filename)

def main():
    # Caution: Please change the filename
    filename = "/home/oem/Basir/TTS/Matcha/Trained/onnx/matcha-fa_en-musa-12000-5.onnx"

    print("add model metadata")
    meta_data = {
        "model_type": "matcha-tts",
        "language": "Persian+English",
        "voice": "fa",
        "has_espeak": 1,
        "jieba": 0,
        "n_speakers": 1,
        "sample_rate": 12000,
        "version": 1,
        "pad_id": 0,
        "use_icefall": 0,
        "model_author": "Ali Mahmoudi (@mah92)",
        "maintainer": "k2-fsa",
        "use_eos_bos": 0,
        "num_ode_steps": 5,
        "dataset": "Musa-FA_EN-Public-Phone-Audio-Dataset",
        "dataset_url": "https://huggingface.co/datasets/mah92/Musa-FA_EN-Public-Phone-Audio-Dataset",
        "see_also": "https://github.com/k2-fsa/sherpa-onnx/issues/1779",
    }
    print(meta_data)
    add_meta_data(filename, meta_data)


if __name__ == "__main__":
    main()

Note: num_ode_steps in sherpa corresponds to the --n-timesteps value used when converting to ONNX.

Test ONNX

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vocoder hifigan.small.onnx

Contribute your model

Upload your model to Hugging Face and open an issue in the sherpa-onnx GitHub repo. They will add your model.

Attention: sherpa-onnx uses the T1 HiFi-GAN vocoder, which is trained on a single female voice. It gets noisy for male voices and high-pitched letters. Use vocoders from here instead.

Credits

Special thanks to Masoud Azizi (@Mablue), Amirreza Ramezani (@brightening-eyes), and Dr. Hamid Jafari (Khaneh Noor Iranian Basir).

Special thanks to the people of the @ttsfarsi Telegram channel.

I should also thank @csukuangfj from Xiaomi for his help and care in the icefall and sherpa-onnx repos.

And we are nothing except by the mercy of our Lord.
