# CapSpeech-NAR
## Preprocess Data
You can run `data/process.sh`, or run the steps below one at a time.
1. Prepare json files. Run:
```bash
SAVE_DIR='./capspeech' # to save processed data
CACHE_DIR='./cache' # to save dataset cache
MLS_WAV_DIR='' # downloaded mls wav path
LIBRITTSRMIX_WAV_DIR='' # downloaded librittsrmix wav path
GIGASPEECH_WAV_DIR='' # downloaded gigaspeech wav path
COMMONVOICE_WAV_DIR='' # downloaded commonvoice wav path
EMILIA_WAV_DIR='' # downloaded emilia wav path
CPUS=30
N_WORKERS=8
BATCH_SIZE=64
python preprocess.py \
--save_dir ${SAVE_DIR} \
--cache_dir ${CACHE_DIR} \
--libriRmix_wav_dir ${LIBRITTSRMIX_WAV_DIR} \
--mls_wav_dir ${MLS_WAV_DIR} \
--commonvoice_dir ${COMMONVOICE_WAV_DIR} \
--gigaspeech_dir ${GIGASPEECH_WAV_DIR} \
--emilia_dir ${EMILIA_WAV_DIR} \
--splits train val \
--audio_min_length 3.0 \
--audio_max_length 18.0
```
Notes:

- `SAVE_DIR`: where the processed data is saved.
- `CACHE_DIR`: where the downloaded Hugging Face data is cached.
- `MLS_WAV_DIR`: root of the downloaded MLS (English) audio. It should contain paths like `mls_english/test/audio/10226/10111/10226_10111_000001.flac`.
- `COMMONVOICE_WAV_DIR`: root of the downloaded Common Voice (English) audio. It should contain paths like `commonvoice/common_voice_en_20233751.wav`.
- `GIGASPEECH_WAV_DIR`: root of the downloaded GigaSpeech audio. It should contain paths like `gigaspeech/AUD0000000468_S0000654.wav`.
- `LIBRITTSRMIX_WAV_DIR`: root of the downloaded LibriTTS-R Mix audio. It should contain paths like `LibriTTS_R/test-clean/1089/134686/1089_134686_000001_000001_01.wav`.
- `EMILIA_WAV_DIR`: root of the downloaded Emilia audio. It should contain files like `EN_B00020_S00165_W000096.mp3`.

You will get a `jsons` folder with `.json` files like this:
```
[
{
"segment_id": "1089_134686_000001_000001_01",
"audio_path": "/data/capspeech-data/librittsr-mix/LibriTTS_R/test-clean/1089/134686/1089_134686_000001_000001_01.wav",
"text": "<train_whistling> he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled <B_start> out in thick peppered flour fattened sauce stuff it into you his belly counselled him <B_end>",
"caption": "A middle-aged male's speech is characterized by a steady, slightly somber tone, with his voice carrying a moderately low pitch. His speech pace is moderate, neither too quick nor too slow, lending an air of calm and measured thoughtfulness to his delivery.",
"duration": 12.79125,
"source": "libritts-r"
},
...
]
```
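As a quick sanity check on the generated files, the sketch below validates entries of the shape shown above: it checks that the six fields from the example are present and that `duration` falls inside the `--audio_min_length` / `--audio_max_length` window passed to `preprocess.py`. The function names are hypothetical, and whether `preprocess.py` treats the bounds as inclusive is not stated, so this sketch assumes they are.

```python
import json

# Field names taken from the example entry above.
REQUIRED_FIELDS = ("segment_id", "audio_path", "text", "caption", "duration", "source")


def check_entries(entries, min_len=3.0, max_len=18.0):
    """Return a list of (index, problem) tuples; an empty list means all entries passed."""
    problems = []
    for i, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems.append((i, "missing fields: " + ", ".join(missing)))
            continue
        # Assumption: bounds are inclusive, matching the 3.0 / 18.0 values in the command.
        if not (min_len <= entry["duration"] <= max_len):
            problems.append((i, f"duration {entry['duration']} outside [{min_len}, {max_len}]"))
    return problems


def check_json_file(path, **kwargs):
    """Load one JSON file (a list of entries) and validate it."""
    with open(path, encoding="utf-8") as f:
        return check_entries(json.load(f), **kwargs)
```

For example, `check_json_file("./capspeech/jsons/train.json")` would return the problems found in that file; the file name here is a placeholder, since the actual names depend on what `preprocess.py` writes.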
2. Phonemize. Run:
```bash
SAVE_DIR='./capspeech'
CPUS=30
python phonemize.py \
--save_dir ${SAVE_DIR} \
--num_cpus ${CPUS}
```
You will get a `g2p` folder with `.txt` files.
3. Encode the captions as T5 embeddings. Run:
```bash
SAVE_DIR='./capspeech'
python caption.py \
--save_dir ${SAVE_DIR}
```
You will get a `t5` folder with `.npz` files.
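To confirm that the `.npz` files were written, one option is to list the arrays they contain without loading them. An `.npz` file is a zip archive with one `.npy` member per array, so the standard library is enough. This is a minimal sketch; the helper name `npz_array_names` is hypothetical, and the array names that `caption.py` actually stores are not documented here.

```python
import zipfile


def npz_array_names(path):
    """List the array names stored in an .npz file without loading the arrays."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as "<name>.npy" inside the archive.
        return [name[: -len(".npy")] for name in archive.namelist() if name.endswith(".npy")]
```

If NumPy is available, `numpy.load(path).files` gives the same list.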
4. Make manifests. Run:
```bash
SAVE_DIR='./capspeech'
python filemaker.py \
--save_dir ${SAVE_DIR}
```
You will get a `manifest` folder with `.txt` files like this:
```
1995_1826_000016_000004_01 playing_accordion
1995_1826_000016_000007_01 underwater_bubbling
1995_1826_000016_000008_01 telephone
1995_1826_000016_000009_01 eletric_blender_running
1995_1826_000016_000010_01 harmonica
```
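The manifest lines above can be read back with a few lines of Python. This sketch assumes each line holds a segment ID followed by one whitespace-separated field (in the sample, what looks like a sound-event tag); manifests for other tasks may carry different columns, and the function name `read_manifest` is hypothetical.

```python
def read_manifest(lines):
    """Parse manifest lines into (segment_id, tag) pairs, skipping blank lines."""
    pairs = []
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue
        segment_id = parts[0]
        # Assumption: everything after the ID is a single tag field.
        tag = parts[1].strip() if len(parts) > 1 else ""
        pairs.append((segment_id, tag))
    return pairs
```

Since an open file iterates over its lines, `read_manifest(open(path, encoding="utf-8"))` works directly, and `collections.Counter(tag for _, tag in pairs)` gives a quick count per tag.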
5. Make vocab. Run:
```bash
SAVE_DIR='./capspeech'
python vocab.py \
--save_dir ${SAVE_DIR}
```
You will get a `vocab.txt` file.
📝 **Note:** We provide the following scripts to process our data. Make sure to change the paths in them to your own.
1. Preprocess pretraining data:
```bash
bash data_preprocessing/process_pretrain.sh
```
2. Preprocess CapTTS, EmoCapTTS and AccCapTTS data:
```bash
bash data_preprocessing/process_captts.sh
```
3. Preprocess CapTTS-SE data:
```bash
bash data_preprocessing/process_capttsse.sh
```
4. Preprocess AgentTTS data:
```bash
bash data_preprocessing/process_agenttts.sh
```
## Pretrain
```bash
accelerate launch train.py --config-name "./configs/pretrain.yaml"
```
## Finetune on CapTTS
```bash
accelerate launch finetune.py --config-name "./configs/finetune_captts.yaml" --pretrained-ckpt "YOUR_MODEL_PATH"
```
## Finetune on EmoCapTTS
```bash
accelerate launch finetune.py --config-name "./configs/finetune_emocaptts.yaml" --pretrained-ckpt "YOUR_MODEL_PATH"
```
## Finetune on AccCapTTS
```bash
accelerate launch finetune.py --config-name "./configs/finetune_acccaptts.yaml" --pretrained-ckpt "YOUR_MODEL_PATH"
```
## Finetune on CapTTS-SE
```bash
accelerate launch finetune.py --config-name "./configs/finetune_capttsse.yaml" --pretrained-ckpt "YOUR_MODEL_PATH"
```
## Finetune on AgentTTS
```bash
accelerate launch finetune.py --config-name "./configs/finetune_agenttts.yaml" --pretrained-ckpt "YOUR_MODEL_PATH"
```
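The five finetuning commands above differ only in the config file. If you want to launch them from a script, the sketch below builds the argument list for a given task. The config paths and flags are copied from the commands above; the dictionary keys and the function name `finetune_command` are hypothetical.

```python
# Config paths copied from the finetuning commands above.
FINETUNE_CONFIGS = {
    "CapTTS": "./configs/finetune_captts.yaml",
    "EmoCapTTS": "./configs/finetune_emocaptts.yaml",
    "AccCapTTS": "./configs/finetune_acccaptts.yaml",
    "CapTTS-SE": "./configs/finetune_capttsse.yaml",
    "AgentTTS": "./configs/finetune_agenttts.yaml",
}


def finetune_command(task, pretrained_ckpt):
    """Build the `accelerate launch` argument list for one finetuning task."""
    return [
        "accelerate", "launch", "finetune.py",
        "--config-name", FINETUNE_CONFIGS[task],
        "--pretrained-ckpt", pretrained_ckpt,
    ]
```

The result can be passed to `subprocess.run(finetune_command("CapTTS", "YOUR_MODEL_PATH"), check=True)`; an unknown task name raises `KeyError` before anything is launched.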
## Train a duration predictor
```bash
python duration_predictor.py
```