# RepCodec: A Speech Representation Codec for Speech Tokenization
> [**RepCodec: A Speech Representation Codec for Speech Tokenization**](https://arxiv.org/abs/2309.00169)
## Introduction
**RepCodec** is a speech tokenization method for converting a speech waveform into a sequence of discrete semantic
tokens.
The main idea is to train a representation codec which learns a vector quantization codebook through reconstructing the
input speech representations from speech encoders like HuBERT or data2vec.
Extensive experiments show that RepCodec significantly outperforms the widely used k-means clustering approach in both
speech understanding and generation.
Also, RepCodec generalizes well across various speech encoders and languages.
<img src="images/RepCodec.png" alt="se" width="1000" />
## RepCodec Models
| Feature Type | Speech Data | RepCodec Model |
|-----------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| [HuBERT base](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#pre-trained-and-fine-tuned-asr-models) layer 9 | [Librispeech](http://www.openslr.org/12) train-clean-100 | [hubert_base_l9](https://drive.google.com/file/d/1XD0HKl607FFjri2-VJT7lHQeSpxsCCFO/view?usp=sharing) |
| [HuBERT large](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#pre-trained-and-fine-tuned-asr-models) layer 18 | [Librispeech](http://www.openslr.org/12) train-clean-100 | [hubert_large_l18](https://drive.google.com/file/d/1mTbm5GeJ7gp_5L3QLP-JGXdf8RnRw5n6/view?usp=sharing) |
| [data2vec base](https://github.com/facebookresearch/fairseq/blob/main/examples/data2vec/README.md#speech-2) layer 6 | [Librispeech](http://www.openslr.org/12) train-clean-100 | [data2vec_base_l6](https://drive.google.com/file/d/1d8sf3Ko_fYM9zlaiwxK_4xusLRKV5EMd/view?usp=sharing) |
| [data2vec large](https://github.com/facebookresearch/fairseq/blob/main/examples/data2vec/README.md#speech-2) layer 18 | [Librispeech](http://www.openslr.org/12) train-clean-100 | [data2vec_large_l18](https://drive.google.com/file/d/1nuRIHaejT-uVi4cluftbT8o_JZqar5SU/view?usp=sharing) |
| [Whisper medium](https://github.com/openai/whisper/tree/main#available-models-and-languages) layer 24 | [Librispeech](http://www.openslr.org/12) train-clean-100 | [whisper_medium_l24](https://drive.google.com/file/d/1V6YJSA2V4iywXrecJAN0oqsa3aHowexZ/view?usp=sharing) |
| [Whisper large-v2](https://github.com/openai/whisper/tree/main#available-models-and-languages) layer 32 | [Librispeech](http://www.openslr.org/12) train-clean-100 | [whisper_large_l32](https://drive.google.com/file/d/1k_X7ZMPg8iOeDrIJe70v6CHfFygzufXC/view?usp=sharing) |
## Speech Tokenization Using Pre-Trained Models
### Installation
Please first install RepCodec by
```
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
```
We used Python 3.9.18 and PyTorch 1.12.1 to test the usage, but the code should be compatible with other recent Python
and PyTorch versions.
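To confirm the installation succeeded, a quick import check (using the same import path as in [Python Usage](#python-usage) below):
```python
# Quick sanity check after installation.
from repcodec.RepCodec import RepCodec

print(RepCodec)
```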
### Representation Preparation
We adapt the `dump_hubert_feature.py` script
from [fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert/simple_kmeans#hubert-feature)
to support dumping representations from **data2vec**, **HuBERT**, or **Whisper** encoders.
If you use our script (`examples/dump_feature.py`), please also install the following packages:
```
pip install npy_append_array soundfile
```
Additionally, if you want to dump representations from
- **data2vec** or **HuBERT**: please follow [fairseq's instructions](https://github.com/facebookresearch/fairseq#requirements-and-installation) to install the latest fairseq.
- **Whisper**: please follow [Whisper's instructions](https://github.com/openai/whisper/tree/main#setup) to install the latest Whisper.
Then, you can follow the given examples to dump representations:
```
# Example 1: dump from HuBERT base layer 9
# (for data2vec, simply change "model_type" to data2vec and "ckpt_path" to the path of data2vec model)
layer=9
python3 examples/dump_feature.py \
--model_type hubert \
--tsv_path /path/to/tsv/file \
--ckpt_path /path/to/HuBERT/model \
--layer ${layer} \
--feat_dir /dir/to/save/representations
# Example 2: dump from Whisper medium layer 24
layer=24
python3 examples/dump_feature.py \
--model_type whisper \
--tsv_path /path/to/tsv/file \
--whisper_root /directory/to/save/whisper/model \
--whisper_name medium \
--layer ${layer} \
--feat_dir /dir/to/save/representations
```
Explanations about the args:
- **model_type:** choose from `data2vec`, `hubert`, and `whisper`.
- **tsv_path:** path to the tsv file. It should have the following format:
```
/dir/to/dataset
path_of_utterance_1 number_of_frames
path_of_utterance_2 number_of_frames
```
You can follow [this script](https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/wav2vec_manifest.py)
to generate the tsv file.
For example, by running
```
python wav2vec_manifest.py \
/dir/to/LibriSpeech/dev-clean \
--dest /dir/to/manifest \
--ext flac \
--valid-percent 0
```
you can obtain the `dev-clean.tsv` in `/dir/to/manifest` for LibriSpeech. (By default, the output file name
is `train.tsv`. Remember to rename the file.)
It should be similar to:
```
/dir/to/LibriSpeech/dev-clean
2277/149896/2277-149896-0026.flac 78720
2277/149896/2277-149896-0005.flac 89600
2277/149896/2277-149896-0033.flac 45520
```
- **ckpt_path**: required for data2vec and HuBERT.
  You need to download the model yourself from the [data2vec website](https://github.com/facebookresearch/fairseq/blob/main/examples/data2vec/README.md#speech-2)
  or the [HuBERT website](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#pre-trained-and-fine-tuned-asr-models).
  `--ckpt_path` is the path to the downloaded data2vec/HuBERT checkpoint.
- **whisper_root** and **whisper_name**: **both** are required for Whisper.
  If the corresponding model is not found in `--whisper_root`, the script will download it for you.
- **layer**: the Transformer encoder layer from which the representations are extracted.
  It is **1-based**: for example, `layer=9` dumps the outputs of the 9<sup>th</sup> Transformer encoder layer.
  Range: [1, number of Transformer encoder layers].
- **feat_dir**: The output representations will be saved to `${feat_dir}/0_1.npy`
and `${feat_dir}/0_1.len`.
For other useful functionalities (e.g., sharding), please check the argument list in `examples/dump_feature.py`.
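For reference, the dump follows fairseq's simple_kmeans style: `0_1.npy` stores all frames of all utterances concatenated in tsv order, and `0_1.len` stores one frame count per utterance. The following is a hedged sketch of reading the dump back (paths are placeholders; please verify the exact layout against `examples/dump_feature.py`):
```python
# Hedged sketch: split the concatenated feature matrix back into per-utterance arrays.
import numpy as np

feat_dir = "/dir/to/save/representations"   # placeholder, same as --feat_dir above
feats = np.load(f"{feat_dir}/0_1.npy")      # assumed shape: (total_frames, hidden_dim)
with open(f"{feat_dir}/0_1.len") as f:
    lengths = [int(line) for line in f]

offsets = np.cumsum([0] + lengths)
utterances = [feats[start:end] for start, end in zip(offsets[:-1], offsets[1:])]
print(len(utterances), utterances[0].shape)
```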
### Command Line Usage
We expect `${feat_dir}/0_1.npy` and `${feat_dir}/0_1.len` to be present in the provided
directory `/dir/to/representations`.
Also, the tsv file should be the **same** as the one used in [Representation Preparation](#representation-preparation).
```
repcodec /dir/to/representations \
--model /path/to/repcodec/model \
--tsv_path /path/to/tsv/file \
[--model_config_path /path/to/train/config] \
[--use_gpu] \
[--out_dir /path/to/output]
```
If you trained the model yourself following [Training New RepCodec Models](#training-new-repcodec-models),
please provide the training config file using `--model_config_path`.
If you use a model we provide [here](#repcodec-models), you do not need to provide it.
This command tokenizes the representations and saves the output discrete tokens to `${out_dir}/tokens`.
The tokens are in the same order as the provided tsv file.
An example of the output file:
```
/dir/to/LibriSpeech/dev-clean
2277/149896/2277-149896-0026.flac 696 696 198 198 198 498 ...
2277/149896/2277-149896-0005.flac 696 696 198 198 198 907 ...
2277/149896/2277-149896-0033.flac 696 696 198 198 198 696 ...
```
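Assuming the format above (a root line followed by `utterance_path token token ...` lines), here is a small sketch for loading the token file back into Python:
```python
# Hedged sketch: parse ${out_dir}/tokens into {utterance_path: [token ids]}.
tokens_path = "/path/to/output/tokens"   # placeholder for ${out_dir}/tokens

utt2tokens = {}
with open(tokens_path) as f:
    root = f.readline().strip()          # first line is the dataset root
    for line in f:
        utt_path, *ids = line.split()
        utt2tokens[utt_path] = [int(i) for i in ids]

print(root, len(utt2tokens))
```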
Under `examples/tokens`, we provide some token files as references. They are obtained from the LibriSpeech dev-clean subset
using the 6 types of representations and the corresponding [RepCodec Models](#repcodec-models).
Your results should be very similar to ours.
### Python Usage
```python
import torch
import yaml
from repcodec.RepCodec import RepCodec
# for HuBERT base & data2vec base features, please use repcodec_dim768.yaml;
# for HuBERT large, data2vec large & Whisper medium features, please use repcodec_dim1024.yaml;
# for Whisper large-v2 features, please use repcodec_dim1280.yaml
config = "repcodec/configs/repcodec_dim768.yaml"
with open(config) as fp:
conf = yaml.load(fp, Loader=yaml.FullLoader)
model = RepCodec(**conf)
model.load_state_dict(torch.load("./hubert_base_l9.pkl", map_location="cpu")["model"]["repcodec"])
model.quantizer.initial()
model.eval()
# input shape: (batch size, hidden dim, sequence length)
random_features = torch.randn(size=(1, 768, 100))
with torch.no_grad():
x = model.encoder(random_features)
z = model.projector(x)
_, idx = model.quantizer.codebook.forward_index(z.transpose(2, 1))
tokens = idx.cpu().data.numpy().tolist()[0]
```
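To tokenize real representations instead of random features, you can load one utterance from the dump produced in [Representation Preparation](#representation-preparation) and transpose it to the (batch size, hidden dim, sequence length) layout expected above. A hedged sketch, reusing `model` and `torch` from the block above (paths and the dump layout are assumptions):
```python
# Hedged sketch: tokenize the first dumped utterance with the model loaded above.
import numpy as np

feat_dir = "/dir/to/save/representations"    # placeholder, same as --feat_dir above
feats = np.load(f"{feat_dir}/0_1.npy")       # assumed shape: (total_frames, hidden_dim)
with open(f"{feat_dir}/0_1.len") as f:
    first_len = int(f.readline())            # length of the first utterance

# (frames, dim) -> (1, dim, frames), matching the random_features example
utt = torch.from_numpy(feats[:first_len]).float().T.unsqueeze(0)

with torch.no_grad():
    x = model.encoder(utt)
    z = model.projector(x)
    _, idx = model.quantizer.codebook.forward_index(z.transpose(2, 1))
    tokens = idx.cpu().data.numpy().tolist()[0]
print(tokens[:10])
```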
## Training New RepCodec Models
We use a config file to set up all the training configurations, e.g., data, model architecture,
optimizer, scheduler.
We provide an example [here](./train_configs/ex_dim768_mse.yaml).
Please first install required packages following [Installation](#installation)
and prepare the representations following [Representation Preparation](#representation-preparation).
The input data directory is expected to have the following structure:
```
/dir/to/representations/
train_set_name/
0_1.npy
0_1.len
valid_set_name/
0_1.npy
0_1.len
test_set_name/
0_1.npy
0_1.len
```
The names of subsets should be the same as the fields in the config file.
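If the dumps were produced per subset in separate directories, the following sketch arranges them into this layout (subset names and source paths are placeholders; the subset names must match the fields in your config):
```python
# Hedged sketch: copy per-subset dumps into the expected training layout.
import shutil
from pathlib import Path

data_root = Path("/dir/to/representations")
dumps = {  # placeholder subset names and dump locations
    "train_set_name": "/dir/to/dumped/train",
    "valid_set_name": "/dir/to/dumped/valid",
    "test_set_name": "/dir/to/dumped/test",
}

for subset, src in dumps.items():
    dst = data_root / subset
    dst.mkdir(parents=True, exist_ok=True)
    for fname in ("0_1.npy", "0_1.len"):
        shutil.copy(Path(src) / fname, dst / fname)
```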
Then, you can run training by
```
python train.py \
-c /path/to/config/file \
--tag $tag \
--exp_root exp
```
`tag` is the name of the output folder.
All outputs will be saved to `exp_root/tag/`.
## Acknowledgement
Our implementation is based on [facebookresearch/AudioDec](https://github.com/facebookresearch/AudioDec).
We thank them for open-sourcing their code!
## Citation
If you find our work useful, please cite the following article.
```
@misc{huang2023repcodec,
title={RepCodec: A Speech Representation Codec for Speech Tokenization},
author={Zhichao Huang and Chutong Meng and Tom Ko},
year={2023},
eprint={2309.00169},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
```