Commit 3b53536 (1 parent: 063311e), committed by zhihaodu

init commit

Files changed (6)
  1. LICENSE +21 -0
  2. README.md +150 -0
  3. config.yaml +200 -0
  4. example/example.wav +0 -0
  5. fig/framework.png +0 -0
  6. model.pth +3 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Alibaba Inc.
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,153 @@
1
  ---
2
+ language: en
3
+ tags:
4
+ - speech quantization
5
  license: mit
6
+ datasets:
7
+ - in-house
8
  ---
9
+
10
+ # Highlights
11
+ This model performs speech coding (quantization) on English and Chinese utterances.
12
+ - Trained on a large-scale in-house dataset, robust to many scenarios
13
+ - Lower frame rate: 25 tokens/s for each quantizer
14
+ - Achieves higher codec quality at low bandwidths
15
+ - Trained with structured dropout, enabling various bandwidths during inference with a single model
16
+ - Quantizes a raw speech waveform into a sequence of discrete tokens
17
+
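The 25 tokens/s figure follows from the sampling rate and encoder hop length recorded in config.yaml. The sketch below recomputes it, together with the bitrate for a given number of active quantizers. The helper names are illustrative and not part of FunCodec, and the bitrate assumes each token costs log2(codebook_size) bits with no entropy coding.

```python
import math

# Values copied from config.yaml in this commit.
SAMPLING_RATE = 16000             # sampling_rate
ENCODER_RATIOS = [8, 5, 4, 2, 2]  # encoder_conf.ratios
HOP_LENGTH = 640                  # encoder_hop_length
CODEBOOK_SIZE = 1024              # quantizer_conf.codebook_size


def frames_per_second() -> float:
    """Encoder frames per second: one frame per HOP_LENGTH samples."""
    return SAMPLING_RATE / HOP_LENGTH


def tokens_per_second(num_quantizers: int) -> float:
    """Each active quantizer emits one token per encoder frame."""
    return frames_per_second() * num_quantizers


def bits_per_second(num_quantizers: int) -> float:
    """Assumes log2(CODEBOOK_SIZE) bits per token (hypothetical, no entropy coding)."""
    return tokens_per_second(num_quantizers) * math.log2(CODEBOOK_SIZE)


for n in (1, 4, 8, 32):
    print(n, tokens_per_second(n), bits_per_second(n))
```

Under these assumptions, 4 and 8 active quantizers give 100 and 200 tokens/s, which would match the column headers of the results table if those columns count tokens across all quantizers.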
18
+ # FunCodec model
19
+ This model is trained with [FunCodec](https://github.com/alibaba-damo-academy/FunCodec),
20
+ an open-source toolkit for speech quantization (codec) from the DAMO Academy, Alibaba Group.
21
+ This repository provides a pre-trained model on the LibriTTS corpus.
22
+ It can be applied to low-bandwidth speech communication, speech quantization, zero-shot speech synthesis
23
+ and other academic research topics.
24
+ Compared with [EnCodec](https://arxiv.org/abs/2210.13438) and [SoundStream](https://arxiv.org/abs/2107.03312),
25
+ the following improved techniques are utilized to train the model, resulting in higher codec quality and
26
+ [ViSQOL](https://github.com/google/visqol) scores at the same bandwidth:
27
+ - A magnitude spectrum loss is employed to enhance the middle- and high-frequency signals
28
+ - Structured dropout is employed to smooth the code space and to enable various bandwidths in a single model
29
+ - Codes are initialized by k-means clusters rather than random values
30
+ - Codebooks are maintained with an exponential moving average and a dead-code elimination mechanism, resulting in high codebook utilization.
31
+
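Structured dropout can be pictured as choosing, at each training step, how many of the cascaded quantizers are kept. The following is a minimal sketch under the assumption that the count is drawn uniformly from the `rand_num_quant` list in config.yaml; the actual sampling scheme used by FunCodec is not shown in this commit, and `sample_num_quantizers` is a hypothetical helper.

```python
import random

# config.yaml: quantizer_conf.rand_num_quant
RAND_NUM_QUANT = [2, 4, 8, 16, 32]


def sample_num_quantizers(rng: random.Random, choices=RAND_NUM_QUANT) -> int:
    """Pick how many quantizers stay active for one training step.

    Uniform sampling is an assumption made for illustration only.
    """
    return rng.choice(choices)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_num_quantizers(rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Because the model sees every listed quantizer count during training, a single checkpoint can be decoded with any of them at inference time.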
32
+ ## Model description
33
+ This model is a variational autoencoder that uses residual vector quantisation (RVQ) to obtain
34
+ several parallel sequences of discrete latent representations. Here is an overview of FunCodec models.
35
+ <p align="center">
36
+ <img src="fig/framework.png" alt="FunCodec architecture"/>
37
+ </p>
38
+
39
+ In general, FunCodec models consist of five modules: a domain transformation module,
40
+ an encoder, an RVQ module, a decoder and a domain inversion module, plus an optional semantic-token step.
41
+ - Domain Transformation: transform signals into the time domain, short-time frequency domain, magnitude-angle domain or magnitude-phase domain.
42
+ - Encoder: encode signals into compact representations with stacked convolutional and LSTM layers.
43
+ - Semantic tokens (optional): augment encoder outputs with semantic tokens to enhance the content information; not used in this model.
44
+ - RVQ: quantize the representations into parallel sequences of discrete tokens with cascaded vector quantizers.
45
+ - Decoder: decode quantized embeddings into the same signal domain as the inputs.
46
+ - Domain Inversion: re-synthesize perceptible waveforms from different domains.
47
+
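The RVQ step described above can be sketched in a few lines of plain Python. This is a toy illustration of cascaded vector quantizers, not FunCodec's implementation: the codebooks are random, the sizes are tiny, and the function names are made up for this example. Each quantizer encodes the residual left by the previous one, and decoding with only the first few tokens gives a coarser, lower-bandwidth reconstruction.

```python
import random


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sqdist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda k: sqdist(codebook[k]))


def rvq_encode(frame, codebooks):
    """Quantize one frame with cascaded codebooks; returns one token per quantizer."""
    residual = list(frame)
    tokens = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        tokens.append(k)
        # The next quantizer only sees what this one failed to represent.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected codes; passing fewer tokens uses fewer quantizers."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Toy sizes chosen for illustration (the released model uses 32 codebooks of 1024 codes).
rng = random.Random(0)
DIM, TOY_CODEBOOK_SIZE, TOY_NUM_QUANTIZERS = 4, 16, 8
codebooks = [
    [[rng.gauss(0, 0.5 ** q) for _ in range(DIM)] for _ in range(TOY_CODEBOOK_SIZE)]
    for q in range(TOY_NUM_QUANTIZERS)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]
tokens = rvq_encode(frame, codebooks)
coarse = rvq_decode(tokens[:2], codebooks)  # first 2 quantizers only
fine = rvq_decode(tokens, codebooks)        # all 8 quantizers
```

In a real codec this runs once per encoder frame, which is why the output is a set of parallel token sequences, one per quantizer.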
48
+ More details can be found at:
49
+ - Paper: [FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec](https://arxiv.org/abs/2309.07405)
50
+ - Codebase: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec)
51
+
52
+ ## Intended uses & scenarios
53
+ ### Inference with FunCodec
54
+
55
+ You can extract codec tokens and reconstruct them back into waveforms with the FunCodec repository.
56
+
57
+ #### FunCodec installation
58
+ ```sh
59
+ # Install PyTorch with GPU support (version >= 1.12.0):
60
+ conda install pytorch==1.12.0
61
+ # For other versions, please refer to https://pytorch.org/get-started/locally
62
+
63
+ # Download codebase:
64
+ git clone https://github.com/alibaba-damo-academy/FunCodec.git
65
+
66
+ # Install FunCodec codebase:
67
+ cd FunCodec
68
+ pip install --editable ./
69
+ ```
70
+
71
+ #### Codec extraction
72
+ ```sh
73
+ # Enter the example directory
74
+ cd egs/LibriTTS/codec
75
+ # Specify the model name
76
+ model_name="audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch"
77
+ # Download the model
78
+ git lfs install
79
+ git clone https://huggingface.co/alibaba-damo/${model_name}
80
+ mkdir exp
81
+ mv ${model_name} exp/$model_name
82
+ # Extract codecs for the files listed in "input_wav.scp"; the codecs are saved under "outputs/codecs"
83
+ bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
84
+ --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
85
+ --wav_scp input_wav.scp --out_dir outputs/codecs
86
+ # input_wav.scp has the following format:
87
+ # uttid1 path/to/file1.wav
88
+ # uttid2 path/to/file2.wav
89
+ # ...
90
+ ```
91
+
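The `input_wav.scp` file described above is a plain text list of `uttid path` pairs. One way to generate it from a directory of wav files is sketched below; `write_wav_scp` is a hypothetical helper, and using the file stem as the utterance ID is an assumption (it would produce duplicate IDs if two files in different subdirectories share a name).

```python
import tempfile
from pathlib import Path


def write_wav_scp(wav_dir, scp_path):
    """Write one "uttid path/to/file.wav" line per wav file found under wav_dir."""
    wav_paths = sorted(Path(wav_dir).rglob("*.wav"))
    # Assumption: the file stem (name without extension) serves as the utterance ID.
    lines = [f"{p.stem} {p}" for p in wav_paths]
    Path(scp_path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


# Small self-contained demo using empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("utt1.wav", "utt2.wav", "notes.txt"):
        (Path(tmp) / name).touch()
    scp_file = Path(tmp) / "input_wav.scp"
    demo_count = write_wav_scp(tmp, scp_file)
    demo_lines = scp_file.read_text(encoding="utf-8").splitlines()
    demo_expected = [
        f"utt1 {Path(tmp) / 'utt1.wav'}",
        f"utt2 {Path(tmp) / 'utt2.wav'}",
    ]
```

The resulting file can then be passed to `encoding_decoding.sh` through `--wav_scp`.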
92
+ #### Reconstruct waveforms from codecs
93
+ ```shell
94
+ # Reconstruct waveforms into "outputs/recon_wavs"
95
+ bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
96
+ --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
97
+ --wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs
98
+ # codecs.txt is the output of stage 1, which has the following format:
99
+ # uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
100
+ # uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
101
+ # ...
102
+ ```
103
+
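For downstream use, each line of `codecs.txt` can be split into an utterance ID and a nested list of token indices. The sketch below assumes the bracketed part is valid JSON, as the format comment above suggests; `parse_codec_line` is a hypothetical helper, and the meaning of each nesting level should be checked against the FunCodec code before relying on it.

```python
import json


def parse_codec_line(line):
    """Split a codecs.txt line into (utterance ID, nested list of token indices)."""
    utt_id, payload = line.strip().split(maxsplit=1)
    # Assumption: the nested list is JSON-compatible.
    return utt_id, json.loads(payload)


# Sample line shaped like the format shown above (values are made up).
utt_id, codes = parse_codec_line("uttid1 [[[1, 2, 3],[2, 3, 4]]]")
print(utt_id, codes)
```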
104
+ ### Inference with Huggingface Transformers
105
+ Inference with the Hugging Face Transformers package is under development.
106
+
107
+ ### Application scenarios
108
+ Running environment
109
+ - Currently, the model has only been tested on Linux x86_64. macOS and Windows have not been tested.
110
+
111
+ Intended usage scenarios
112
+ - This model is suitable for general use, including academic and industrial applications.
113
+ - Speech quantization, codec and tokenization for English utterances
114
+
115
+ ## Evaluation results
116
+
117
+ ### Training configuration
118
+ - Feature info: raw waveform input
119
+ - Train info: Adam, lr 3e-4, batch_size 32, 2 GPUs (Tesla V100), acc_grad 1, 300000 steps, speech_max_length 51200
120
+ - Loss info: L1, L2, discriminative loss
121
+ - Model info: SEANet, Conv, LSTM
122
+ - Train config: config.yaml
123
+ - Model size: 57.83 M parameters
124
+
125
+ ### Experimental Results
126
+
127
+ Test set: Multiple test sets, ViSQOL scores
128
+
129
+ | Test set | 100 tokens/s | 200 tokens/s |
130
+ |:--------:|:--------:|:--------:|
131
+ | Librispeech test-clean | 3.93 | 4.16 |
132
+ | Librispeech test-other | 3.84 | 4.07 |
133
+ | aishell1 | 3.70 | 3.96 |
134
+ | aishell2 test-ios | 3.86 | 4.11 |
135
+ | aishell2 test-mic | 3.81 | 4.07 |
136
+ | wenet test | 3.54 | 3.84 |
137
+ | gigaspeech test | 3.84 | 4.10 |
138
+
139
+
140
+ ### Limitations and bias
141
+ - Only suitable for speech signals; not suitable for music or other audio types
142
+
143
+ ### BibTeX entry and citation info
144
+ ```BibTeX
145
+ @misc{du2023funcodec,
146
+ title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
147
+ author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
148
+ year={2023},
149
+ eprint={2309.07405},
150
+ archivePrefix={arXiv},
151
+ primaryClass={cs.SD}
152
+ }
153
+ ```
config.yaml ADDED
@@ -0,0 +1,200 @@
1
+ config: conf/encodec_lstm_16k_n32_600k_step_rmseg_use_power_ds640.yaml
2
+ print_config: false
3
+ log_level: INFO
4
+ dry_run: false
5
+ iterator_type: sequence
6
+ output_dir: exp/encodec_lstm_16k_n32_600k_step_rmseg_use_power_ds640_raw_en_inhouse_open
7
+ ngpu: 4
8
+ seed: 0
9
+ num_workers: 8
10
+ num_att_plot: 0
11
+ dist_backend: nccl
12
+ dist_init_method: env://
13
+ dist_world_size: null
14
+ dist_rank: null
15
+ local_rank: 0
16
+ dist_master_addr: null
17
+ dist_master_port: null
18
+ dist_launcher: null
19
+ multiprocessing_distributed: true
20
+ unused_parameters: true
21
+ sharded_ddp: false
22
+ cudnn_enabled: true
23
+ cudnn_benchmark: false
24
+ cudnn_deterministic: false
25
+ collect_stats: false
26
+ write_collected_feats: false
27
+ max_epoch: 60
28
+ max_update: 9223372036854775807
29
+ patience: null
30
+ val_scheduler_criterion:
31
+ - valid
32
+ - loss
33
+ early_stopping_criterion:
34
+ - valid
35
+ - loss
36
+ - min
37
+ best_model_criterion:
38
+ - - valid
39
+ - generator_multi_spectral_recon_loss
40
+ - min
41
+ keep_nbest_models: 60
42
+ nbest_averaging_interval: 0
43
+ grad_clip: -1
44
+ grad_clip_type: 2.0
45
+ grad_noise: false
46
+ accum_grad: 1
47
+ no_forward_run: false
48
+ resume: true
49
+ train_dtype: float32
50
+ use_amp: false
51
+ log_interval: 50
52
+ use_tensorboard: true
53
+ use_wandb: false
54
+ wandb_project: null
55
+ wandb_id: null
56
+ wandb_entity: null
57
+ wandb_name: null
58
+ wandb_model_log_interval: -1
59
+ detect_anomaly: false
60
+ pretrain_path: null
61
+ init_param: []
62
+ ignore_init_mismatch: true
63
+ freeze_param: []
64
+ num_iters_per_epoch: 10000
65
+ batch_size: 256
66
+ valid_batch_size: null
67
+ batch_bins: 8000000
68
+ valid_batch_bins: null
69
+ drop_last: true
70
+ train_shape_file:
71
+ - exp/inhouse_open/train/speech_shape
72
+ valid_shape_file:
73
+ - exp/inhouse_open/dev/speech_shape
74
+ batch_type: unsorted
75
+ valid_batch_type: null
76
+ speech_length_min: -1
77
+ speech_length_max: -1
78
+ fold_length:
79
+ - 512
80
+ - 150
81
+ sort_in_batch: descending
82
+ sort_batch: descending
83
+ multiple_iterator: false
84
+ chunk_length: 500
85
+ chunk_shift_ratio: 0.5
86
+ num_cache_chunks: 1024
87
+ dataset_type: small
88
+ dataset_conf: {}
89
+ train_data_file: null
90
+ valid_data_file: null
91
+ train_data_path_and_name_and_type:
92
+ - - dump/inhouse_open/train/wav.scp.pai
93
+ - speech
94
+ - kaldi_ark
95
+ valid_data_path_and_name_and_type:
96
+ - - dump/inhouse_open/dev/wav.scp.pai
97
+ - speech
98
+ - kaldi_ark
99
+ allow_variable_data_keys: false
100
+ max_cache_size: 0.0
101
+ max_cache_fd: 32
102
+ valid_max_cache_size: null
103
+ optim: adam
104
+ optim_conf:
105
+ lr: 0.0003
106
+ betas:
107
+ - 0.5
108
+ - 0.9
109
+ scheduler: null
110
+ scheduler_conf:
111
+ step_size: 8
112
+ gamma: 0.1
113
+ optim2: adam
114
+ optim2_conf:
115
+ lr: 0.0003
116
+ betas:
117
+ - 0.5
118
+ - 0.9
119
+ scheduler2: null
120
+ scheduler2_conf:
121
+ step_size: 8
122
+ gamma: 0.1
123
+ simple_ddp: false
124
+ num_worker_count: 2
125
+ generator_first: false
126
+ input_size: 1
127
+ cmvn_file: null
128
+ disc_grad_clip: -1
129
+ disc_grad_clip_type: 2.0
130
+ gen_train_interval: 1
131
+ disc_train_interval: 1
132
+ stat_flops: false
133
+ use_preprocessor: true
134
+ speech_volume_normalize: null
135
+ speech_rms_normalize: false
136
+ speech_max_length: 51200
137
+ sampling_rate: 16000
138
+ valid_max_length: 51200
139
+ frontend: null
140
+ frontend_conf: {}
141
+ normalize: null
142
+ normalize_conf: {}
143
+ encoder: encodec_seanet_encoder
144
+ encoder_conf:
145
+ ratios:
146
+ - 8
147
+ - 5
148
+ - 4
149
+ - 2
150
+ - 2
151
+ norm: time_group_norm
152
+ causal: false
153
+ quantizer: costume_quantizer
154
+ quantizer_conf:
155
+ codebook_size: 1024
156
+ num_quantizers: 32
157
+ ema_decay: 0.99
158
+ kmeans_init: true
159
+ sampling_rate: 16000
160
+ quantize_dropout: true
161
+ rand_num_quant:
162
+ - 2
163
+ - 4
164
+ - 8
165
+ - 16
166
+ - 32
167
+ use_ddp: true
168
+ encoder_hop_length: 640
169
+ decoder: encodec_seanet_decoder
170
+ decoder_conf:
171
+ ratios:
172
+ - 8
173
+ - 5
174
+ - 4
175
+ - 2
176
+ - 2
177
+ norm: time_group_norm
178
+ causal: false
179
+ model: encodec
180
+ model_conf:
181
+ odim: 128
182
+ multi_spectral_window_powers_of_two:
183
+ - 5
184
+ - 6
185
+ - 7
186
+ - 8
187
+ - 9
188
+ - 10
189
+ target_sample_hz: 16000
190
+ audio_normalize: true
191
+ segment_dur: null
192
+ overlap_ratio: null
193
+ use_power_spec_loss: true
194
+ discriminator: multiple_disc
195
+ discriminator_conf:
196
+ disc_conf_list:
197
+ - filters: 32
198
+ name: encodec_multi_scale_stft_discriminator
199
+ distributed: true
200
+ version: 0.2.0
example/example.wav ADDED
Binary file (161 kB).
 
fig/framework.png ADDED
model.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:11ae436b4a7b7fb089467fac539474787093581f889e433088f2487de4aae9f7
3
+ size 265938769