Sharded Feature Extraction and K-means Application
This folder contains scripts for preparing HUBERT labels from tsv files. The steps are:
- feature extraction
- k-means clustering
- k-means application

Data preparation

*.tsv files contain a list of audio files, where the first line is the root directory and each following line is the subpath of one audio file:

<root-dir>
<audio-path-1>
<audio-path-2>
...
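
For reference, a manifest with this layout can be generated by a small shell snippet. This is only a sketch, not a script from this repo; it assumes the audio is stored as .wav files under a hypothetical ${root_dir}, and the find pattern should be adjusted to match your data:

# Sketch only: build ${tsv_dir}/${split}.tsv with the layout shown above,
# assuming .wav audio under ${root_dir}.
root_dir=/path/to/audio   # hypothetical audio root
{
  echo "${root_dir}"
  (cd "${root_dir}" && find . -name "*.wav" | sed 's|^\./||' | sort)
} > "${tsv_dir}/${split}.tsv"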

Feature extraction

MFCC feature

Suppose the tsv file is at ${tsv_dir}/${split}.tsv. To extract 39-D MFCC+delta+ddelta features for the 1st-iteration HUBERT training, run:

python dump_mfcc_feature.py ${tsv_dir} ${split} ${nshard} ${rank} ${feat_dir}

This would shard the tsv file into ${nshard} shards and extract features for the ${rank}-th shard, where rank is an integer in [0, nshard-1]. Features would be saved at ${feat_dir}/${split}_${rank}_${nshard}.{npy,len}.
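
Since each invocation processes a single shard, one job per rank is typically launched in parallel; the sequential loop below is only a sketch, with nshard=4 as an illustrative value:

# Sketch: extract every shard one after another; in practice each rank is
# usually dispatched as a separate (parallel) job.
nshard=4   # illustrative shard count
for rank in $(seq 0 $((nshard - 1))); do
  python dump_mfcc_feature.py ${tsv_dir} ${split} ${nshard} ${rank} ${feat_dir}
done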

HUBERT feature

To extract features from the ${layer}-th transformer layer of a trained HUBERT model saved at ${ckpt_path}, run:

python dump_hubert_feature.py ${tsv_dir} ${split} ${ckpt_path} ${layer} ${nshard} ${rank} ${feat_dir}

Features would also be saved at ${feat_dir}/${split}_${rank}_${nshard}.{npy,len}.
- if out-of-memory, decrease the chunk size with --max_chunk (see the sketch below)
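
For example, assuming a shard fails with an out-of-memory error, it can be re-run with a smaller chunk size; the value below is only an illustrative guess, not a recommended default:

# Sketch: re-run the failing shard with an explicitly reduced chunk size.
python dump_hubert_feature.py ${tsv_dir} ${split} ${ckpt_path} ${layer} ${nshard} ${rank} ${feat_dir} \
  --max_chunk 800000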

K-means clustering

To fit a k-means model with ${n_clusters} clusters on 10% of the ${split} data, run:

python learn_kmeans.py ${feat_dir} ${split} ${nshard} ${km_path} ${n_clusters} --percent 0.1

This saves the k-means model to ${km_path}.
- set --percent -1 to use all data (see the sketch below)
- more k-means options can be found with the -h flag
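
As an example of the first option, fitting on all of the extracted ${split} features instead of a 10% sample only changes the --percent argument:

# Sketch: fit the k-means model on 100% of the features.
python learn_kmeans.py ${feat_dir} ${split} ${nshard} ${km_path} ${n_clusters} --percent -1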

K-means application

To apply a trained k-means model ${km_path} to obtain labels for ${split}, run:

python dump_km_label.py ${feat_dir} ${split} ${km_path} ${nshard} ${rank} ${lab_dir}

This would extract labels for the ${rank}-th shard out of ${nshard} shards and dump them to ${lab_dir}/${split}_${rank}_${nshard}.km.
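
As with feature extraction, labels are produced per shard, so every rank has to be processed before merging; a sequential sketch:

# Sketch: dump labels for every shard (ranks are usually run as separate jobs).
for rank in $(seq 0 $((nshard - 1))); do
  python dump_km_label.py ${feat_dir} ${split} ${km_path} ${nshard} ${rank} ${lab_dir}
done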

Finally, merge the shards for ${split} by running:

for rank in $(seq 0 $((nshard - 1))); do
  cat $lab_dir/${split}_${rank}_${nshard}.km
done > $lab_dir/${split}.km
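
As an optional sanity check, and assuming the merged file contains one label line per audio entry, it should have exactly one fewer line than the tsv (which begins with the root directory):

# Sketch: compare merged label lines against the number of audio entries in the tsv.
n_audio=$(( $(wc -l < ${tsv_dir}/${split}.tsv) - 1 ))
n_label=$(wc -l < ${lab_dir}/${split}.km)
echo "audio entries: ${n_audio}, label lines: ${n_label}"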