Mudrock committed
Commit 530a7d1 · 1 Parent(s): b05a8d5

Upload 18 files

Files changed (18)
  1. LICENSE +21 -0
  2. README.md +122 -12
  3. cog.yaml +33 -0
  4. config.py +62 -0
  5. create_balanced_list.py +24 -0
  6. create_index.sh +12 -0
  7. create_indexes.py +126 -0
  8. data_processor.py +179 -0
  9. htsat_config.py +122 -0
  10. htsat_utils.py +226 -0
  11. losses.py +23 -0
  12. main.py +502 -0
  13. opt_thres.pkl +3 -0
  14. predict.py +111 -0
  15. requirements.txt +19 -0
  16. sed_model.py +358 -0
  17. utils.py +580 -0
  18. zero_shot_create_vector.py +158 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2021 Knut(Ke) Chen
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,12 +1,122 @@
- ---
- title: Sebas
- emoji: 🔥
- colorFrom: pink
- colorTo: pink
- sdk: streamlit
- sdk_version: 1.15.2
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Zero Shot Audio Source Separation
+
+ ## Introduction
+
+ The code repository for "[Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data](https://arxiv.org/abs/2112.07891)", in AAAI 2022.
+
+ In this paper, we propose a three-component pipeline that allows you to train an audio source separator to separate *any source* from a track. All you need is a mixture audio to separate and a sample of the desired source as a query; the model then separates the specified source from the track. Our model works in a zero-shot setting: it is trained solely on the general audio dataset **AudioSet**, never on a separation dataset. Nevertheless, it achieves a very competitive separation performance (SDR) on the MUSDB18 dataset compared with supervised models, and it generalizes to sources unseen during training.
+
+ The demos and introduction are presented in our [short introduction video](https://youtu.be/8XQ5ZyYRLQM) and [full presentation video](https://youtu.be/RgNwB_pJ7Cw).
+
+ More demos will be presented on [my personal website](https://www.knutchen.com) (now under construction).
+
+ Check out the interactive demo at Replicate <a href="https://replicate.com/retrocirce/zero_shot_audio_source_separation"><img src="https://replicate.com/retrocirce/zero_shot_audio_source_separation/badge"></a> Thanks @[ariel415el](https://github.com/ariel415el) for creating this!
+
+ ![Model Arch](fig/arch.png)
+
+
+
+ ## Main Separation Performance on the MUSDB18 Dataset
+ We achieve a very competitive separation performance (SDR) on the MUSDB18 dataset **without seeing the MUSDB18 training data or specifying source targets**, compared with supervised models.
+
+ Additionally, our model can easily separate many other sources, such as violin, harmonica, guitar, etc. (demos are shown in the video links above).
+
+ <p align="center">
+ <img src="fig/results.png" align="center" alt="MUSDB results" width="50%"/>
+ </p>
+
+ ## Getting Started
+
+ ### Install Requirements
+ ```
+ pip install -r requirements.txt
+ ```
+
+ ### Download and Process the Datasets
+
+ * config.py
+ ```
+ change the variable "dataset_path" to your AudioSet location
+ change classes_num to 527
+ ```
+
+ * [AudioSet](https://research.google.com/audioset/download.html)
+ ```
+ ./create_index.sh
+ // remember to change the paths in the script
+ // more information about this script is at https://github.com/qiuqiangkong/audioset_tagging_cnn
+
+ python main.py save_idc
+ // count the number of samples in each class and save the npy files
+ ```
+
+ * [MUSDB18](https://sigsep.github.io/datasets/musdb.html) - You can directly use [our processed musdb audio files](https://drive.google.com/drive/folders/1VwRnCxp3t2bXUS_MbXiFiggwkkJQEmha?usp=sharing) at a 32000 Hz sample rate. Or set "musdb_path" to the download path and run:
+
+ ```
+ python main.py musdb_process
+ // Note that the training set is a highlight version, while the testing set is the full version
+ ```
+
+
+ ### Set the Configuration File: config.py
+
+ The script *config.py* contains all the configuration options you need to set before running the code.
+
+ Please read the introductory comments in the file and adjust the settings to your setup.
+
+ Most importantly, if you want to train/test the model on AudioSet, you need to set:
+ ```
+ dataset_path = "your processed audioset folder"
+ balanced_data = True
+ sample_rate = 32000
+ hop_size = 320
+ classes_num = 527
+ ```
+
+ ### Train and Evaluation
+
+ #### Train the sound event detection system ST-SED/HTS-AT
+ We have further integrated the ST-SED system into an independent repository, evaluated it on more datasets, improved it substantially, and achieved better performance.
+
+ You can follow [this repo](https://github.com/RetroCirce/HTS-Audio-Transformer) to train and evaluate the sound event detection system ST-SED (also known by the more relevant name HTS-AT); the configuration file for training the model for this separation task should be [htsat_config.py](htsat_config.py).
+
+ For this separation task, if you want to save time, you can also download [the checkpoint](https://drive.google.com/drive/folders/1RouwHsGsMs8n3l_jF8XifWtbPzur_YQS?usp=sharing) directly.
+
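+ If you use the downloaded checkpoint, a minimal sketch is to point `resume_checkpoint` in *htsat_config.py* at the downloaded file (the directory below is an illustrative placeholder; only the filename comes from the released checkpoint):
+ ```
+ # in htsat_config.py
+ resume_checkpoint = "/your/path/model_backup/htsat_audioset_2048d.ckpt"
+ ```
+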
+ #### Train, Evaluate, and Run Inference with the Separation Model
+
+ All scripts are run through main.py:
+ ```
+ Train: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py train
+
+ Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test
+
+ ```
+ We recommend using at least 4 GPUs with more than 20 GB of memory per card. In our training phase, we used 8 NVIDIA V100 (32 GB) GPUs.
+
+ We provide a quick **inference** interface via:
+ ```
+ CUDA_VISIBLE_DEVICES=1 python main.py inference
+ ```
+ With it, you can separate any given source from a track. You need to set "inference_file" and "inference_query" in *config.py*; check the comments there to get started. For inference, we recommend using only one GPU, which is sufficient.
+
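+ For reference, a minimal *config.py* sketch for inference might look like this (the paths are illustrative placeholders; `inference_query` must be a folder whose .wav files all come from the same source, and `test_key` is just the name used for the separated output):
+ ```
+ # config.py (inference-related entries only)
+ inference_file = "/your/path/mixture.wav"    # the track you want to separate
+ inference_query = "/your/path/query"         # folder of query samples from one source
+ test_key = ["violin"]                        # a label for the separated output
+ wave_output_path = "/your/path/wavoutput"    # where the separated track is written
+ ```
+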
+
+ #### Model Checkpoints
+
+ We provide the model checkpoints at this [link](https://drive.google.com/drive/folders/1RouwHsGsMs8n3l_jF8XifWtbPzur_YQS?usp=sharing). Feel free to download and test them.
+
+ ## Citing
+ ```
+ @inproceedings{zsasp-ke2022,
+   author = {Ke Chen* and Xingjian Du* and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
+   title = {Zero-shot Audio Source Separation via Query-based Learning from Weakly-labeled Data},
+   booktitle = {{AAAI} 2022}
+ }
+
+ @inproceedings{htsat-ke2022,
+   author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
+   title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
+   booktitle = {{ICASSP} 2022}
+ }
+ ```
cog.yaml ADDED
@@ -0,0 +1,33 @@
+ build:
+   gpu: true
+   python_version: "3.8"
+   system_packages:
+     - "libgl1-mesa-glx"
+     - "libglib2.0-0"
+     - "libsndfile1-dev"
+     - "ffmpeg"
+   python_packages:
+     - torch==1.9.0
+     - torchmetrics==0.6.0
+     - torchaudio==0.9.0
+     - torchcontrib==0.0.2
+     - torchlibrosa==0.0.9
+     - librosa==0.8.0
+     - pytorch_lightning==1.4.1
+     - museval==0.4.0
+     - noisereduce==2.0.0
+     - numba==0.55.1
+     - numpy==1.19.4
+     - scikit_learn==0.24.0
+     - scipy==1.6.0
+     - soundfile==0.10.3.post1
+     - tensorboard==2.2.0
+     - tqdm==4.55.0
+     - h5py==3.1.0
+     - musdb==0.4.0
+
+ # run:
+ # - pip install open3d
+ # # - "gdown --id 16VnMcF1KJYxN9QId6TClMsZRahHNMW5g"
+
+ predict: "predict.py:Predictor"
config.py ADDED
@@ -0,0 +1,62 @@
+ # Ke Chen
+ # Zero-shot Audio Source Separation via Query-based Learning from Weakly-labeled Data
+ # The configuration file
+
+ # for model training
+ exp_name = "exp_zs_asp_full" # the saved ckpt prefix name of the model
+ workspace = "/home/Research/ZS_ASP/" # the folder of your code
+ dataset_path = "/home/Research/ZS_ASP/data/audioset" # the dataset path
+ index_type = "full_train"
+ idc_path = "/home/Research/ZS_ASP/" # the folder of audioset class count files
+ balanced_data = True
+
+ # train from a checkpoint, or evaluate a single model
+ resume_checkpoint = None
+ # "/home/Research/ZS_ASP/model_backup/zeroshot_asp_full.ckpt"
+
+ loss_type = "mae"
+
+ gather_mode = False
+ debug = False
+
+ classes_num = 527
+ eval_list = [] # leave blank to preserve all classes; otherwise the listed classes are held out of training and used for evaluation
+ # [15, 63, 81, 184, 335, 449, 474, 348, 486, 4] # randomly generated from the 527 classes for held-out evaluation
+
+
+ batch_size = 16 * 8 # batch size per GPU x GPU number, default is 16 x 8 = 128
+ learning_rate = 1e-3 # 3e-4 is also workable
+ max_epoch = 100
+ num_workers = 3
+ lr_scheduler_epoch = [90, 110]
+ latent_dim = 2048
+
+ # for signal processing
+ sample_rate = 32000
+ clip_samples = sample_rate * 10 # AudioSet 10-sec clip
+ segment_frames = 200
+ hop_samples = 320
+ random_seed = 12412 # 444612 1536123 12412
+ random_mode = "one_class" # "no_random, one_class, random, order"; one_class works best
+
+ # for evaluation
+ musdb_path = "/home/Research/ZS_ASP/data/musdb-wav/" # musdb download folder
+ testavg_path = "/home/Research/ZS_ASP/data/musdb30-train-32000fs.npy" # the processed training set (to get the latent query)
+ testset_path = "/home/Research/ZS_ASP/data/musdb-test-32000fs.npy" # the processed testing set (to calculate the performance)
+ test_key = ["vocals", "drums", "bass", "other"] # the four MUSDB tracks, or your named track for other inference
+ test_type = "mix"
+ infer_type = "mean"
+ energy_thres = 0.1
+ wave_output_path = "/home/Research/ZS_ASP/wavoutput" # output folder
+ using_wiener = True # use the Wiener filter or not (default: True)
+ using_whiting = False # use whitening ("whiting") or not (default: False)
+
+ # weight average
+ wa_model_folder = "/home/Research/ZS_ASP/version_3/checkpoints/"
+ wa_model_path = "zs_wa.ckpt"
+
+ # for inference
+ inference_file = "/home/Research/ZS_ASP/data/pagenini.wav" # an audio file to separate
+ inference_query = "/home/Research/ZS_ASP/data/query" # a folder containing all samples for obtaining the query
+ overlap_rate = 0.0 # [0.0, 1.0); 0 disables overlap; 0.5 (50% overlap) is recommended. Overlap increases computation time but improves result quality
create_balanced_list.py ADDED
@@ -0,0 +1,24 @@
+ # Ke Chen
+ import os
+ import sys
+ import config
+ import logging
+ import numpy as np
+
+ from utils import get_balanced_class_list
+
+ def main():
+     train_indexes_hdf5_path = os.path.join(config.dataset_path, "hdf5s", "indexes",
+         "{}.h5".format(config.data_type))
+
+     eval_indexes_hdf5_path = os.path.join(config.dataset_path, "hdf5s", "indexes", "eval.h5")
+     logging.info("Process training data")
+     indexes_per_class = get_balanced_class_list(train_indexes_hdf5_path, random_seed = config.random_seed)
+     np.save("idc_train.npy", indexes_per_class)
+     logging.info("Process testing data")
+     indexes_per_class = get_balanced_class_list(eval_indexes_hdf5_path, random_seed = config.random_seed)
+     np.save("idc_eval.npy", indexes_per_class)
+
+ if __name__ == '__main__':
+     logging.basicConfig(level=logging.INFO)
+     main()
create_index.sh ADDED
@@ -0,0 +1,12 @@
+ #!/bin/bash
+
+ # Balanced training indexes (create_indexes.py requires both paths; adjust them to your setup)
+ python3 create_indexes.py create_indexes --waveforms_hdf5_path="/home/Research/ZS_ASP/data/audioset/hdf5s/waveforms/balanced_train.h5" --indexes_hdf5_path="/home/Research/ZS_ASP/data/audioset/hdf5s/indexes/balanced_train.h5"
+
+ # Unbalanced training indexes
+ for IDX in {00..40}; do
+     echo $IDX
+     python3 create_indexes.py create_indexes --waveforms_hdf5_path="/home/Research/ZS_ASP/data/audioset/hdf5s/waveforms/unbalanced_train/unbalanced_train_part$IDX.h5" --indexes_hdf5_path="/home/Research/ZS_ASP/data/audioset/hdf5s/indexes/unbalanced_train/unbalanced_train_part$IDX.h5"
+ done
+
+ # Combine balanced and unbalanced training indexes into a full training indexes hdf5
+ python3 create_indexes.py combine_full_indexes --indexes_hdf5s_dir="/home/Research/ZS_ASP/data/audioset/hdf5s/indexes" --full_indexes_hdf5_path="/home/Research/ZS_ASP/data/audioset/hdf5s/indexes/full_train.h5"
create_indexes.py ADDED
@@ -0,0 +1,126 @@
1
+ import numpy as np
2
+ import argparse
3
+ import csv
4
+ import os
5
+ import glob
6
+ import datetime
7
+ import time
8
+ import logging
9
+ import h5py
10
+ import librosa
11
+
12
+ from utils import create_folder, get_sub_filepaths
13
+ import config
14
+
15
+
16
+ def create_indexes(args):
17
+ """Create indexes a for dataloader to read for training. When users have
18
+ a new task and their own data, they need to create similar indexes. The
19
+ indexes contain meta information of "where to find the data for training".
20
+ """
21
+
22
+ # Arguments & parameters
23
+ waveforms_hdf5_path = args.waveforms_hdf5_path
24
+ indexes_hdf5_path = args.indexes_hdf5_path
25
+
26
+ # Paths
27
+ create_folder(os.path.dirname(indexes_hdf5_path))
28
+
29
+ with h5py.File(waveforms_hdf5_path, 'r') as hr:
30
+ with h5py.File(indexes_hdf5_path, 'w') as hw:
31
+ audios_num = len(hr['audio_name'])
32
+ hw.create_dataset('audio_name', data=hr['audio_name'][:], dtype='S20')
33
+ hw.create_dataset('target', data=hr['target'][:], dtype=np.bool)
34
+ hw.create_dataset('hdf5_path', data=[waveforms_hdf5_path.encode()] * audios_num, dtype='S200')
35
+ hw.create_dataset('index_in_hdf5', data=np.arange(audios_num), dtype=np.int32)
36
+
37
+ print('Write to {}'.format(indexes_hdf5_path))
38
+
39
+
40
+ def combine_full_indexes(args):
41
+ """Combine all balanced and unbalanced indexes hdf5s to a single hdf5. This
42
+ combined indexes hdf5 is used for training with full data (~20k balanced
43
+ audio clips + ~1.9m unbalanced audio clips).
44
+ """
45
+
46
+ # Arguments & parameters
47
+ indexes_hdf5s_dir = args.indexes_hdf5s_dir
48
+ full_indexes_hdf5_path = args.full_indexes_hdf5_path
49
+
50
+ classes_num = config.classes_num
51
+
52
+ # Paths
53
+ paths = get_sub_filepaths(indexes_hdf5s_dir)
54
+ paths = [path for path in paths if (
55
+ 'train' in path and 'full_train' not in path and 'mini' not in path)]
56
+
57
+ print('Total {} hdf5 to combine.'.format(len(paths)))
58
+
59
+ with h5py.File(full_indexes_hdf5_path, 'w') as full_hf:
60
+ full_hf.create_dataset(
61
+ name='audio_name',
62
+ shape=(0,),
63
+ maxshape=(None,),
64
+ dtype='S20')
65
+
66
+ full_hf.create_dataset(
67
+ name='target',
68
+ shape=(0, classes_num),
69
+ maxshape=(None, classes_num),
70
+ dtype=np.bool)
71
+
72
+ full_hf.create_dataset(
73
+ name='hdf5_path',
74
+ shape=(0,),
75
+ maxshape=(None,),
76
+ dtype='S200')
77
+
78
+ full_hf.create_dataset(
79
+ name='index_in_hdf5',
80
+ shape=(0,),
81
+ maxshape=(None,),
82
+ dtype=np.int32)
83
+
84
+ for path in paths:
85
+ with h5py.File(path, 'r') as part_hf:
86
+ print(path)
87
+ n = len(full_hf['audio_name'][:])
88
+ new_n = n + len(part_hf['audio_name'][:])
89
+
90
+ full_hf['audio_name'].resize((new_n,))
91
+ full_hf['audio_name'][n : new_n] = part_hf['audio_name'][:]
92
+
93
+ full_hf['target'].resize((new_n, classes_num))
94
+ full_hf['target'][n : new_n] = part_hf['target'][:]
95
+
96
+ full_hf['hdf5_path'].resize((new_n,))
97
+ full_hf['hdf5_path'][n : new_n] = part_hf['hdf5_path'][:]
98
+
99
+ full_hf['index_in_hdf5'].resize((new_n,))
100
+ full_hf['index_in_hdf5'][n : new_n] = part_hf['index_in_hdf5'][:]
101
+
102
+ print('Write combined full hdf5 to {}'.format(full_indexes_hdf5_path))
103
+
104
+
105
+ if __name__ == '__main__':
106
+ parser = argparse.ArgumentParser()
107
+ subparsers = parser.add_subparsers(dest='mode')
108
+
109
+ parser_create_indexes = subparsers.add_parser('create_indexes')
110
+ parser_create_indexes.add_argument('--waveforms_hdf5_path', type=str, required=True, help='Path of packed waveforms hdf5.')
111
+ parser_create_indexes.add_argument('--indexes_hdf5_path', type=str, required=True, help='Path to write out indexes hdf5.')
112
+
113
+ parser_combine_full_indexes = subparsers.add_parser('combine_full_indexes')
114
+ parser_combine_full_indexes.add_argument('--indexes_hdf5s_dir', type=str, required=True, help='Directory containing indexes hdf5s to be combined.')
115
+ parser_combine_full_indexes.add_argument('--full_indexes_hdf5_path', type=str, required=True, help='Path to write out full indexes hdf5 file.')
116
+
117
+ args = parser.parse_args()
118
+
119
+ if args.mode == 'create_indexes':
120
+ create_indexes(args)
121
+
122
+ elif args.mode == 'combine_full_indexes':
123
+ combine_full_indexes(args)
124
+
125
+ else:
126
+ raise Exception('Incorrect arguments!')
data_processor.py ADDED
@@ -0,0 +1,179 @@
1
+ # Ke Chen
2
3
+ # Zero-shot Audio Source Separation via Query-based Learning from Weakly-labeled Data
4
+ # The dataset classes
5
+
6
+ import numpy as np
7
+ import torch
8
+ import logging
9
+ import os
10
+ import sys
11
+ import h5py
12
+ import csv
13
+ import time
14
+ import random
15
+ import json
16
+ from datetime import datetime
17
+ from utils import int16_to_float32
18
+
19
+ from torch.utils.data import Dataset, Sampler
20
+
21
+ # output the dict["index"].key form to save the memory in multi-GPU training
22
+ def reverse_dict(data_path, sed_path, output_dir):
23
+ # filename
24
+ waveform_dir = os.path.join(output_dir, "audioset_eval_waveform_balanced.h5")
25
+ sed_dir = os.path.join(output_dir, "audioset_eval_sed_balanced.h5")
26
+ # load data
27
+ logging.info("Write Data...............")
28
+ h_data = h5py.File(data_path, "r")
29
+ h_sed = h5py.File(sed_path, "r")
30
+ audio_num = len(h_data["waveform"])
31
+ assert len(h_data["waveform"]) == len(h_sed["sed_vector"]), "waveform and sed should be in the same length"
32
+ with h5py.File(waveform_dir, 'w') as hw:
33
+ for i in range(audio_num):
34
+ hw.create_dataset(str(i), data=int16_to_float32(h_data['waveform'][i]), dtype=np.float32)
35
+ logging.info("Write Data Succeed...............")
36
+ logging.info("Write Sed...............")
37
+ with h5py.File(sed_dir, 'w') as hw:
38
+ for i in range(audio_num):
39
+ hw.create_dataset(str(i), data=h_sed['sed_vector'][i], dtype=np.float32)
40
+ logging.info("Write Sed Succeed...............")
41
+
42
+ # A dataset for handling musdb
43
+ class MusdbDataset(Dataset):
44
+ def __init__(self, tracks):
45
+ self.tracks = tracks
46
+ self.dataset_len = len(tracks)
47
+ def __getitem__(self, index):
48
+ """Load waveform and target of an audio clip.
49
+ Args:
50
+ index: the index number
51
+ Return:
52
+ track: [mixture + n_sources, n_samples]
53
+ """
54
+ return self.tracks[index]
55
+ def __len__(self):
56
+ return self.dataset_len
57
+
58
+ class InferDataset(Dataset):
59
+ def __init__(self, tracks):
60
+ self.tracks = tracks
61
+ self.dataset_len = len(tracks)
62
+ def __getitem__(self, index):
63
+ """Load waveform and target of an audio clip.
64
+ Args:
65
+ index: the index number
66
+ Return:
67
+ track: [mixture + n_sources, n_samples]
68
+ """
69
+ return self.tracks[index]
70
+ def __len__(self):
71
+ return self.dataset_len
72
+
73
+ # polished LGSPDataset, the main dataset for procssing the audioset files
74
+ class LGSPDataset(Dataset):
75
+ def __init__(self, index_path, idc, config, factor = 3, eval_mode = False):
76
+ self.index_path = index_path
77
+ self.fp = h5py.File(index_path, "r")
78
+ self.config = config
79
+ self.idc = idc
80
+ self.factor = factor
81
+ self.classes_num = self.config.classes_num
82
+ self.eval_mode = eval_mode
83
+ self.total_size = int(len(self.fp["audio_name"]) * self.factor)
84
+ self.generate_queue()
85
+ logging.info("total dataset size: %d" %(self.total_size))
86
+ logging.info("class num: %d" %(self.classes_num))
87
+
88
+ def generate_queue(self):
89
+ self.queue = []
90
+ self.class_queue = []
91
+ if self.config.debug:
92
+ self.total_size = 1000
93
+ if self.config.balanced_data:
94
+ while len(self.queue) < self.total_size * 2:
95
+ if self.eval_mode:
96
+ if len(self.config.eval_list) == 0:
97
+ class_set = [*range(self.classes_num)]
98
+ else:
99
+ class_set = self.config.eval_list[:]
100
+ else:
101
+ class_set = [*range(self.classes_num)]
102
+ class_set = list(set(class_set) - set(self.config.eval_list))
103
+ random.shuffle(class_set)
104
+ self.queue += [self.idc[d][random.randint(0, len(self.idc[d]) - 1)] for d in class_set]
105
+ self.class_queue += class_set[:]
106
+ self.queue = self.queue[:self.total_size * 2]
107
+ self.class_queue = self.class_queue[:self.total_size * 2]
108
+ self.queue = [[self.queue[i],self.queue[i+1]] for i in range(0, self.total_size * 2, 2)]
109
+ self.class_queue = [[self.class_queue[i],self.class_queue[i+1]] for i in range(0, self.total_size * 2, 2)]
110
+ assert len(self.queue) == self.total_size, "generate data error!!"
111
+ else:
112
+ if self.eval_mode:
113
+ if len(self.config.eval_list) == 0:
114
+ class_set = [*range(self.classes_num)]
115
+ else:
116
+ class_set = self.config.eval_list[:]
117
+ else:
118
+ class_set = [*range(self.classes_num)]
119
+ class_set = list(set(class_set) - set(self.config.eval_list))
120
+ self.class_queue = random.choices(class_set, k = self.total_size * 2)
121
+ self.queue = [self.idc[d][random.randint(0, len(self.idc[d]) - 1)] for d in self.class_queue]
122
+ self.queue = [[self.queue[i],self.queue[i+1]] for i in range(0, self.total_size * 2, 2)]
123
+ self.class_queue = [[self.class_queue[i],self.class_queue[i+1]] for i in range(0, self.total_size * 2, 2)]
124
+ assert len(self.queue) == self.total_size, "generate data error!!"
125
+ logging.info("queue regenerated:%s" %(self.queue[-5:]))
126
+
127
+ def __getitem__(self, index):
128
+ """Load waveform and target of an audio clip.
129
+ Args:
130
+ index: the index number
131
+ Return: {
132
+ "audio_name_1": str,
133
+ "waveform_1": (clip_samples,),
134
+ "class_id_1": int,
135
+ "audio_name_2": str,
136
+ "waveform_2": (clip_samples,),
137
+ "class_id_2": int,
138
+ ...
139
+ "check_num": int
140
+ }
141
+ """
142
+ # put the right index here!!!
143
+ data_dict = {}
144
+ for k in range(2):
145
+ s_index = self.queue[index][k]
146
+ target = self.class_queue[index][k]
147
+ audio_name = self.fp["audio_name"][s_index].decode()
148
+ hdf5_path = self.fp["hdf5_path"][s_index].decode().replace("/home/tiger/DB/knut/data/audioset", self.config.dataset_path)
149
+ r_idx = self.fp["index_in_hdf5"][s_index]
150
+ with h5py.File(hdf5_path, "r") as f:
151
+ waveform = int16_to_float32(f["waveform"][r_idx])
152
+ data_dict["audio_name_" + str(k+1)] = audio_name
153
+ data_dict["waveform_" + str(k+1)] = waveform
154
+ data_dict["class_id_" + str(k+1)] = target
155
+ data_dict["check_num"] = str(self.queue[-5:])
156
+ return data_dict
157
+
158
+ def __len__(self):
159
+ return self.total_size
160
+
161
+ # only for test
162
+ class TestDataset(Dataset):
163
+ def __init__(self, dataset_size):
164
+ print("init")
165
+ self.dataset_size = dataset_size
166
+ self.base_num = 100
167
+ self.dicts = [(self.base_num + 2 * i, self.base_num + 2 * i + 1) for i in range(self.dataset_size)]
168
+
169
+ def get_new_list(self):
170
+ self.base_num = random.randint(0,10)
171
+ print("base num changed:", self.base_num)
172
+ self.dicts = [(self.base_num + 2 * i, self.base_num + 2 * i + 1) for i in range(self.dataset_size)]
173
+
174
+ def __getitem__(self, index):
175
+ return self.dicts[index]
176
+
177
+ def __len__(self):
178
+ return self.dataset_size
179
+
htsat_config.py ADDED
@@ -0,0 +1,122 @@
1
+ # Ke Chen
2
3
+ # Zero-shot Audio Source Separation via Query-based Learning from Weakly-labeled Data
4
+ # The configuration file of ST-SED model or HTS-AT model
5
+
6
+ exp_name = "exp_htsat_2048d" # the saved ckpt prefix name of the model
7
+ workspace = "/home/kechen/Research/HTSAT" # the folder of your code
8
+ dataset_path = "/home/Research/audioset" # the dataset path
9
+ desed_folder = "/home/Research/DESED" # the desed file
10
+
11
+ dataset_type = "audioset"
12
+
13
+ loss_type = "clip_bce"
14
+ balanced_data = True
15
+
16
+ resume_checkpoint = "/home/kechen/Research/Latent_ASP/model_backup/htsat_audioset_2048d.ckpt"
17
+
18
+ esc_fold = 0 # just for esc dataset, select the fold you need for evaluation and (+1) validation
19
+
20
+ debug = False
21
+
22
+ random_seed = 970131 # 19970318 970131 12412 127777 1009 34047
23
+ batch_size = 32 * 4 # batch size per GPU x GPU number , default is 32 x 4 = 128
24
+ learning_rate = 1e-3 # 1e-4 also workable
25
+ max_epoch = 100
26
+ num_workers = 3
27
+
28
+ lr_scheduler_epoch = [10,20,30]
29
+ lr_rate = [0.02, 0.05, 0.1]
30
+
31
+ # these data preparation optimizations do not bring many improvements, so deprecated
32
+ enable_token_label = False # token label
33
+ class_map_path = "class_hier_map.npy"
34
+ class_filter = None
35
+ retrieval_index = [15382, 9202, 130, 17618, 17157, 17516, 16356, 6165, 13992, 9238, 5550, 5733, 1914, 1600, 3450, 13735, 11108, 3762,
36
+ 9840, 11318, 8131, 4429, 16748, 4992, 16783, 12691, 4945, 8779, 2805, 9418, 2797, 14357, 5603, 212, 3852, 12666, 1338, 10269, 2388, 8260, 4293, 14454, 7677, 11253, 5060, 14938, 8840, 4542, 2627, 16336, 8992, 15496, 11140, 446, 6126, 10691, 8624, 10127, 9068, 16710, 10155, 14358, 7567, 5695, 2354, 8057, 17635, 133, 16183, 14535, 7248, 4560, 14429, 2463, 10773, 113, 2462, 9223, 4929, 14274, 4716, 17307, 4617, 2132, 11083, 1039, 1403, 9621, 13936, 2229, 2875, 17840, 9359, 13311, 9790, 13288, 4750, 17052, 8260, 14900]
37
+ token_label_range = [0.2,0.6]
38
+ enable_time_shift = False # shift time
39
+ enable_label_enhance = False # enhance hierarchical label
40
+ enable_repeat_mode = False # repeat the spectrogram / reshape the spectrogram
41
+
42
+
43
+
44
+ # for model's design
45
+ enable_tscam = True # enable the token-semantic layer
46
+
47
+ # for signal processing
48
+ sample_rate = 32000 # 16000 for scv2, 32000 for audioset and esc-50
49
+ clip_samples = sample_rate * 10 # audio_set 10-sec clip
50
+ window_size = 1024
51
+ hop_size = 320 # 160 for scv2, 320 for audioset and esc-50
52
+ mel_bins = 64
53
+ fmin = 50
54
+ fmax = 14000
55
+ shift_max = int(clip_samples * 0.5)
56
+
57
+ # for data collection
58
+ classes_num = 527 # esc: 50 | audioset: 527 | scv2: 35
59
+ patch_size = (25, 4) # deprecated
60
+ crop_size = None # int(clip_samples * 0.5) deprecated
61
+
62
+ # for htsat hyperparamater
63
+ htsat_window_size = 8
64
+ htsat_spec_size = 256
65
+ htsat_patch_size = 4
66
+ htsat_stride = (4, 4)
67
+ htsat_num_head = [4,8,16,32]
68
+ htsat_dim = 256 # for 2048-d model
69
+ htsat_depth = [2,2,6,2]
70
+
71
+ swin_pretrain_path = None
72
+ # "/home/Research/model_backup/pretrain/swin_tiny_c24_patch4_window8_256.pth"
73
+
74
+ # Some Deprecated Optimization in the model design, check the model code for details
75
+ htsat_attn_heatmap = False
76
+ htsat_hier_output = False
77
+ htsat_use_max = False
78
+
79
+
80
+ # no use here
81
+ ensemble_checkpoints = []
82
+ ensemble_strides = []
83
+
84
+
85
+ # weight average folder
86
+ wa_folder = "/home/version_0/checkpoints/"
87
+ # weight average output filename
88
+ wa_model_path = "HTSAT_AudioSet_Saved_x.ckpt"
89
+
90
+ esm_model_pathes = [
91
+ "/home/Research/model_backup/AudioSet/HTSAT_AudioSet_Saved_1.ckpt",
92
+ "/home/Research/model_backup/AudioSet/HTSAT_AudioSet_Saved_2.ckpt",
93
+ "/home/Research/model_backup/AudioSet/HTSAT_AudioSet_Saved_3.ckpt",
94
+ "/home/Research/model_backup/AudioSet/HTSAT_AudioSet_Saved_4.ckpt",
95
+ "/home/Research/model_backup/AudioSet/HTSAT_AudioSet_Saved_5.ckpt",
96
+ "/home/Research/model_backup/AudioSet/HTSAT_AudioSet_Saved_6.ckpt"
97
+ ]
98
+
99
+ # for framewise localization
100
+ heatmap_dir = "/home/Research/heatmap_output"
101
+ test_file = "htsat-test-ensemble"
102
+ fl_local = False # indicate if we need to use this dataset for the framewise detection
103
+ fl_dataset = "/home/Research/desed/desed_eval.npy"
104
+ fl_class_num = [
105
+ "Speech", "Frying", "Dishes", "Running_water",
106
+ "Blender", "Electric_shaver_toothbrush", "Alarm_bell_ringing",
107
+ "Cat", "Dog", "Vacuum_cleaner"
108
+ ]
109
+
110
+ # map 527 classes into 10 classes
111
+ fl_audioset_mapping = [
112
+ [0,1,2,3,4,5,6,7],
113
+ [366, 367, 368],
114
+ [364],
115
+ [288, 289, 290, 291, 292, 293, 294, 295, 296, 297],
116
+ [369],
117
+ [382],
118
+ [310, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402],
119
+ [81, 82, 83, 84, 85],
120
+ [74, 75, 76, 77, 78, 79],
121
+ [377]
122
+ ]
htsat_utils.py ADDED
@@ -0,0 +1,226 @@
1
+ # Ke Chen
2
3
+ # HTS-AT: A HIERARCHICAL TOKEN-SEMANTIC AUDIO TRANSFORMER FOR SOUND CLASSIFICATION AND DETECTION
4
+ # Some Useful Common Methods
5
+
6
+ import numpy as np
7
+ import torch
8
+ import torch.nn as nn
9
+ from torch import Tensor
10
+ from typing import Optional
11
+ import logging
12
+ import os
13
+ import sys
14
+ import h5py
15
+ import csv
16
+ import time
17
+ import json
18
+ import museval
19
+ import librosa
20
+ from datetime import datetime
21
+ from tqdm import tqdm
22
+ from scipy import stats
23
+ import torch.nn as nn
24
+ import torch.nn.functional as F
25
+
26
+
27
+ # import from https://github.com/Alibaba-MIIL/ASL/blob/main/src/loss_functions/losses.py
28
+ class AsymmetricLoss(nn.Module):
29
+ def __init__(self, gamma_neg=4, gamma_pos=1, clip=0.05, eps=1e-8, disable_torch_grad_focal_loss=True):
30
+ super(AsymmetricLoss, self).__init__()
31
+
32
+ self.gamma_neg = gamma_neg
33
+ self.gamma_pos = gamma_pos
34
+ self.clip = clip
35
+ self.disable_torch_grad_focal_loss = disable_torch_grad_focal_loss
36
+ self.eps = eps
37
+
38
+ def forward(self, x, y):
39
+ """"
40
+ Parameters
41
+ ----------
42
+ x: input logits
43
+ y: targets (multi-label binarized vector)
44
+ """
45
+
46
+ # Calculating Probabilities
47
+ # x_sigmoid = torch.sigmoid(x)
48
+ x_sigmoid = x # without sigmoid since it has been computed
49
+ xs_pos = x_sigmoid
50
+ xs_neg = 1 - x_sigmoid
51
+
52
+ # Asymmetric Clipping
53
+ if self.clip is not None and self.clip > 0:
54
+ xs_neg = (xs_neg + self.clip).clamp(max=1)
55
+
56
+ # Basic CE calculation
57
+ los_pos = y * torch.log(xs_pos.clamp(min=self.eps))
58
+ los_neg = (1 - y) * torch.log(xs_neg.clamp(min=self.eps))
59
+ loss = los_pos + los_neg
60
+
61
+ # Asymmetric Focusing
62
+ if self.gamma_neg > 0 or self.gamma_pos > 0:
63
+ if self.disable_torch_grad_focal_loss:
64
+ torch.set_grad_enabled(False)
65
+ pt0 = xs_pos * y
66
+ pt1 = xs_neg * (1 - y) # pt = p if t > 0 else 1-p
67
+ pt = pt0 + pt1
68
+ one_sided_gamma = self.gamma_pos * y + self.gamma_neg * (1 - y)
69
+ one_sided_w = torch.pow(1 - pt, one_sided_gamma)
70
+ if self.disable_torch_grad_focal_loss:
71
+ torch.set_grad_enabled(True)
72
+ loss *= one_sided_w
73
+
74
+ return -loss.mean()
75
+
76
+
77
+ def get_mix_lambda(mixup_alpha, batch_size):
78
+ mixup_lambdas = [np.random.beta(mixup_alpha, mixup_alpha, 1)[0] for _ in range(batch_size)]
79
+ return np.array(mixup_lambdas).astype(np.float32)
80
+
81
+ def create_folder(fd):
82
+ if not os.path.exists(fd):
83
+ os.makedirs(fd)
84
+
85
+ def dump_config(config, filename, include_time = False):
86
+ save_time = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
87
+ config_json = {}
88
+ for key in dir(config):
89
+ if not key.startswith("_"):
90
+ config_json[key] = eval("config." + key)
91
+ if include_time:
92
+ filename = filename + "_" + save_time
93
+ with open(filename + ".json", "w") as f:
94
+ json.dump(config_json, f ,indent=4)
95
+
96
+ def int16_to_float32(x):
97
+ return (x / 32767.).astype(np.float32)
98
+
99
+ def float32_to_int16(x):
100
+ x = np.clip(x, a_min = -1., a_max = 1.)
101
+ return (x * 32767.).astype(np.int16)
102
+
103
+
104
+ # index for each class
105
+ def process_idc(index_path, classes_num, filename):
106
+ # load data
107
+ logging.info("Load Data...............")
108
+ idc = [[] for _ in range(classes_num)]
109
+ with h5py.File(index_path, "r") as f:
110
+ for i in tqdm(range(len(f["target"]))):
111
+ t_class = np.where(f["target"][i])[0]
112
+ for t in t_class:
113
+ idc[t].append(i)
114
+ print(idc)
115
+ np.save(filename, idc)
116
+ logging.info("Load Data Succeed...............")
117
+
118
+ def clip_bce(pred, target):
119
+ """Binary crossentropy loss.
120
+ """
121
+ return F.binary_cross_entropy(pred, target)
122
+ # return F.binary_cross_entropy(pred, target)
123
+
124
+
125
+ def clip_ce(pred, target):
126
+ return F.cross_entropy(pred, target)
127
+
128
+ def d_prime(auc):
129
+ d_prime = stats.norm().ppf(auc) * np.sqrt(2.0)
130
+ return d_prime
131
+
132
+
133
+ def get_loss_func(loss_type):
134
+ if loss_type == 'clip_bce':
135
+ return clip_bce
136
+ if loss_type == 'clip_ce':
137
+ return clip_ce
138
+ if loss_type == 'asl_loss':
139
+ loss_func = AsymmetricLoss(gamma_neg=4, gamma_pos=0,clip=0.05)
140
+ return loss_func
141
+
142
+ def do_mixup_label(x):
143
+ out = torch.logical_or(x, torch.flip(x, dims = [0])).float()
144
+ return out
145
+
146
+ def do_mixup(x, mixup_lambda):
147
+ """
148
+ Args:
149
+ x: (batch_size , ...)
150
+ mixup_lambda: (batch_size,)
151
+
152
+ Returns:
153
+ out: (batch_size, ...)
154
+ """
155
+ out = (x.transpose(0,-1) * mixup_lambda + torch.flip(x, dims = [0]).transpose(0,-1) * (1 - mixup_lambda)).transpose(0,-1)
156
+ return out
157
+
158
+ def interpolate(x, ratio):
159
+ """Interpolate data in time domain. This is used to compensate the
160
+ resolution reduction in downsampling of a CNN.
161
+
162
+ Args:
163
+ x: (batch_size, time_steps, classes_num)
164
+ ratio: int, ratio to interpolate
165
+
166
+ Returns:
167
+ upsampled: (batch_size, time_steps * ratio, classes_num)
168
+ """
169
+ (batch_size, time_steps, classes_num) = x.shape
170
+ upsampled = x[:, :, None, :].repeat(1, 1, ratio, 1)
171
+ upsampled = upsampled.reshape(batch_size, time_steps * ratio, classes_num)
172
+ return upsampled
173
+
174
+
175
+ def pad_framewise_output(framewise_output, frames_num):
176
+ """Pad framewise_output to the same length as input frames. The pad value
177
+ is the same as the value of the last frame.
178
+
179
+ Args:
180
+ framewise_output: (batch_size, frames_num, classes_num)
181
+ frames_num: int, number of frames to pad
182
+
183
+ Outputs:
184
+ output: (batch_size, frames_num, classes_num)
185
+ """
186
+ pad = framewise_output[:, -1 :, :].repeat(1, frames_num - framewise_output.shape[1], 1)
187
+ """tensor for padding"""
188
+
189
+ output = torch.cat((framewise_output, pad), dim=1)
190
+ """(batch_size, frames_num, classes_num)"""
191
+
192
+ return output
193
+
194
+ # set the audio into the format that can be fed into the model
195
+ # resample -> convert to mono -> output the audio
196
+ # track [n_sample, n_channel]
197
+ def prepprocess_audio(track, ofs, rfs, mono_type = "mix"):
198
+ if track.shape[-1] > 1:
199
+ # stereo
200
+ if mono_type == "mix":
201
+ track = np.transpose(track, (1,0))
202
+ track = librosa.to_mono(track)
203
+ elif mono_type == "left":
204
+ track = track[:, 0]
205
+ elif mono_type == "right":
206
+ track = track[:, 1]
207
+ else:
208
+ track = track[:, 0]
209
+ # track [n_sample]
210
+ if ofs != rfs:
211
+ track = librosa.resample(track, ofs, rfs)
212
+ return track
213
+
214
+ def init_hier_head(class_map, num_class):
215
+ class_map = np.load(class_map, allow_pickle = True)
216
+
217
+ head_weight = torch.zeros(num_class,num_class).float()
218
+ head_bias = torch.zeros(num_class).float()
219
+
220
+ for i in range(len(class_map)):
221
+ for d in class_map[i][1]:
222
+ head_weight[d][i] = 1.0
223
+ for d in class_map[i][2]:
224
+ head_weight[d][i] = 1.0 / len(class_map[i][2])
225
+ head_weight[i][i] = 1.0
226
+ return head_weight, head_bias
losses.py ADDED
@@ -0,0 +1,23 @@
+ import torch.nn.functional as F
+ import torch
+ import numpy as np
+
+
+ def mae(input, target):
+     return torch.mean(torch.abs(input - target))
+
+
+ def logmae_wav(model, output_dict, target):
+     loss = torch.log10(torch.clamp(mae(output_dict['wav'], target), 1e-8, np.inf))
+     return loss
+
+
+ def get_loss_func(loss_type):
+     if loss_type == 'logmae_wav':
+         return logmae_wav
+
+     elif loss_type == 'mae':
+         return mae
+
+     else:
+         raise Exception('Incorrect loss_type!')
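For context, a minimal usage sketch of this loss factory (assuming losses.py is on the import path; the tensors below are illustrative, not project data):
```
from losses import get_loss_func
import torch

loss_fn = get_loss_func("mae")        # matches config.loss_type = "mae"
pred = torch.zeros(2, 32000)          # dummy separated waveform batch
target = torch.ones(2, 32000)         # dummy target waveform batch
print(loss_fn(pred, target).item())   # mean absolute error -> 1.0
```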
main.py ADDED
@@ -0,0 +1,502 @@
1
+ # Ke Chen
2
3
+ # Zero-shot Audio Source Separation via Query-based Learning from Weakly-labeled Data
4
+ # The Main Script
5
+
6
+ import os
7
+ # this is to prevent the SDR calculation from occupying all CPUs
8
+ os.environ["OMP_NUM_THREADS"] = "4"
9
+ os.environ["OPENBLAS_NUM_THREADS"] = "4"
10
+ os.environ["MKL_NUM_THREADS"] = "6"
11
+ os.environ["VECLIB_MAXIMUM_THREADS"] = "4"
12
+ os.environ["NUMEXPR_NUM_THREADS"] = "6"
13
+
14
+ import sys
15
+ import librosa
16
+ import numpy as np
17
+ import argparse
18
+ import logging
19
+
20
+ import torch
21
+ from torch.utils.data import DataLoader
22
+ from torch.utils.data.distributed import DistributedSampler
23
+
24
+ from utils import collect_fn, dump_config, create_folder, prepprocess_audio
25
+ import musdb
26
+
27
+ from models.asp_model import ZeroShotASP, SeparatorModel, AutoTaggingWarpper, WhitingWarpper
28
+ from data_processor import LGSPDataset, MusdbDataset
29
+ import config
30
+ import htsat_config
31
+ from models.htsat import HTSAT_Swin_Transformer
32
+ from sed_model import SEDWrapper
33
+
34
+ import pytorch_lightning as pl
35
+ from pytorch_lightning.callbacks import ModelCheckpoint
36
+
37
+ from htsat_utils import process_idc
38
+
39
+ import warnings
40
+ warnings.filterwarnings("ignore")
41
+
42
+
43
+
44
+ class data_prep(pl.LightningDataModule):
45
+ def __init__(self, train_dataset, eval_dataset, device_num, config):
46
+ super().__init__()
47
+ self.train_dataset = train_dataset
48
+ self.eval_dataset = eval_dataset
49
+ self.device_num = device_num
50
+ self.config = config
51
+
52
+ def train_dataloader(self):
53
+ train_sampler = DistributedSampler(self.train_dataset, shuffle = False) if self.device_num > 1 else None
54
+ train_loader = DataLoader(
55
+ dataset = self.train_dataset,
56
+ num_workers = config.num_workers,
57
+ batch_size = config.batch_size // self.device_num,
58
+ shuffle = False,
59
+ sampler = train_sampler,
60
+ collate_fn = collect_fn
61
+ )
62
+ return train_loader
63
+ def val_dataloader(self):
64
+ eval_sampler = DistributedSampler(self.eval_dataset, shuffle = False) if self.device_num > 1 else None
65
+ eval_loader = DataLoader(
66
+ dataset = self.eval_dataset,
67
+ num_workers = config.num_workers,
68
+ batch_size = config.batch_size // self.device_num,
69
+ shuffle = False,
70
+ sampler = eval_sampler,
71
+ collate_fn = collect_fn
72
+ )
73
+ return eval_loader
74
+ def test_dataloader(self):
75
+ test_sampler = DistributedSampler(self.eval_dataset, shuffle = False) if self.device_num > 1 else None
76
+ test_loader = DataLoader(
77
+ dataset = self.eval_dataset,
78
+ num_workers = config.num_workers,
79
+ batch_size = config.batch_size // self.device_num,
80
+ shuffle = False,
81
+ sampler = test_sampler,
82
+ collate_fn = collect_fn
83
+ )
84
+ return test_loader
85
+
86
+ def save_idc():
87
+ train_index_path = os.path.join(config.dataset_path, "hdf5s", "indexes", config.index_type + ".h5")
88
+ eval_index_path = os.path.join(config.dataset_path,"hdf5s", "indexes", "eval.h5")
89
+ process_idc(train_index_path, config.classes_num, config.index_type + "_idc.npy")
90
+ process_idc(eval_index_path, config.classes_num, "eval_idc.npy")
91
+
92
+ # Resample the musdb tracks from the original 44100 Hz to a 32000 Hz sample rate
93
+ def process_musdb():
94
+ # use musdb as testset
95
+ test_data = musdb.DB(
96
+ root = config.musdb_path,
97
+ download = False,
98
+ subsets = "test",
99
+ is_wav = True
100
+ )
101
+ print(len(test_data.tracks))
102
+ mus_tracks = []
103
+ # in musdb, all fs is the same (44100)
104
+ orig_fs = test_data.tracks[0].rate
105
+ print(orig_fs)
106
+ for track in test_data.tracks:
107
+ temp = {}
108
+ mixture = prepprocess_audio(
109
+ track.audio,
110
+ orig_fs, config.sample_rate,
111
+ config.test_type
112
+ )
113
+ temp["mixture" ]= mixture
114
+ for dickey in config.test_key:
115
+ source = prepprocess_audio(
116
+ track.targets[dickey].audio,
117
+ orig_fs, config.sample_rate,
118
+ config.test_type
119
+ )
120
+ temp[dickey] = source
121
+ print(track.audio.shape, len(temp.keys()), temp["mixture"].shape)
122
+ mus_tracks.append(temp)
123
+ print(len(mus_tracks))
124
+ # save the file to npy
125
+ np.save("musdb-32000fs.npy", mus_tracks)
126
+
127
+ # weight averaging is performed over the given folder
128
+ # It outputs a single model checkpoint that averages the weights of all models in the folder
129
+ def weight_average():
130
+ model_ckpt = []
131
+ model_files = os.listdir(config.wa_model_folder)
132
+ wa_ckpt = {
133
+ "state_dict": {}
134
+ }
135
+
136
+ for model_file in model_files:
137
+ model_file = os.path.join(config.wa_model_folder, model_file)  # config.py defines wa_model_folder (there is no esm_model_folder)
138
+ model_ckpt.append(torch.load(model_file, map_location="cpu")["state_dict"])
139
+ keys = model_ckpt[0].keys()
140
+ for key in keys:
141
+ model_ckpt_key = torch.cat([d[key].float().unsqueeze(0) for d in model_ckpt])
142
+ model_ckpt_key = torch.mean(model_ckpt_key, dim = 0)
143
+ assert model_ckpt_key.shape == model_ckpt[0][key].shape, "the shape is unmatched " + model_ckpt_key.shape + " " + model_ckpt[0][key].shape
144
+ wa_ckpt["state_dict"][key] = model_ckpt_key
145
+ torch.save(wa_ckpt, config.wa_model_path)
146
+
147
+
148
+ # use the model to quickly separate a track given a query
149
+ # it requires four variables in config.py:
150
+ # inference_file: the track you want to separate
151
+ # inference_query: a **folder** containing all samples from the same source
152
+ # test_key: ["name"] indicate the source name (just a name for final output, no other functions)
153
+ # wave_output_path: the output folder
154
+
155
+ # make sure the query folder contain the samples from the same source
156
+ # each time, the model is able to separate one source from the track
157
+ # if you want to separate multiple sources, you need to change the query folder or write a script to help you do that
158
+ def inference():
159
+ # set exp settings
160
+ device_name = "cuda" if torch.cuda.is_available() else "cpu"
161
+ device = torch.device("cuda")
162
+ assert config.test_key is not None, "there should be a separate key"
163
+ create_folder(config.wave_output_path)
164
+ test_track, fs = librosa.load(config.inference_file, sr = None)
165
+ test_track = test_track[:,None]
166
+ print(test_track.shape)
167
+ print(fs)
168
+ # convert the track into 32000 Hz sample rate
169
+ test_track = prepprocess_audio(
170
+ test_track,
171
+ fs, config.sample_rate,
172
+ config.test_type
173
+ )
174
+ test_tracks = []
175
+ temp = [test_track]
176
+ for dickey in config.test_key:
177
+ temp.append(test_track)
178
+ temp = np.array(temp)
179
+ test_tracks.append(temp)
180
+ dataset = MusdbDataset(tracks = test_tracks) # the action is similar to musdbdataset, reuse it
181
+ loader = DataLoader(
182
+ dataset = dataset,
183
+ num_workers = 1,
184
+ batch_size = 1,
185
+ shuffle = False
186
+ )
187
+ # obtain the samples for query
188
+ queries = []
189
+ for query_file in os.listdir(config.inference_query):
190
+ f_path = os.path.join(config.inference_query, query_file)
191
+ if query_file.endswith(".wav"):
192
+ temp_q, fs = librosa.load(f_path, sr = None)
193
+ temp_q = temp_q[:, None]
194
+ temp_q = prepprocess_audio(
195
+ temp_q,
196
+ fs, config.sample_rate,
197
+ config.test_type
198
+ )
199
+ temp = [temp_q]
200
+ for dickey in config.test_key:
201
+ temp.append(temp_q)
202
+ temp = np.array(temp)
203
+ queries.append(temp)
204
+
205
+ assert config.resume_checkpoint is not None, "there should be a saved model when inferring"
206
+
207
+ sed_model = HTSAT_Swin_Transformer(
208
+ spec_size=htsat_config.htsat_spec_size,
209
+ patch_size=htsat_config.htsat_patch_size,
210
+ in_chans=1,
211
+ num_classes=htsat_config.classes_num,
212
+ window_size=htsat_config.htsat_window_size,
213
+ config = htsat_config,
214
+ depths = htsat_config.htsat_depth,
215
+ embed_dim = htsat_config.htsat_dim,
216
+ patch_stride=htsat_config.htsat_stride,
217
+ num_heads=htsat_config.htsat_num_head
218
+ )
219
+ at_model = SEDWrapper(
220
+ sed_model = sed_model,
221
+ config = htsat_config,
222
+ dataset = None
223
+ )
224
+ ckpt = torch.load(htsat_config.resume_checkpoint, map_location="cpu")
225
+ at_model.load_state_dict(ckpt["state_dict"])
226
+
227
+ trainer = pl.Trainer(
228
+ gpus = 1
229
+ )
230
+ avg_at = None
231
+ # obtain the latent embedding as query
232
+ if config.infer_type == "mean":
233
+ avg_dataset = MusdbDataset(tracks = queries)
234
+ avg_loader = DataLoader(
235
+ dataset = avg_dataset,
236
+ num_workers = 1,
237
+ batch_size = 1,
238
+ shuffle = False
239
+ )
240
+ at_wrapper = AutoTaggingWarpper(
241
+ at_model = at_model,
242
+ config = config,
243
+ target_keys = config.test_key
244
+ )
245
+ trainer.test(at_wrapper, test_dataloaders = avg_loader)
246
+ avg_at = at_wrapper.avg_at
247
+
248
+ # import the separation model
249
+ model = ZeroShotASP(
250
+ channels = 1, config = config,
251
+ at_model = at_model,
252
+ dataset = dataset
253
+ )
254
+ # resume checkpoint
255
+ ckpt = torch.load(config.resume_checkpoint, map_location="cpu")
256
+ model.load_state_dict(ckpt["state_dict"], strict= False)
257
+ exp_model = SeparatorModel(
258
+ model = model,
259
+ config = config,
260
+ target_keys = config.test_key,
261
+ avg_at = avg_at,
262
+ using_wiener = False,
263
+ calc_sdr = False,
264
+ output_wav = True
265
+ )
266
+ trainer.test(exp_model, test_dataloaders = loader)
267
+
268
+ # test the separation model, mainly in musdb
269
+ def test():
270
+ # set exp settings
271
+ device_name = "cuda" if torch.cuda.is_available() else "cpu"
272
+ device = torch.device("cuda")
273
+ assert config.test_key is not None, "there should be a separate key"
274
+ create_folder(config.wave_output_path)
275
+ # use musdb as testset
276
+ test_data = np.load(config.testset_path, allow_pickle = True)
277
+ print(len(test_data))
278
+ mus_tracks = []
279
+ # in musdb, all fs is the same (44100)
280
+ # load the dataset
281
+ for track in test_data:
282
+ temp = []
283
+ mixture = track["mixture"]
284
+ temp.append(mixture)
285
+ for dickey in config.test_key:
286
+ source = track[dickey]
287
+ temp.append(source)
288
+ temp = np.array(temp)
289
+ print(temp.shape)
290
+ mus_tracks.append(temp)
291
+ print(len(mus_tracks))
292
+ dataset = MusdbDataset(tracks = mus_tracks)
293
+ loader = DataLoader(
294
+ dataset = dataset,
295
+ num_workers = 1,
296
+ batch_size = 1,
297
+ shuffle = False
298
+ )
299
+ assert config.resume_checkpoint is not None, "there should be a saved model when inferring"
300
+
301
+ sed_model = HTSAT_Swin_Transformer(
302
+ spec_size=htsat_config.htsat_spec_size,
303
+ patch_size=htsat_config.htsat_patch_size,
304
+ in_chans=1,
305
+ num_classes=htsat_config.classes_num,
306
+ window_size=htsat_config.htsat_window_size,
307
+ config = htsat_config,
308
+ depths = htsat_config.htsat_depth,
309
+ embed_dim = htsat_config.htsat_dim,
310
+ patch_stride=htsat_config.htsat_stride,
311
+ num_heads=htsat_config.htsat_num_head
312
+ )
313
+ at_model = SEDWrapper(
314
+ sed_model = sed_model,
315
+ config = htsat_config,
316
+ dataset = None
317
+ )
318
+ ckpt = torch.load(htsat_config.resume_checkpoint, map_location="cpu")
319
+ at_model.load_state_dict(ckpt["state_dict"])
320
+ trainer = pl.Trainer(
321
+ gpus = 1
322
+ )
323
+ avg_at = None
324
+ # obtain the query of four stems from the training set
325
+ if config.infer_type == "mean":
326
+ avg_data = np.load(config.testavg_path, allow_pickle = True)[:90]
327
+ print(len(avg_data))
328
+ avgmus_tracks = []
329
+ # in musdb, all fs is the same (44100)
330
+ # load the dataset
331
+ for track in avg_data:
332
+ temp = []
333
+ mixture = track["mixture"]
334
+ temp.append(mixture)
335
+ for dickey in config.test_key:
336
+ source = track[dickey]
337
+ temp.append(source)
338
+ temp = np.array(temp)
339
+ print(temp.shape)
340
+ avgmus_tracks.append(temp)
341
+ print(len(avgmus_tracks))
342
+ avg_dataset = MusdbDataset(tracks = avgmus_tracks)
343
+ avg_loader = DataLoader(
344
+ dataset = avg_dataset,
345
+ num_workers = 1,
346
+ batch_size = 1,
347
+ shuffle = False
348
+ )
349
+ at_wrapper = AutoTaggingWarpper(
350
+ at_model = at_model,
351
+ config = config,
352
+ target_keys = config.test_key
353
+ )
354
+ trainer.test(at_wrapper, test_dataloaders = avg_loader)
355
+ avg_at = at_wrapper.avg_at
356
+
357
+ model = ZeroShotASP(
358
+ channels = 1, config = config,
359
+ at_model = at_model,
360
+ dataset = dataset
361
+ )
362
+ ckpt = torch.load(config.resume_checkpoint, map_location="cpu")
363
+ model.load_state_dict(ckpt["state_dict"], strict= False)
364
+ exp_model = SeparatorModel(
365
+ model = model,
366
+ config = config,
367
+ target_keys = config.test_key,
368
+ avg_at = avg_at,
369
+ using_wiener = config.using_wiener
370
+ )
371
+ trainer.test(exp_model, test_dataloaders = loader)
372
+
373
+ def train():
374
+ # set exp settings
375
+ # device_name = "cuda" if torch.cuda.is_available() else "cpu"
376
+ # device = torch.device("cuda")
377
+
378
+ device_num = torch.cuda.device_count()
379
+ print("each batch size:", config.batch_size // device_num)
380
+
381
+ train_index_path = os.path.join(config.dataset_path, "hdf5s","indexes", config.index_type + ".h5")
382
+ train_idc = np.load(os.path.join(config.idc_path, config.index_type + "_idc.npy"), allow_pickle = True)
383
+
384
+ eval_index_path = os.path.join(config.dataset_path,"hdf5s", "indexes", "eval.h5")
385
+ eval_idc = np.load(os.path.join(config.idc_path, "eval_idc.npy"), allow_pickle = True)
386
+
387
+ # set exp folder
388
+ exp_dir = os.path.join(config.workspace, "results", config.exp_name)
389
+ checkpoint_dir = os.path.join(config.workspace, "results", config.exp_name, "checkpoint")
390
+
391
+ if not config.debug:
392
+ create_folder(os.path.join(config.workspace, "results"))
393
+ create_folder(exp_dir)
394
+ create_folder(checkpoint_dir)
395
+ dump_config(config, os.path.join(exp_dir, config.exp_name), False)
396
+
397
+ # load data
398
+ # import dataset LGSPDataset (latent general source separation) and sampler
399
+ dataset = LGSPDataset(
400
+ index_path = train_index_path,
401
+ idc = train_idc,
402
+ config = config,
403
+ factor = 0.05,
404
+ eval_mode = False
405
+ )
406
+ eval_dataset = LGSPDataset(
407
+ index_path = eval_index_path,
408
+ idc = eval_idc,
409
+ config = config,
410
+ factor = 0.05,
411
+ eval_mode = True
412
+ )
413
+
414
+ audioset_data = data_prep(train_dataset=dataset,eval_dataset=eval_dataset,device_num=device_num, config=config)
415
+ checkpoint_callback = ModelCheckpoint(
416
+ monitor = "mixture_sdr",
417
+ filename='l-{epoch:d}-{mixture_sdr:.3f}-{clean_sdr:.3f}-{silence_sdr:.3f}',
418
+ save_top_k = 10,
419
+ mode = "max"
420
+ )
421
+ # infer at model
422
+ sed_model = HTSAT_Swin_Transformer(
423
+ spec_size=htsat_config.htsat_spec_size,
424
+ patch_size=htsat_config.htsat_patch_size,
425
+ in_chans=1,
426
+ num_classes=htsat_config.classes_num,
427
+ window_size=htsat_config.htsat_window_size,
428
+ config = htsat_config,
429
+ depths = htsat_config.htsat_depth,
430
+ embed_dim = htsat_config.htsat_dim,
431
+ patch_stride=htsat_config.htsat_stride,
432
+ num_heads=htsat_config.htsat_num_head
433
+ )
434
+ at_model = SEDWrapper(
435
+ sed_model = sed_model,
436
+ config = htsat_config,
437
+ dataset = None
438
+ )
439
+ # load the checkpoint
440
+ ckpt = torch.load(htsat_config.resume_checkpoint, map_location="cpu")
441
+ at_model.load_state_dict(ckpt["state_dict"])
442
+
443
+ trainer = pl.Trainer(
444
+ deterministic=True,
445
+ default_root_dir = checkpoint_dir,
446
+ gpus = device_num,
447
+ val_check_interval = 0.2,
448
+ # check_val_every_n_epoch = 1,
449
+ max_epochs = config.max_epoch,
450
+ auto_lr_find = True,
451
+ sync_batchnorm = True,
452
+ callbacks = [checkpoint_callback],
453
+ accelerator = "ddp" if device_num > 1 else None,
454
+ resume_from_checkpoint = None, #config.resume_checkpoint,
455
+ replace_sampler_ddp = False,
456
+ gradient_clip_val=1.0,
457
+ num_sanity_val_steps = 0,
458
+ )
459
+ model = ZeroShotASP(
460
+ channels = 1, config = config,
461
+ at_model = at_model,
462
+ dataset = dataset
463
+ )
464
+ if config.resume_checkpoint is not None:
465
+ ckpt = torch.load(config.resume_checkpoint, map_location="cpu")
466
+ model.load_state_dict(ckpt["state_dict"])
467
+ # trainer.test(model, datamodule = audioset_data)
468
+ trainer.fit(model, audioset_data)
469
+
470
+ def main():
471
+ parser = argparse.ArgumentParser(description="latent general source separation parser")
472
+ subparsers = parser.add_subparsers(dest = "mode")
473
+ parser_train = subparsers.add_parser("train")
474
+ parser_test = subparsers.add_parser("test")
475
+ parser_musdb = subparsers.add_parser("musdb_process")
476
+ parser_saveidc = subparsers.add_parser("save_idc")
477
+ parser_wa = subparsers.add_parser("weight_average")
478
+ parser_infer = subparsers.add_parser("inference")
479
+ args = parser.parse_args()
480
+ # default settings
481
+ logging.basicConfig(level=logging.INFO)
482
+ pl.utilities.seed.seed_everything(seed = config.random_seed)
483
+
484
+ if args.mode == "train":
485
+ train()
486
+ elif args.mode == "test":
487
+ test()
488
+ elif args.mode == "musdb_process":
489
+ process_musdb()
490
+ elif args.mode == "weight_average":
491
+ weight_average()
492
+ elif args.mode == "save_idc":
493
+ save_idc()
494
+ elif args.mode == "inference":
495
+ inference()
496
+ else:
497
+ raise Exception("Error Mode!")
498
+
499
+
500
+ if __name__ == '__main__':
501
+ main()
502
+
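For reference, a minimal sketch of how the subcommands above might be chained end to end. The stage order (build indexes, train, average weights, evaluate) is an assumption about a typical run, not the authors' prescribed workflow, and it presumes config.py already points at a prepared AudioSet workspace:

# hypothetical driver: each stage shells out to the CLI defined in main() above
import subprocess

stages = ["save_idc", "train", "weight_average", "test"]   # assumed order; adjust to your needs
for stage in stages:
    # e.g. "python main.py train" runs the train() routine above
    subprocess.run(["python", "main.py", stage], check=True)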
opt_thres.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f40e97a4946e70392576d4f4c171596bcb6243883a54f48aaa9ae5b86c0976c
3
+ size 13585
predict.py ADDED
@@ -0,0 +1,111 @@
1
+ import os
2
+ import types
3
+
4
+ import librosa
5
+ import numpy as np
6
+ import pytorch_lightning as pl
7
+ import torch
8
+ from torch.utils.data import DataLoader
9
+
10
+ import htsat_config
11
+ from cog import BasePredictor, Input, Path
12
+ from data_processor import MusdbDataset
13
+ from models.asp_model import AutoTaggingWarpper, SeparatorModel, ZeroShotASP
14
+ from models.htsat import HTSAT_Swin_Transformer
15
+ from sed_model import SEDWrapper
16
+ from utils import prepprocess_audio
17
+
18
+ def get_inference_configs():
19
+ config = types.SimpleNamespace()
20
+ config.ckpt_path = "pretrained/zeroshot_asp_full.ckpt"
21
+ config.sed_ckpt_path = "pretrained/htsat_audioset_2048d.ckpt"
22
+ config.wave_output_path = "predict_outputs"
23
+ config.test_key = "query_name"
24
+ config.test_type = "mix"
25
+ config.loss_type = "mae"
26
+ config.infer_type = "mean"
27
+ config.sample_rate = 32000
28
+ config.segment_frames = 200
29
+ config.hop_samples = 320
30
+ config.energy_thres = 0.1
31
+ config.using_whiting = False
32
+ config.latent_dim = 2048
33
+ config.classes_num = 527
34
+ config.overlap_rate = 0.5
35
+ config.num_workers = 1
36
+
37
+ return config
38
+
39
+ def load_models(config):
40
+ sed_model = HTSAT_Swin_Transformer(
41
+ spec_size=htsat_config.htsat_spec_size,
42
+ patch_size=htsat_config.htsat_patch_size,
43
+ in_chans=1,
44
+ num_classes=htsat_config.classes_num,
45
+ window_size=htsat_config.htsat_window_size,
46
+ config=htsat_config,
47
+ depths=htsat_config.htsat_depth,
48
+ embed_dim=htsat_config.htsat_dim,
49
+ patch_stride=htsat_config.htsat_stride,
50
+ num_heads=htsat_config.htsat_num_head,
51
+ )
52
+ at_model = SEDWrapper(sed_model=sed_model, config=htsat_config, dataset=None)
53
+
54
+ ckpt = torch.load(config.sed_ckpt_path, map_location="cpu")
55
+ at_model.load_state_dict(ckpt["state_dict"])
56
+
57
+ at_wrapper = AutoTaggingWarpper(
58
+ at_model=at_model, config=config, target_keys=[config.test_key]
59
+ )
60
+
61
+ asp_model = ZeroShotASP(channels=1, config=config, at_model=at_model, dataset=None)
62
+ ckpt = torch.load(config.ckpt_path, map_location="cpu")
63
+ asp_model.load_state_dict(ckpt["state_dict"], strict=False)
64
+
65
+ return at_wrapper, asp_model
66
+
67
+ def get_dataloader_from_sound_file(sound_file_path, config):
68
+ signal, sampling_rate = librosa.load(str(sound_file_path), sr=None)
69
+ signal = prepprocess_audio(
70
+ signal[:, None], sampling_rate, config.sample_rate, config.test_type
71
+ )
72
+ signal = np.array([signal, signal]) # Duplicate signal for later use
73
+ dataset = MusdbDataset(tracks=[signal])
74
+ data_loader = DataLoader(dataset, num_workers=config.num_workers, batch_size=1, shuffle=False)
75
+ return data_loader
76
+
77
+
78
+ class Predictor(BasePredictor):
79
+ def setup(self):
80
+ self.config = get_inference_configs()
81
+ os.makedirs(self.config.wave_output_path, exist_ok=True)
82
+ self.at_wrapper, self.asp_model = load_models(self.config)
83
+
84
+ def predict(
85
+ self,
86
+ mix_file: Path = Input(description="Reference sound to extract source from"),
87
+ query_file: Path = Input(description="Query sound to be searched and extracted from mix"),
88
+ ) -> Path:
89
+ ref_loader = get_dataloader_from_sound_file(str(mix_file), self.config)
90
+
91
+ query_loader = get_dataloader_from_sound_file(str(query_file), self.config)
92
+
93
+ trainer = pl.Trainer(gpus=1)
94
+ trainer.test(self.at_wrapper, test_dataloaders=query_loader)
95
+ avg_at = self.at_wrapper.avg_at
96
+
97
+ exp_model = SeparatorModel(
98
+ model=self.asp_model,
99
+ config=self.config,
100
+ target_keys=[self.config.test_key],
101
+ avg_at=avg_at,
102
+ using_wiener=False,
103
+ calc_sdr=False,
104
+ output_wav=True,
105
+ )
106
+ trainer.test(exp_model, test_dataloaders=ref_loader)
107
+
108
+ prediction_path = os.path.join(
109
+ self.config.wave_output_path, f"0_{self.config.test_key}_pred_(0.0).wav"
110
+ )
111
+ return prediction_path
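For local experimentation outside Cog, here is a minimal usage sketch for the Predictor above. It assumes a GPU is available (the class builds pl.Trainer(gpus=1)), that the two checkpoints named in get_inference_configs() have been downloaded, and that mixture.wav / query.wav are placeholder file names:

from predict import Predictor

predictor = Predictor()
predictor.setup()                       # loads the HTS-AT tagger and the ZeroShotASP separator
out_wav = predictor.predict(
    mix_file="mixture.wav",             # placeholder: track to separate
    query_file="query.wav",             # placeholder: sample of the source to extract
)
print("separated source written to", out_wav)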
requirements.txt ADDED
@@ -0,0 +1,19 @@
1
+ h5py==3.6.0
2
+ hydra-core>=1.0
3
+ librosa==0.8.1
4
+ musdb==0.4.0
5
+ museval==0.4.0
6
+ noisereduce==2.0.0
7
+ numba==0.55.1
8
+ numpy==1.21.5
9
+ omegaconf>=2.0.0
10
+ pytorch_lightning==1.5.9
11
+ scikit_learn==1.0.2
12
+ scipy==1.7.3
13
+ soundfile==0.10.3.post1
14
+ tensorboard==2.8.0
15
+ torch==1.10.2
16
+ torchaudio==0.10.2
17
+ torchcontrib==0.0.2
18
+ torchlibrosa==0.0.9
19
+ tqdm==4.62.3
sed_model.py ADDED
@@ -0,0 +1,358 @@
1
+ # Ke Chen
2
3
+ # HTS-AT: A HIERARCHICAL TOKEN-SEMANTIC AUDIO TRANSFORMER FOR SOUND CLASSIFICATION AND DETECTION
4
+ # The Model Training Wrapper
5
+ import numpy as np
6
+ import librosa
7
+ import os
8
+ import sys
9
+ import math
10
+ import bisect
11
+ import pickle
12
+ from numpy.lib.function_base import average
13
+ from sklearn import metrics
14
+ import soundfile as sf
15
+ from sklearn.metrics import average_precision_score, roc_auc_score, accuracy_score
16
+
17
+ import tensorboard
18
+ import torch
19
+ import torchaudio
20
+ import torch.nn as nn
21
+ import torch.nn.functional as F
22
+ import torch.utils.checkpoint as cp
23
+ import torch.optim as optim
24
+ from torch.nn.parameter import Parameter
25
+ import torch.distributed as dist
26
+ from torchlibrosa.stft import STFT, ISTFT, magphase
27
+ import pytorch_lightning as pl
28
+ from htsat_utils import do_mixup, get_mix_lambda, do_mixup_label, get_loss_func, d_prime
29
+ import random
30
+
31
+ from torchcontrib.optim import SWA
32
+
33
+
34
+ class SEDWrapper(pl.LightningModule):
35
+ def __init__(self, sed_model, config, dataset):
36
+ super().__init__()
37
+ self.sed_model = sed_model
38
+ self.config = config
39
+ self.dataset = dataset
40
+ self.loss_func = get_loss_func(config.loss_type)
41
+
42
+ def evaluate_metric(self, pred, ans):
43
+ ap = []
44
+ if self.config.dataset_type == "audioset":
45
+ mAP = np.mean(average_precision_score(ans, pred, average = None))
46
+ mAUC = np.mean(roc_auc_score(ans, pred, average = None))
47
+ dprime = d_prime(mAUC)
48
+ return {"mAP": mAP, "mAUC": mAUC, "dprime": dprime}
49
+ else:
50
+ acc = accuracy_score(ans, np.argmax(pred, 1))
51
+ return {"acc": acc}
52
+ def forward(self, x, mix_lambda = None):
53
+ output_dict = self.sed_model(x, mix_lambda)
54
+ return output_dict["clipwise_output"], output_dict["framewise_output"]
55
+
56
+ def inference(self, x):
57
+ self.device_type = next(self.parameters()).device
58
+ self.eval()
59
+ x = torch.from_numpy(x).float().to(self.device_type)
60
+ output_dict = self.sed_model(x, None, True)
61
+ for key in output_dict.keys():
62
+ output_dict[key] = output_dict[key].detach().cpu().numpy()
63
+ return output_dict
64
+
65
+ def training_step(self, batch, batch_idx):
66
+ self.device_type = next(self.parameters()).device
67
+ mix_lambda = torch.from_numpy(get_mix_lambda(0.5, len(batch["waveform"]))).to(self.device_type)
68
+ # Another choice: also mixup the target, but AudioSet is not a perfectly labeled dataset,
69
+ # so "adding noise" might be better than purely "mix"
70
+ # batch["target"] = do_mixup_label(batch["target"])
71
+ # batch["target"] = do_mixup(batch["target"], mix_lambda)
72
+
73
+ pred, _ = self(batch["waveform"], mix_lambda)
74
+ loss = self.loss_func(pred, batch["target"])
75
+ self.log("loss", loss, on_epoch= True, prog_bar=True)
76
+ return loss
77
+ def training_epoch_end(self, outputs):
78
+ # Change: SWA, deprecated
79
+ # for opt in self.trainer.optimizers:
80
+ # if not type(opt) is SWA:
81
+ # continue
82
+ # opt.swap_swa_sgd()
83
+ self.dataset.generate_queue()
84
+
85
+
86
+ def validation_step(self, batch, batch_idx):
87
+ pred, _ = self(batch["waveform"])
88
+ return [pred.detach(), batch["target"].detach()]
89
+
90
+ def validation_epoch_end(self, validation_step_outputs):
91
+ self.device_type = next(self.parameters()).device
92
+ pred = torch.cat([d[0] for d in validation_step_outputs], dim = 0)
93
+ target = torch.cat([d[1] for d in validation_step_outputs], dim = 0)
94
+ gather_pred = [torch.zeros_like(pred) for _ in range(dist.get_world_size())]
95
+ gather_target = [torch.zeros_like(target) for _ in range(dist.get_world_size())]
96
+ dist.barrier()
97
+ if self.config.dataset_type == "audioset":
98
+ metric_dict = {
99
+ "mAP": 0.,
100
+ "mAUC": 0.,
101
+ "dprime": 0.
102
+ }
103
+ else:
104
+ metric_dict = {
105
+ "acc":0.
106
+ }
107
+ dist.all_gather(gather_pred, pred)
108
+ dist.all_gather(gather_target, target)
109
+ if dist.get_rank() == 0:
110
+ gather_pred = torch.cat(gather_pred, dim = 0).cpu().numpy()
111
+ gather_target = torch.cat(gather_target, dim = 0).cpu().numpy()
112
+ if self.config.dataset_type == "scv2":
113
+ gather_target = np.argmax(gather_target, 1)
114
+ metric_dict = self.evaluate_metric(gather_pred, gather_target)
115
+ print(self.device_type, dist.get_world_size(), metric_dict, flush = True)
116
+
117
+ if self.config.dataset_type == "audioset":
118
+ self.log("mAP", metric_dict["mAP"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
119
+ self.log("mAUC", metric_dict["mAUC"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
120
+ self.log("dprime", metric_dict["dprime"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
121
+ else:
122
+ self.log("acc", metric_dict["acc"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
123
+ dist.barrier()
124
+
125
+ def time_shifting(self, x, shift_len):
126
+ shift_len = int(shift_len)
127
+ new_sample = torch.cat([x[:, shift_len:], x[:, :shift_len]], axis = 1)
128
+ return new_sample
129
+
130
+ def test_step(self, batch, batch_idx):
131
+ self.device_type = next(self.parameters()).device
132
+ preds = []
133
+ # cancel the time-shifting augmentation to speed up inference
134
+ shift_num = 1
135
+ for i in range(shift_num):
136
+ pred, pred_map = self(batch["waveform"])
137
+ preds.append(pred.unsqueeze(0))
138
+ batch["waveform"] = self.time_shifting(batch["waveform"], shift_len = 100 * (i + 1))
139
+ preds = torch.cat(preds, dim=0)
140
+ pred = preds.mean(dim = 0)
141
+ if self.config.fl_local:
142
+ return [
143
+ pred.detach().cpu().numpy(),
144
+ pred_map.detach().cpu().numpy(),
145
+ batch["audio_name"],
146
+ batch["real_len"].cpu().numpy()
147
+ ]
148
+ else:
149
+ return [pred.detach(), batch["target"].detach()]
150
+
151
+ def test_epoch_end(self, test_step_outputs):
152
+ self.device_type = next(self.parameters()).device
153
+ if self.config.fl_local:
154
+ pred = np.concatenate([d[0] for d in test_step_outputs], axis = 0)
155
+ pred_map = np.concatenate([d[1] for d in test_step_outputs], axis = 0)
156
+ audio_name = np.concatenate([d[2] for d in test_step_outputs], axis = 0)
157
+ real_len = np.concatenate([d[3] for d in test_step_outputs], axis = 0)
158
+ heatmap_file = os.path.join(self.config.heatmap_dir, self.config.test_file + "_" + str(self.device_type) + ".npy")
159
+ save_npy = [
160
+ {
161
+ "audio_name": audio_name[i],
162
+ "heatmap": pred_map[i],
163
+ "pred": pred[i],
164
+ "real_len":real_len[i]
165
+ }
166
+ for i in range(len(pred))
167
+ ]
168
+ np.save(heatmap_file, save_npy)
169
+ else:
170
+ self.device_type = next(self.parameters()).device
171
+ pred = torch.cat([d[0] for d in test_step_outputs], dim = 0)
172
+ target = torch.cat([d[1] for d in test_step_outputs], dim = 0)
173
+ gather_pred = [torch.zeros_like(pred) for _ in range(dist.get_world_size())]
174
+ gather_target = [torch.zeros_like(target) for _ in range(dist.get_world_size())]
175
+ dist.barrier()
176
+ if self.config.dataset_type == "audioset":
177
+ metric_dict = {
178
+ "mAP": 0.,
179
+ "mAUC": 0.,
180
+ "dprime": 0.
181
+ }
182
+ else:
183
+ metric_dict = {
184
+ "acc":0.
185
+ }
186
+ dist.all_gather(gather_pred, pred)
187
+ dist.all_gather(gather_target, target)
188
+ if dist.get_rank() == 0:
189
+ gather_pred = torch.cat(gather_pred, dim = 0).cpu().numpy()
190
+ gather_target = torch.cat(gather_target, dim = 0).cpu().numpy()
191
+ if self.config.dataset_type == "scv2":
192
+ gather_target = np.argmax(gather_target, 1)
193
+ metric_dict = self.evaluate_metric(gather_pred, gather_target)
194
+ print(self.device_type, dist.get_world_size(), metric_dict, flush = True)
195
+ if self.config.dataset_type == "audioset":
196
+ self.log("mAP", metric_dict["mAP"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
197
+ self.log("mAUC", metric_dict["mAUC"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
198
+ self.log("dprime", metric_dict["dprime"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
199
+ else:
200
+ self.log("acc", metric_dict["acc"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
201
+ dist.barrier()
202
+
203
+
204
+ def configure_optimizers(self):
205
+ optimizer = optim.AdamW(
206
+ filter(lambda p: p.requires_grad, self.parameters()),
207
+ lr = self.config.learning_rate,
208
+ betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0.05,
209
+ )
210
+ # Change: SWA, deprecated
211
+ # optimizer = SWA(optimizer, swa_start=10, swa_freq=5)
212
+ def lr_foo(epoch):
213
+ if epoch < 3:
214
+ # warm up lr
215
+ lr_scale = self.config.lr_rate[epoch]
216
+ else:
217
+ # staged decay schedule after warm-up
218
+ lr_pos = int(-1 - bisect.bisect_left(self.config.lr_scheduler_epoch, epoch))
219
+ if lr_pos < -3:
220
+ lr_scale = max(self.config.lr_rate[0] * (0.98 ** epoch), 0.03 )
221
+ else:
222
+ lr_scale = self.config.lr_rate[lr_pos]
223
+ return lr_scale
224
+ scheduler = optim.lr_scheduler.LambdaLR(
225
+ optimizer,
226
+ lr_lambda=lr_foo
227
+ )
228
+
229
+ return [optimizer], [scheduler]
230
+
231
+
232
+
233
+ class Ensemble_SEDWrapper(pl.LightningModule):
234
+ def __init__(self, sed_models, config, dataset):
235
+ super().__init__()
236
+
237
+ self.sed_models = nn.ModuleList(sed_models)
238
+ self.config = config
239
+ self.dataset = dataset
240
+
241
+ def evaluate_metric(self, pred, ans):
242
+ if self.config.dataset_type == "audioset":
243
+ mAP = np.mean(average_precision_score(ans, pred, average = None))
244
+ mAUC = np.mean(roc_auc_score(ans, pred, average = None))
245
+ dprime = d_prime(mAUC)
246
+ return {"mAP": mAP, "mAUC": mAUC, "dprime": dprime}
247
+ else:
248
+ acc = accuracy_score(ans, np.argmax(pred, 1))
249
+ return {"acc": acc}
250
+
251
+ def forward(self, x, sed_index, mix_lambda = None):
252
+ self.sed_models[sed_index].eval()
253
+ preds = []
254
+ pred_maps = []
255
+ # cancel the time-shifting augmentation to speed up inference
256
+ shift_num = 1
257
+ for i in range(shift_num):
258
+ pred, pred_map = self.sed_models[sed_index](x)
259
+ pred_maps.append(pred_map.unsqueeze(0))
260
+ preds.append(pred.unsqueeze(0))
261
+ x = self.time_shifting(x, shift_len = 100 * (i + 1))
262
+ preds = torch.cat(preds, dim=0)
263
+ pred_maps = torch.cat(pred_maps, dim = 0)
264
+ pred = preds.mean(dim = 0)
265
+ pred_map = pred_maps.mean(dim = 0)
266
+ return pred, pred_map
267
+
268
+
269
+ def time_shifting(self, x, shift_len):
270
+ shift_len = int(shift_len)
271
+ new_sample = torch.cat([x[:, shift_len:], x[:, :shift_len]], axis = 1)
272
+ return new_sample
273
+
274
+ def test_step(self, batch, batch_idx):
275
+ self.device_type = next(self.parameters()).device
276
+ if self.config.fl_local:
277
+ pred = torch.zeros(len(batch["waveform"]), self.config.classes_num).float().to(self.device_type)
278
+ pred_map = torch.zeros(len(batch["waveform"]), 1024, self.config.classes_num).float().to(self.device_type)
279
+ for j in range(len(self.sed_models)):
280
+ temp_pred, temp_pred_map = self(batch["waveform"], j)
281
+ pred = pred + temp_pred
282
+ pred_map = pred_map + temp_pred_map
283
+ pred = pred / len(self.sed_models)
284
+ pred_map = pred_map / len(self.sed_models)
285
+ return [
286
+ pred.detach().cpu().numpy(),
287
+ pred_map.detach().cpu().numpy(),
288
+ batch["audio_name"],
289
+ batch["real_len"].cpu().numpy()
290
+ ]
291
+ else:
292
+ pred = torch.zeros(len(batch["waveform"]), self.config.classes_num).float().to(self.device_type)
293
+ for j in range(len(self.sed_models)):
294
+ temp_pred, _ = self(batch["waveform"], j)
295
+ pred = pred + temp_pred
296
+ pred = pred / len(self.sed_models)
297
+ return [
298
+ pred.detach(),
299
+ batch["target"].detach(),
300
+ ]
301
+
302
+ def test_epoch_end(self, test_step_outputs):
303
+ self.device_type = next(self.parameters()).device
304
+ if self.config.fl_local:
305
+ pred = np.concatenate([d[0] for d in test_step_outputs], axis = 0)
306
+ pred_map = np.concatenate([d[1] for d in test_step_outputs], axis = 0)
307
+ audio_name = np.concatenate([d[2] for d in test_step_outputs], axis = 0)
308
+ real_len = np.concatenate([d[3] for d in test_step_outputs], axis = 0)
309
+ heatmap_file = os.path.join(self.config.heatmap_dir, self.config.test_file + "_" + str(self.device_type) + ".npy")
310
+ print(pred.shape)
311
+ print(pred_map.shape)
312
+ print(real_len.shape)
313
+ save_npy = [
314
+ {
315
+ "audio_name": audio_name[i],
316
+ "heatmap": pred_map[i],
317
+ "pred": pred[i],
318
+ "real_len":real_len[i]
319
+ }
320
+ for i in range(len(pred))
321
+ ]
322
+ np.save(heatmap_file, save_npy)
323
+ else:
324
+ pred = torch.cat([d[0] for d in test_step_outputs], dim = 0)
325
+ target = torch.cat([d[1] for d in test_step_outputs], dim = 0)
326
+ gather_pred = [torch.zeros_like(pred) for _ in range(dist.get_world_size())]
327
+ gather_target = [torch.zeros_like(target) for _ in range(dist.get_world_size())]
328
+
329
+ dist.barrier()
330
+ if self.config.dataset_type == "audioset":
331
+ metric_dict = {
332
+ "mAP": 0.,
333
+ "mAUC": 0.,
334
+ "dprime": 0.
335
+ }
336
+ else:
337
+ metric_dict = {
338
+ "acc":0.
339
+ }
340
+ dist.all_gather(gather_pred, pred)
341
+ dist.all_gather(gather_target, target)
342
+ if dist.get_rank() == 0:
343
+ gather_pred = torch.cat(gather_pred, dim = 0).cpu().numpy()
344
+ gather_target = torch.cat(gather_target, dim = 0).cpu().numpy()
345
+ if self.config.dataset_type == "scv2":
346
+ gather_target = np.argmax(gather_target, 1)
347
+ metric_dict = self.evaluate_metric(gather_pred, gather_target)
348
+ print(self.device_type, dist.get_world_size(), metric_dict, flush = True)
349
+ if self.config.dataset_type == "audioset":
350
+ self.log("mAP", metric_dict["mAP"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
351
+ self.log("mAUC", metric_dict["mAUC"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
352
+ self.log("dprime", metric_dict["dprime"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
353
+ else:
354
+ self.log("acc", metric_dict["acc"] * float(dist.get_world_size()), on_epoch = True, prog_bar=True, sync_dist=True)
355
+ dist.barrier()
356
+
357
+
358
+
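As a side note on configure_optimizers() above: the LambdaLR implements a short warm-up followed by a staged decay. Below is a self-contained sketch of the same rule with illustrative (assumed) values for lr_rate and lr_scheduler_epoch, which in this repo actually come from htsat_config.py:

import bisect

lr_rate = [0.02, 0.05, 0.1]          # assumed warm-up scales for epochs 0-2
lr_scheduler_epoch = [10, 20, 30]    # assumed decay milestones

def lr_scale(epoch):
    if epoch < 3:
        return lr_rate[epoch]                                 # warm-up
    lr_pos = int(-1 - bisect.bisect_left(lr_scheduler_epoch, epoch))
    if lr_pos < -3:
        return max(lr_rate[0] * (0.98 ** epoch), 0.03)        # exponential tail
    return lr_rate[lr_pos]                                    # staged decay

print([round(lr_scale(e), 4) for e in range(0, 40, 5)])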
utils.py ADDED
@@ -0,0 +1,580 @@
1
+ # Ke Chen
2
3
+ # Zero-shot Audio Source Separation via Query-based Learning from Weakly-labeled Data
4
+ # Some Common Methods
5
+
6
+ import numpy as np
7
+ from scipy.signal import butter, filtfilt
8
+ import torch
9
+ import torch.nn as nn
10
+ from torch import Tensor
11
+ from typing import Optional
12
+ import logging
13
+ import os
14
+ import sys
15
+ import h5py
16
+ import csv
17
+ import time
18
+ import json
19
+ import museval
20
+ import librosa
21
+ from datetime import datetime
22
+
23
+ def create_folder(fd):
24
+ if not os.path.exists(fd):
25
+ os.makedirs(fd)
26
+
27
+ def get_filename(path):
28
+ path = os.path.realpath(path)
29
+ na_ext = path.split('/')[-1]
30
+ na = os.path.splitext(na_ext)[0]
31
+ return na
32
+
33
+ def get_sub_filepaths(folder):
34
+ paths = []
35
+ for root, dirs, files in os.walk(folder):
36
+ for name in files:
37
+ path = os.path.join(root, name)
38
+ paths.append(path)
39
+ return paths
40
+
41
+ def np_to_pytorch(x, device = None):
42
+ if 'float' in str(x.dtype):
43
+ x = torch.Tensor(x)
44
+ elif 'int' in str(x.dtype):
45
+ x = torch.LongTensor(x)
46
+ else:
47
+ return x
48
+ return x.to(device)
49
+
50
+ def count_parameters(model):
51
+ return sum(p.numel() for p in model.parameters() if p.requires_grad)
52
+
53
+ def calculate_average_energy(x):
54
+ return np.mean(np.square(x))
55
+
56
+ def id_to_one_hot(id, classes_num):
57
+ one_hot = np.zeros(classes_num)
58
+ one_hot[id] = 1
59
+ return one_hot
60
+
61
+ def ids_to_hots(ids, classes_num):
62
+ hots = np.zeros(classes_num)
63
+ for id in ids:
64
+ hots[id] = 1
65
+ return hots
66
+
67
+ def float32_to_int16(x):
68
+ assert np.max(np.abs(x)) <= 1.
69
+ return (x * 32767.).astype(np.int16)
70
+
71
+ def int16_to_float32(x):
72
+ return (x / 32767.).astype(np.float32)
73
+
74
+ def collect_fn(list_data_dict):
75
+ np_data_dict = {}
76
+ for key in list_data_dict[0].keys():
77
+ np_data_dict[key] = np.array([data_dict[key] for data_dict in list_data_dict])
78
+ return np_data_dict
79
+
80
+ def dump_config(config, filename, include_time = False):
81
+ save_time = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
82
+ config_json = {}
83
+ for key in dir(config):
84
+ if not key.startswith("_"):
85
+ config_json[key] = eval("config." + key)
86
+ if include_time:
87
+ filename = filename + "_" + save_time
88
+ with open(filename + ".json", "w") as f:
89
+ json.dump(config_json, f ,indent=4)
90
+
91
+
92
+ def get_segment_bgn_end_samples(anchor_index, segment_frames, hop_samples, clip_samples):
93
+ bgn_frame = anchor_index - segment_frames // 2
94
+ end_frame = anchor_index + segment_frames // 2
95
+ bgn_sample = bgn_frame * hop_samples
96
+ end_sample = end_frame * hop_samples
97
+
98
+ segment_samples = segment_frames * hop_samples
99
+
100
+ if bgn_sample < 0:
101
+ bgn_sample = 0
102
+ end_sample = segment_samples
103
+
104
+ if end_sample > clip_samples:
105
+ bgn_sample = clip_samples - segment_samples
106
+ end_sample = clip_samples
107
+
108
+ return bgn_sample, end_sample
109
+
110
+ def get_mix_data(waveforms, con_vectors, class_ids, indexes, mix_type = "mixture"):
111
+ # define return data
112
+ mixtures = []
113
+ sources = []
114
+ conditions = []
115
+ gds = []
116
+ for i in range(0, len(indexes), 2):
117
+ n1 = indexes[i]
118
+ n2 = indexes[i + 1]
119
+ # energy normalization
120
+ e1 = np.mean(np.square(waveforms[n1]))
121
+ e2 = np.mean(np.square(waveforms[n2]))
122
+ ratio = (e1 / max(1e-8, e2)) ** 0.5
123
+ ratio = np.clip(ratio, 0.02, 50)
124
+ waveforms[n2] *= ratio
125
+ mixture = waveforms[n1] + waveforms[n2]
126
+ # form data
127
+ if mix_type == "clean":
128
+ mixtures.append(waveforms[n1])
129
+ mixtures.append(waveforms[n2])
130
+ sources.append(waveforms[n1])
131
+ sources.append(waveforms[n2])
132
+ elif mix_type == "silence":
133
+ mixtures.append(waveforms[n2])
134
+ mixtures.append(waveforms[n1])
135
+ sources.append(np.zeros_like(waveforms[n1]))
136
+ sources.append(np.zeros_like(waveforms[n2]))
137
+ else:
138
+ mixtures.append(mixture)
139
+ mixtures.append(mixture)
140
+ sources.append(waveforms[n1])
141
+ sources.append(waveforms[n2])
142
+
143
+ conditions.append(con_vectors[n1])
144
+ conditions.append(con_vectors[n2])
145
+ gds.append(class_ids[n1])
146
+ gds.append(class_ids[n2])
147
+ return mixtures, sources, conditions, gds
148
+
149
+ # generate a list
150
+ def get_balanced_class_list(index_path, factor = 3, black_list = None, random_seed = 0):
151
+ # initialization
152
+ random_state = np.random.RandomState(random_seed)
153
+ logging.info("Load Indexes...............")
154
+ with h5py.File(index_path, "r") as hf:
155
+ indexes = hf["index_in_hdf5"][:]
156
+ targets = hf["target"][:].astype(np.float32)
157
+ (audios_num, classes_num) = targets.shape
158
+ # set the indexes per class for balanced list
159
+ indexes_per_class = []
160
+ for k in range(classes_num):
161
+ indexes_per_class.append(
162
+ np.where(targets[:, k] == 1)[0]
163
+ )
164
+
165
+ logging.info("Load Indexes Succeed...............")
166
+
167
+ return indexes_per_class
168
+
169
+ def dataset_worker_init_fn_seed(worker_id):
170
+ seed = np.random.randint(0, 224141) + worker_id * np.random.randint(100,1000)
171
+ print(seed)
172
+ np.random.seed(seed)
173
+
174
+ def calculate_sdr(ref, est, scaling=False):
175
+ s = museval.evaluate(ref[None,:,None], est[None,:,None], win = len(ref), hop = len(ref))
176
+ return s[0][0]
177
+
178
+ def butter_lowpass_filter(data, cuton, cutoff, fs, order):
179
+ normal_cutoff = cutoff / (0.5 * fs)
180
+ normal_cuton = cuton / (0.5 * fs)
181
+ b, a = butter(order, [normal_cuton, normal_cutoff], btype="band", analog=False)
182
+ y = filtfilt(b,a, data)
183
+ return y
184
+
185
+ def calculate_silence_sdr(mixture, est):
186
+ sdr = 10. * (
187
+ np.log10(np.clip(np.mean(mixture ** 2), 1e-8, np.inf)) \
188
+ - np.log10(np.clip(np.mean(est ** 2), 1e-8, np.inf)))
189
+ return sdr
190
+
191
+
192
+ def evaluate_sdr(ref, est, class_ids, mix_type = "mixture"):
193
+ sdr_results = []
194
+ if mix_type == "silence":
195
+ for i in range(len(ref)):
196
+ sdr = calculate_silence_sdr(ref[i,:,0], est[i,:,0])
197
+ sdr_results.append([sdr, class_ids[i]])
198
+ else:
199
+ for i in range(len(ref)):
200
+ if np.sum(ref[i,:,0]) == 0 or np.sum(est[i,:,0]) == 0:
201
+ continue
202
+ else:
203
+ sdr_c = calculate_sdr(ref[i,:,0], est[i,:,0], scaling = True)
204
+ sdr_results.append([sdr_c, class_ids[i]])
205
+ return sdr_results
206
+
207
+ # set the audio into the format that can be fed into the model
208
+ # convert to mono -> resample -> output the audio
209
+ # track [n_sample, n_channel]
210
+ def prepprocess_audio(track, ofs, rfs, mono_type = "mix"):
211
+ if track.shape[-1] > 1:
212
+ # stereo
213
+ if mono_type == "mix":
214
+ track = np.transpose(track, (1,0))
215
+ track = librosa.to_mono(track)
216
+ elif mono_type == "left":
217
+ track = track[:, 0]
218
+ elif mono_type == "right":
219
+ track = track[:, 1]
220
+ else:
221
+ track = track[:, 0]
222
+ # track [n_sample]
223
+ if ofs != rfs:
224
+ track = librosa.resample(track, ofs, rfs)
225
+ return track
226
+
227
+ # *************************************************
228
+ # everything below is adapted from the reference Wiener filter code
229
+
230
+ def atan2(y, x):
231
+ r"""Element-wise arctangent function of y/x.
232
+ Returns a new tensor with signed angles in radians.
233
+ It is an alternative implementation of torch.atan2
234
+ Args:
235
+ y (Tensor): First input tensor
236
+ x (Tensor): Second input tensor [shape=y.shape]
237
+ Returns:
238
+ Tensor: [shape=y.shape].
239
+ """
240
+ pi = 2 * torch.asin(torch.tensor(1.0))
241
+ x += ((x == 0) & (y == 0)) * 1.0
242
+ out = torch.atan(y / x)
243
+ out += ((y >= 0) & (x < 0)) * pi
244
+ out -= ((y < 0) & (x < 0)) * pi
245
+ out *= 1 - ((y > 0) & (x == 0)) * 1.0
246
+ out += ((y > 0) & (x == 0)) * (pi / 2)
247
+ out *= 1 - ((y < 0) & (x == 0)) * 1.0
248
+ out += ((y < 0) & (x == 0)) * (-pi / 2)
249
+ return out
250
+
251
+
252
+ # Define basic complex operations on torch.Tensor objects whose last dimension
253
+ # consists in the concatenation of the real and imaginary parts.
254
+ def _norm(x: torch.Tensor) -> torch.Tensor:
255
+ r"""Computes the norm value of a torch Tensor, assuming that it
256
+ comes as real and imaginary part in its last dimension.
257
+ Args:
258
+ x (Tensor): Input Tensor of shape [shape=(..., 2)]
259
+ Returns:
260
+ Tensor: shape as x excluding the last dimension.
261
+ """
262
+ return torch.abs(x[..., 0]) ** 2 + torch.abs(x[..., 1]) ** 2
263
+
264
+
265
+ def _mul_add(a: torch.Tensor, b: torch.Tensor, out: Optional[torch.Tensor] = None) -> torch.Tensor:
266
+ """Element-wise multiplication of two complex Tensors described
267
+ through their real and imaginary parts.
268
+ The result is added to the `out` tensor"""
269
+
270
+ # check `out` and allocate it if needed
271
+ target_shape = torch.Size([max(sa, sb) for (sa, sb) in zip(a.shape, b.shape)])
272
+ if out is None or out.shape != target_shape:
273
+ out = torch.zeros(target_shape, dtype=a.dtype, device=a.device)
274
+ if out is a:
275
+ real_a = a[..., 0]
276
+ out[..., 0] = out[..., 0] + (real_a * b[..., 0] - a[..., 1] * b[..., 1])
277
+ out[..., 1] = out[..., 1] + (real_a * b[..., 1] + a[..., 1] * b[..., 0])
278
+ else:
279
+ out[..., 0] = out[..., 0] + (a[..., 0] * b[..., 0] - a[..., 1] * b[..., 1])
280
+ out[..., 1] = out[..., 1] + (a[..., 0] * b[..., 1] + a[..., 1] * b[..., 0])
281
+ return out
282
+
283
+
284
+ def _mul(a: torch.Tensor, b: torch.Tensor, out: Optional[torch.Tensor] = None) -> torch.Tensor:
285
+ """Element-wise multiplication of two complex Tensors described
286
+ through their real and imaginary parts
287
+ can work in place in case out is a only"""
288
+ target_shape = torch.Size([max(sa, sb) for (sa, sb) in zip(a.shape, b.shape)])
289
+ if out is None or out.shape != target_shape:
290
+ out = torch.zeros(target_shape, dtype=a.dtype, device=a.device)
291
+ if out is a:
292
+ real_a = a[..., 0]
293
+ out[..., 0] = real_a * b[..., 0] - a[..., 1] * b[..., 1]
294
+ out[..., 1] = real_a * b[..., 1] + a[..., 1] * b[..., 0]
295
+ else:
296
+ out[..., 0] = a[..., 0] * b[..., 0] - a[..., 1] * b[..., 1]
297
+ out[..., 1] = a[..., 0] * b[..., 1] + a[..., 1] * b[..., 0]
298
+ return out
299
+
300
+
301
+ def _inv(z: torch.Tensor, out: Optional[torch.Tensor] = None) -> torch.Tensor:
302
+ """Element-wise multiplicative inverse of a Tensor with complex
303
+ entries described through their real and imaginary parts.
304
+ can work in place in case out is z"""
305
+ ez = _norm(z)
306
+ if out is None or out.shape != z.shape:
307
+ out = torch.zeros_like(z)
308
+ out[..., 0] = z[..., 0] / ez
309
+ out[..., 1] = -z[..., 1] / ez
310
+ return out
311
+
312
+
313
+ def _conj(z, out: Optional[torch.Tensor] = None) -> torch.Tensor:
314
+ """Element-wise complex conjugate of a Tensor with complex entries
315
+ described through their real and imaginary parts.
316
+ can work in place in case out is z"""
317
+ if out is None or out.shape != z.shape:
318
+ out = torch.zeros_like(z)
319
+ out[..., 0] = z[..., 0]
320
+ out[..., 1] = -z[..., 1]
321
+ return out
322
+
323
+
324
+ def _invert(M: torch.Tensor, out: Optional[torch.Tensor] = None) -> torch.Tensor:
325
+ """
326
+ Invert 1x1 or 2x2 matrices
327
+ Will generate errors if the matrices are singular: user must handle this
328
+ through their own regularization schemes.
329
+ Args:
330
+ M (Tensor): [shape=(..., nb_channels, nb_channels, 2)]
331
+ matrices to invert: must be square along dimensions -3 and -2
332
+ Returns:
333
+ invM (Tensor): [shape=M.shape]
334
+ inverses of M
335
+ """
336
+ nb_channels = M.shape[-2]
337
+
338
+ if out is None or out.shape != M.shape:
339
+ out = torch.empty_like(M)
340
+
341
+ if nb_channels == 1:
342
+ # scalar case
343
+ out = _inv(M, out)
344
+ elif nb_channels == 2:
345
+ # two channels case: analytical expression
346
+
347
+ # first compute the determinant
348
+ det = _mul(M[..., 0, 0, :], M[..., 1, 1, :])
349
+ det = det - _mul(M[..., 0, 1, :], M[..., 1, 0, :])
350
+ # invert it
351
+ invDet = _inv(det)
352
+
353
+ # then fill out the matrix with the inverse
354
+ out[..., 0, 0, :] = _mul(invDet, M[..., 1, 1, :], out[..., 0, 0, :])
355
+ out[..., 1, 0, :] = _mul(-invDet, M[..., 1, 0, :], out[..., 1, 0, :])
356
+ out[..., 0, 1, :] = _mul(-invDet, M[..., 0, 1, :], out[..., 0, 1, :])
357
+ out[..., 1, 1, :] = _mul(invDet, M[..., 0, 0, :], out[..., 1, 1, :])
358
+ else:
359
+ raise Exception("Only 2 channels are supported for the torch version.")
360
+ return out
361
+
362
+
363
+
364
+ def expectation_maximization(
365
+ y: torch.Tensor,
366
+ x: torch.Tensor,
367
+ iterations: int = 2,
368
+ eps: float = 1e-10,
369
+ batch_size: int = 200,
370
+ ):
371
+ r"""Expectation maximization algorithm, for refining source separation
372
+ estimates.
373
+ Args:
374
+ y (Tensor): [shape=(nb_frames, nb_bins, nb_channels, 2, nb_sources)]
375
+ initial estimates for the sources
376
+ x (Tensor): [shape=(nb_frames, nb_bins, nb_channels, 2)]
377
+ complex STFT of the mixture signal
378
+ iterations (int): [scalar]
379
+ number of iterations for the EM algorithm.
380
+ eps (float or None): [scalar]
381
+ The epsilon value to use for regularization and filters.
382
+ Returns:
383
+ y (Tensor): [shape=(nb_frames, nb_bins, nb_channels, 2, nb_sources)]
384
+ estimated sources after iterations
385
+ v (Tensor): [shape=(nb_frames, nb_bins, nb_sources)]
386
+ estimated power spectral densities
387
+ R (Tensor): [shape=(nb_bins, nb_channels, nb_channels, 2, nb_sources)]
388
+ estimated spatial covariance matrices
389
+ """
390
+ # dimensions
391
+ (nb_frames, nb_bins, nb_channels) = x.shape[:-1]
392
+ nb_sources = y.shape[-1]
393
+
394
+ regularization = torch.cat(
395
+ (
396
+ torch.eye(nb_channels, dtype=x.dtype, device=x.device)[..., None],
397
+ torch.zeros((nb_channels, nb_channels, 1), dtype=x.dtype, device=x.device),
398
+ ),
399
+ dim=2,
400
+ )
401
+ regularization = torch.sqrt(torch.as_tensor(eps)) * (
402
+ regularization[None, None, ...].expand((-1, nb_bins, -1, -1, -1))
403
+ )
404
+
405
+ # allocate the spatial covariance matrices
406
+ R = [
407
+ torch.zeros((nb_bins, nb_channels, nb_channels, 2), dtype=x.dtype, device=x.device)
408
+ for j in range(nb_sources)
409
+ ]
410
+ weight: torch.Tensor = torch.zeros((nb_bins,), dtype=x.dtype, device=x.device)
411
+
412
+ v: torch.Tensor = torch.zeros((nb_frames, nb_bins, nb_sources), dtype=x.dtype, device=x.device)
413
+ for it in range(iterations):
414
+ # constructing the mixture covariance matrix. Doing it with a loop
415
+ # to avoid storing anytime in RAM the whole 6D tensor
416
+
417
+ # update the PSD as the average spectrogram over channels
418
+ v = torch.mean(torch.abs(y[..., 0, :]) ** 2 + torch.abs(y[..., 1, :]) ** 2, dim=-2)
419
+
420
+ # update spatial covariance matrices (weighted update)
421
+ for j in range(nb_sources):
422
+ R[j] = torch.tensor(0.0, device=x.device)
423
+ weight = torch.tensor(eps, device=x.device)
424
+ pos: int = 0
425
+ batch_size = batch_size if batch_size else nb_frames
426
+ while pos < nb_frames:
427
+ t = torch.arange(pos, min(nb_frames, pos + batch_size))
428
+ pos = int(t[-1]) + 1
429
+
430
+ R[j] = R[j] + torch.sum(_covariance(y[t, ..., j]), dim=0)
431
+ weight = weight + torch.sum(v[t, ..., j], dim=0)
432
+ R[j] = R[j] / weight[..., None, None, None]
433
+ weight = torch.zeros_like(weight)
434
+
435
+ # cloning y if we track gradient, because we're going to update it
436
+ if y.requires_grad:
437
+ y = y.clone()
438
+
439
+ pos = 0
440
+ while pos < nb_frames:
441
+ t = torch.arange(pos, min(nb_frames, pos + batch_size))
442
+ pos = int(t[-1]) + 1
443
+
444
+ y[t, ...] = torch.tensor(0.0, device=x.device)
445
+
446
+ # compute mix covariance matrix
447
+ Cxx = regularization
448
+ for j in range(nb_sources):
449
+ Cxx = Cxx + (v[t, ..., j, None, None, None] * R[j][None, ...].clone())
450
+
451
+ # invert it
452
+ inv_Cxx = _invert(Cxx)
453
+
454
+ # separate the sources
455
+ for j in range(nb_sources):
456
+
457
+ # create a wiener gain for this source
458
+ gain = torch.zeros_like(inv_Cxx)
459
+
460
+ # computes multichannel Wiener gain as v_j R_j inv_Cxx
461
+ indices = torch.cartesian_prod(
462
+ torch.arange(nb_channels),
463
+ torch.arange(nb_channels),
464
+ torch.arange(nb_channels),
465
+ )
466
+ for index in indices:
467
+ gain[:, :, index[0], index[1], :] = _mul_add(
468
+ R[j][None, :, index[0], index[2], :].clone(),
469
+ inv_Cxx[:, :, index[2], index[1], :],
470
+ gain[:, :, index[0], index[1], :],
471
+ )
472
+ gain = gain * v[t, ..., None, None, None, j]
473
+
474
+ # apply it to the mixture
475
+ for i in range(nb_channels):
476
+ y[t, ..., j] = _mul_add(gain[..., i, :], x[t, ..., i, None, :], y[t, ..., j])
477
+
478
+ return y, v, R
479
+
480
+ def _covariance(y_j):
481
+ """
482
+ Compute the empirical covariance for a source.
483
+ Args:
484
+ y_j (Tensor): complex stft of the source.
485
+ [shape=(nb_frames, nb_bins, nb_channels, 2)].
486
+ Returns:
487
+ Cj (Tensor): [shape=(nb_frames, nb_bins, nb_channels, nb_channels, 2)]
488
+ just y_j * conj(y_j.T): empirical covariance for each TF bin.
489
+ """
490
+ (nb_frames, nb_bins, nb_channels) = y_j.shape[:-1]
491
+ Cj = torch.zeros(
492
+ (nb_frames, nb_bins, nb_channels, nb_channels, 2),
493
+ dtype=y_j.dtype,
494
+ device=y_j.device,
495
+ )
496
+ indices = torch.cartesian_prod(torch.arange(nb_channels), torch.arange(nb_channels))
497
+ for index in indices:
498
+ Cj[:, :, index[0], index[1], :] = _mul_add(
499
+ y_j[:, :, index[0], :],
500
+ _conj(y_j[:, :, index[1], :]),
501
+ Cj[:, :, index[0], index[1], :],
502
+ )
503
+ return Cj
504
+
505
+ def wiener(
506
+ targets_spectrograms: torch.Tensor,
507
+ mix_stft: torch.Tensor,
508
+ iterations: int = 1,
509
+ softmask: bool = False,
510
+ residual: bool = False,
511
+ scale_factor: float = 10.0,
512
+ eps: float = 1e-10,
513
+ ):
514
+ """Wiener-based separation for multichannel audio.
515
+ Returns:
516
+ Tensor: shape=(nb_frames, nb_bins, nb_channels, complex=2, nb_sources)
517
+ STFT of estimated sources
518
+ """
519
+ if softmask:
520
+ # if we use softmask, we compute the ratio mask for all targets and
521
+ # multiply by the mix stft
522
+ y = (
523
+ mix_stft[..., None]
524
+ * (
525
+ targets_spectrograms
526
+ / (eps + torch.sum(targets_spectrograms, dim=-1, keepdim=True).to(mix_stft.dtype))
527
+ )[..., None, :]
528
+ )
529
+ else:
530
+ # otherwise, we just multiply the targets spectrograms with mix phase
531
+ # we tacitly assume that we have magnitude estimates.
532
+ angle = atan2(mix_stft[..., 1], mix_stft[..., 0])[..., None]
533
+ nb_sources = targets_spectrograms.shape[-1]
534
+ y = torch.zeros(
535
+ mix_stft.shape + (nb_sources,), dtype=mix_stft.dtype, device=mix_stft.device
536
+ )
537
+ y[..., 0, :] = targets_spectrograms * torch.cos(angle)
538
+ y[..., 1, :] = targets_spectrograms * torch.sin(angle)
539
+
540
+ if residual:
541
+ # if required, adding an additional target as the mix minus
542
+ # available targets
543
+ y = torch.cat([y, mix_stft[..., None] - y.sum(dim=-1, keepdim=True)], dim=-1)
544
+
545
+ if iterations == 0:
546
+ return y
547
+
548
+ # we need to refine the estimates. Scales down the estimates for
549
+ # numerical stability
550
+ max_abs = torch.max(
551
+ torch.as_tensor(1.0, dtype=mix_stft.dtype, device=mix_stft.device),
552
+ torch.sqrt(_norm(mix_stft)).max() / scale_factor,
553
+ )
554
+
555
+ mix_stft = mix_stft / max_abs
556
+ y = y / max_abs
557
+
558
+ # call expectation maximization
559
+ y = expectation_maximization(y, mix_stft, iterations, eps=eps)[0]
560
+
561
+ # scale estimates up again
562
+ y = y * max_abs
563
+ return y
564
+
565
+ def split_nparray_with_overlap(array, array_size, overlap_size):
566
+ result = []
567
+ element_size = int(len(array) / array_size)
568
+ for i in range(array_size):
569
+ offset = int(i * element_size)
570
+ last_loop = i == array_size
571
+ chunk = array[offset : offset + element_size + (0 if last_loop else overlap_size)]
572
+ chunk = chunk.copy()
573
+ chunk.resize(element_size + overlap_size, refcheck = False)
574
+ result.append(chunk)
575
+
576
+ return np.array(result)
577
+
578
+
579
+
580
+
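A small sketch of the SDR helpers above on synthetic signals (the waveforms are random placeholders, not real audio; museval must be installed as listed in requirements.txt):

import numpy as np
from utils import calculate_silence_sdr, evaluate_sdr

rng = np.random.default_rng(0)
mixture = rng.normal(size=32000).astype(np.float32)   # ~1 s of noise at 32 kHz
estimate = 0.01 * mixture                             # a nearly-silent estimate

# silence SDR: how much quieter the estimate is than the mixture (higher is better)
print(calculate_silence_sdr(mixture, estimate))       # about 40 dB for a 0.01 gain

# evaluate_sdr expects arrays shaped [n_clips, n_samples, 1] plus class ids
ref = mixture[None, :, None]
est = (0.8 * mixture)[None, :, None]
print(evaluate_sdr(ref, est, class_ids=[0], mix_type="mixture"))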
zero_shot_create_vector.py ADDED
@@ -0,0 +1,158 @@
1
+ # Ke Chen
2
3
+ # Zero-shot Audio Source Separation via Query-based Learning from Weakly-labeled Data
4
+ # The Main Script
5
+
6
+ import os
7
+ gpu_use = 0
8
+ # limit thread counts so that the SDR calculation does not occupy all CPUs
9
+ os.environ["OMP_NUM_THREADS"] = "4"
10
+ os.environ["OPENBLAS_NUM_THREADS"] = "4"
11
+ os.environ["MKL_NUM_THREADS"] = "6"
12
+ os.environ["VECLIB_MAXIMUM_THREADS"] = "4"
13
+ os.environ["NUMEXPR_NUM_THREADS"] = "6"
14
+ os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_use)
15
+
16
+ import librosa
17
+ import numpy as np
18
+ import soundfile as sf
19
+ from hashlib import md5
20
+
21
+ import torch
22
+ from torch.utils.data import DataLoader
23
+ from utils import collect_fn, dump_config, create_folder, prepprocess_audio
24
+ from models.asp_model import ZeroShotASP, SeparatorModel, AutoTaggingWarpper, WhitingWarpper
25
+ from data_processor import LGSPDataset, MusdbDataset
26
+ import config
27
+ import htsat_config
28
+ from models.htsat import HTSAT_Swin_Transformer
29
+ from sed_model import SEDWrapper
30
+
31
+ import pytorch_lightning as pl
32
+
33
+ import time
34
+ import tqdm
35
+ import warnings
36
+ import shutil
37
+ import pickle
38
+ warnings.filterwarnings("ignore")
39
+
40
+ # use the model to quickly separate a track given a query
41
+ # it requires four variables in config.py:
42
+ # inference_file: the track you want to separate
43
+ # inference_query: a **folder** containing all samples from the same source
44
+ # test_key: ["name"] indicate the source name (just a name for final output, no other functions)
45
+ # wave_output_path: the output folder
46
+
47
+ # make sure the query folder contain the samples from the same source
48
+ # each time, the model is able to separate one source from the track
49
+ # if you want to separate multiple sources, you need to change the query folder or write a script to help you do that
50
+
51
+
52
+ def save_in_file_fast(arr, file_name):
53
+ pickle.dump(arr, open(file_name, 'wb'), protocol=4)
54
+
55
+
56
+ def load_from_file_fast(file_name):
57
+ return pickle.load(open(file_name, 'rb'))
58
+
59
+
60
+ def create_vector():
61
+ test_type = 'mix'
62
+ inference_file = config.inference_file
63
+ inference_query = config.inference_query
64
+ test_key = config.test_key
65
+ wave_output_path = config.wave_output_path
66
+ sample_rate = config.sample_rate
67
+ resume_checkpoint_zeroshot = config.resume_checkpoint
68
+ resume_checkpoint_htsat = htsat_config.resume_checkpoint
69
+ print('Inference query folder: {}'.format(inference_query))
70
+ print('Test key: {}'.format(test_key))
71
+ print('Vector out folder: {}'.format(wave_output_path))
72
+ print('Sample rate: {}'.format(sample_rate))
73
+ print('Model 1 (zeroshot): {}'.format(resume_checkpoint_zeroshot))
74
+
75
+ # set exp settings
76
+ device_name = "cuda" if torch.cuda.is_available() else "cpu"
77
+ device = torch.device(device_name)
78
+ create_folder(wave_output_path)
79
+
80
+ # obtain the samples for query
81
+ queries = []
82
+ query_names = []
83
+ for query_file in tqdm.tqdm(os.listdir(inference_query)):
84
+ f_path = os.path.join(inference_query, query_file)
85
+ if query_file.endswith(".wav"):
86
+ temp_q, fs = librosa.load(f_path, sr=None)
87
+ temp_q = temp_q[:, None]
88
+ temp_q = prepprocess_audio(
89
+ temp_q,
90
+ fs,
91
+ sample_rate,
92
+ test_type
93
+ )
94
+ temp = [temp_q]
95
+ for dickey in test_key:
96
+ temp.append(temp_q)
97
+ temp = np.array(temp)
98
+ queries.append(temp)
99
+ query_names.append(os.path.basename(query_file))
100
+
101
+ sed_model = HTSAT_Swin_Transformer(
102
+ spec_size=htsat_config.htsat_spec_size,
103
+ patch_size=htsat_config.htsat_patch_size,
104
+ in_chans=1,
105
+ num_classes=htsat_config.classes_num,
106
+ window_size=htsat_config.htsat_window_size,
107
+ config=htsat_config,
108
+ depths=htsat_config.htsat_depth,
109
+ embed_dim=htsat_config.htsat_dim,
110
+ patch_stride=htsat_config.htsat_stride,
111
+ num_heads=htsat_config.htsat_num_head
112
+ )
113
+ at_model = SEDWrapper(
114
+ sed_model=sed_model,
115
+ config=htsat_config,
116
+ dataset=None
117
+ )
118
+ ckpt = torch.load(resume_checkpoint_htsat, map_location="cpu")
119
+ at_model.load_state_dict(ckpt["state_dict"])
120
+
121
+ if device_name == 'cpu':
122
+ trainer = pl.Trainer(
123
+ accelerator="cpu", gpus=None
124
+ )
125
+ else:
126
+ trainer = pl.Trainer(
127
+ gpus=1
128
+ )
129
+
130
+ print('Process: {}'.format(len(queries)))
131
+ avg_dataset = MusdbDataset(
132
+ tracks=queries
133
+ )
134
+ avg_loader = DataLoader(
135
+ dataset=avg_dataset,
136
+ num_workers=1,
137
+ batch_size=1,
138
+ shuffle=False
139
+ )
140
+ at_wrapper = AutoTaggingWarpper(
141
+ at_model=at_model,
142
+ config=config,
143
+ target_keys=test_key
144
+ )
145
+ trainer.test(
146
+ at_wrapper,
147
+ test_dataloaders=avg_loader
148
+ )
149
+ avg_at = at_wrapper.avg_at
150
+
151
+ md5_str = str(md5(str(queries).encode('utf-8')).hexdigest())
152
+ out_vector_path = wave_output_path + '/{}_vector_{}.pkl'.format(test_key[0], md5_str)
153
+ save_in_file_fast(avg_at, out_vector_path)
154
+ print('Vector saved in: {}'.format(out_vector_path))
155
+
156
+
157
+ if __name__ == '__main__':
158
+ create_vector()
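As the usage notes at the top of this script say, create_vector() reads everything from config.py. Below is a minimal sketch of the relevant entries, with placeholder paths (fields that config.py defines for training are omitted here):

# config.py (excerpt) -- placeholder values for zero_shot_create_vector.py
inference_file = "examples/mixture.wav"       # track to separate later with the saved vector
inference_query = "examples/query_vocals"     # folder of .wav samples from one source
test_key = ["vocals"]                         # name used to label the output vector
wave_output_path = "outputs"                  # where the .pkl query vector is written
sample_rate = 32000                           # model sample rate
resume_checkpoint = "pretrained/zeroshot_asp_full.ckpt"   # ZeroShotASP checkpoint (placeholder path)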