ta012 committed
Commit e6ec3af · verified · 1 Parent(s): 7ec9b82

Update README.md

Files changed (1):
  1. README.md +3 -57

README.md CHANGED
@@ -6,7 +6,6 @@
 
 🔗 **[Github](https://github.com/ta012/SSLAM) | [Paper](https://openreview.net/pdf?id=odU59TxdiB) | [Open Review](https://openreview.net/forum?id=odU59TxdiB) | [Pre-trained Models(HuggingFace)](https://huggingface.co/ta012/SSLAM) | [Pre-trained Models(Google Drive)](https://drive.google.com/drive/folders/1G0icv-hdqDEqnfP4EFszMXhFnWWM09gT?usp=sharing) **
 
- ---
 
 # 📋 Table of Contents
 - [Why SSLAM?](#why-sslam)
@@ -16,29 +15,22 @@
 - [Inference Installation](#inference-installation)
 - [Model Weights](#model-weights)
 - [Using SSLAM](#using-sslam)
- - [Training Mode](#training-mode)
- - [Training Installation](#training-installation)
- - [Data Preparation](#️data-preparation)
- - [Pre-Training](#pre-training)
- - [Checklist](#checklist)
 - [Acknowledgements](#acknowledgements)
 - [Citation](#citation)
 
- ---
 
 ## 🔍Why SSLAM?
 🔊 **Real-world audio is polyphonic**—multiple overlapping sound sources are common in everyday environments.
 ❌ **Existing SSL models focus on monophonic audio,** limiting their ability to generalize to real-world scenarios. Their benchmarks are primarily monophonic, and their pre-training does not account for polyphonic environments.
 💡 **SSLAM bridges this gap** by introducing **self-supervised learning from audio mixtures**, enabling robust learning across **both monophonic and polyphonic soundscapes**.
 
- ---
 
 ## 🎼Key Features
 ✅ **Self-Supervised Learning from Audio Mixtures (SSLAM)** – improving robustness to real-world polyphonic audio (multiple overlapping sounds).
 ✅ **Source Retention Loss** – ensures the integrity of each sound source even in complex mixtures.
 ✅ **SOTA Performance** – Achieves **+3.9% mAP improvement** on AudioSet-2M and **+9.1% on polyphonic datasets**.
 
- ---
+
 ## 📊Results
 
 ### 1. Standard Audio-SSL Benchmark Datasets
@@ -47,7 +39,7 @@
 ### 2. Polyphonic Datasets
 ![Polyphonic Datasets](assets/poly_results.png)
 
- ---
+
 ## **🔍️Inference Mode**
 > **Note**: If you are already using [EAT](https://github.com/cwx-worst-one/EAT/tree/main) in your evaluation/inference pipeline, you can simply replace the weights with SSLAM weights, as the inference and evaluation code is identical to EAT.
 
@@ -59,7 +51,7 @@ conda create --prefix /path/to/sslam_eval_env -y python=3.9.13
 /path/to/sslam_eval_env/bin/python -m pip install pip==24.0 # downgrade pip
 /path/to/sslam_eval_env/bin/pip install -r SSLAM_Inference/requirements_sslam_eval.txt
 ```
- ---
+
 
 ## 📦Model Weights
 
@@ -94,53 +86,7 @@ cd SSLAM_Inference/scripts
 bash evaluate_AS2M_finetuned.sh # Reported mAP: 50.2
 ```
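Since the evaluation pipeline is identical to EAT's, the downloaded weights can be dropped in wherever EAT checkpoints are expected. As a purely illustrative sanity check (the checkpoint filename below is a placeholder for whichever SSLAM weight you downloaded), the fairseq-style `.pt` file can be loaded on CPU and its top-level keys listed before running the script above:

```bash
# Illustrative sanity check only -- not part of the SSLAM scripts.
# The checkpoint filename is a placeholder; point it at your downloaded weights.
# Recent PyTorch releases may additionally require weights_only=False in torch.load
# for fairseq-style checkpoints.
/path/to/sslam_eval_env/bin/python - <<'PY'
import torch

ckpt = torch.load("SSLAM_AS2M_finetuned.pt", map_location="cpu")  # placeholder path
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
PY
```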
 
- ---
- ## **📈Training Mode**
- We cover self-supervised pre-training, fine-tuning, and linear evaluation in this section.
-
- #### **📥Training Installation**
-
- For training, it is better to install fairseq in editable mode:
-
- ```bash
- conda create --prefix /path/to/sslam_env -y python=3.9.13 ## env used for training
- /path/to/sslam_env/bin/python -m pip install pip==24.0 # downgrade pip
- cd SSLAM/
- git clone https://github.com/facebookresearch/fairseq.git
-
- ## IMPORTANT: Copy the Pre-Training/SSLAM_Stage2 directory to SSLAM/fairseq
- ## so that the resultant path is SSLAM/fairseq/SSLAM_Stage2/.
- cd fairseq/
-
- ## install all requirements apart from fairseq
- /path/to/sslam_env/bin/pip install -r SSLAM_Stage2/requirements_sslam.txt
- ## install fairseq in editable mode
- /path/to/sslam_env/bin/pip install --editable ./
- ```
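After the editable install, it is worth confirming that the training environment resolves fairseq from the cloned checkout rather than from a regular site-packages copy; a minimal check along these lines, using the paths assumed above:

```bash
# Minimal sanity check (illustrative): fairseq should import from the editable
# checkout at SSLAM/fairseq, not from an ordinary site-packages installation.
/path/to/sslam_env/bin/python -c "import fairseq; print(fairseq.__version__, fairseq.__file__)"
```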
- #### 🗄️Data Preparation
- We utilised AudioSet-2M (full set) for pre-training. For this phase, only the `train.tsv` file is required. Refer to [train.tsv for AudioSet-20K](data_manifests/manifest_as20k/train.tsv) to prepare the train.tsv file for your downloaded copy of AudioSet-2M.
-
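The linked AudioSet-20K manifest appears to follow the wav2vec-style layout common in fairseq audio recipes: the root audio directory on the first line, then one tab-separated relative path and sample count per clip. The snippet below is only a rough sketch of how such a file could be generated; it assumes SoX's `soxi` is installed and that the clips are stored as `.wav` files, so verify the output against the provided AudioSet-20K manifest before pre-training:

```bash
# Rough sketch only (not part of the repository): generate a train.tsv in the
# wav2vec-style manifest layout -- root directory on the first line, then
# "<relative path>\t<number of samples>" per clip.
# Assumes SoX's `soxi` is on PATH and the audio is stored as .wav files;
# check data_manifests/manifest_as20k/train.tsv for the exact expected format.
root=/path/to/audioset2m/train_audio   # placeholder location of your AudioSet-2M wavs
out=train.tsv
echo "$root" > "$out"
find "$root" -type f -name '*.wav' | sort | while read -r f; do
  n=$(soxi -s "$f")                    # total number of samples in the clip
  printf '%s\t%s\n' "${f#$root/}" "$n" >> "$out"
done
```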
- #### 🚀Pre-Training
-
- **Note:** This repository focuses solely on Stage 2 pre-training, which introduces our novel SSLAM pre-training strategy.
-
- To begin Stage 2, you'll need a Stage 1 checkpoint. In our complete pre-training process, Stage 1 mirrors the approach in [EAT](https://github.com/cwx-worst-one/EAT/tree/main) and achieves similar performance. For convenience, we use the EAT checkpoint as the Stage 1 checkpoint.
-
- Download the epoch 10 checkpoint from the link provided in the [EAT](https://github.com/cwx-worst-one/EAT/tree/main) repository: [EAT-base_epoch10_pt.pt](https://drive.google.com/file/d/10pklbY_fKraQUIBizSg1kv4lJXNWxpxl/view?usp=sharing).
-
- *Only the contents of the **models/** folder and a few parameters in the pre-training script differ between Stage 1 and Stage 2.*
 
- ```bash
- cd SSLAM/fairseq/SSLAM_Stage2/scripts/
- bash pretrain_stage2.sh
- ```
-
-
- ## 📌Checklist
- - [x] Inference Mode
- - [x] Pre-Training
-
- ---
 
 ## 🙏Acknowledgements
 
 