Update README.md

README.md

🔗 **[GitHub](https://github.com/ta012/SSLAM) | [Paper](https://openreview.net/pdf?id=odU59TxdiB) | [OpenReview](https://openreview.net/forum?id=odU59TxdiB) | [Pre-trained Models (HuggingFace)](https://huggingface.co/ta012/SSLAM) | [Pre-trained Models (Google Drive)](https://drive.google.com/drive/folders/1G0icv-hdqDEqnfP4EFszMXhFnWWM09gT?usp=sharing)**

---

# 📋 Table of Contents
- [Why SSLAM?](#why-sslam)
- [Inference Installation](#inference-installation)
- [Model Weights](#model-weights)
- [Using SSLAM](#using-sslam)
- [Training Mode](#training-mode)
- [Training Installation](#training-installation)
- [Data Preparation](#️data-preparation)
- [Pre-Training](#pre-training)
- [Checklist](#checklist)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)

---
## 🔍Why SSLAM?
🔊 **Real-world audio is polyphonic**—multiple overlapping sound sources are common in everyday environments.
❌ **Existing SSL models focus on monophonic audio,** limiting their ability to generalize to real-world scenarios. Their benchmarks are primarily monophonic, and their pre-training does not account for polyphonic environments.
💡 **SSLAM bridges this gap** by introducing **self-supervised learning from audio mixtures**, enabling robust learning across **both monophonic and polyphonic soundscapes**.

---
## 🎼Key Features
✅ **Self-Supervised Learning from Audio Mixtures (SSLAM)** – improves robustness to real-world polyphonic audio (multiple overlapping sounds).
✅ **Source Retention Loss** – ensures the integrity of each sound source even in complex mixtures.
✅ **SOTA Performance** – achieves a **+3.9% mAP improvement** on AudioSet-2M and **+9.1% on polyphonic datasets**.

## 📊Results
### 1. Standard Audio-SSL Benchmark Datasets

### 2. Polyphonic Datasets


## **🔍️Inference Mode**
> **Note**: If you are already using [EAT](https://github.com/cwx-worst-one/EAT/tree/main) in your evaluation/inference pipeline, you can simply replace the weights with SSLAM weights, as the inference and evaluation code is identical to EAT.
```bash
conda create --prefix /path/to/sslam_eval_env -y python=3.9.13
/path/to/sslam_eval_env/bin/python -m pip install pip==24.0 # downgrade pip
/path/to/sslam_eval_env/bin/pip install -r SSLAM_Inference/requirements_sslam_eval.txt
```
## 📦Model Weights

```bash
cd SSLAM_Inference/scripts
bash evaluate_AS2M_finetuned.sh # Reported mAP: 50.2
```
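As noted above, the inference code matches EAT, so a downloaded SSLAM checkpoint should load exactly where an EAT checkpoint would. A minimal sanity check is sketched below; the file name is illustrative, and it assumes the released weights are a standard PyTorch checkpoint file.

```python
# Hypothetical sanity check: confirm a downloaded SSLAM checkpoint loads the
# same way an EAT checkpoint does. The path is illustrative; point it at the
# file you downloaded from HuggingFace or Google Drive.
import torch

ckpt = torch.load("/path/to/sslam_checkpoint.pt", map_location="cpu")

# Fairseq-style checkpoints are typically dictionaries; printing the top-level
# keys shows where the model state dict lives for your EAT loading code.
if isinstance(ckpt, dict):
    print(sorted(ckpt.keys()))
else:
    print(type(ckpt))
```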
---

## **📈Training Mode**
We cover self-supervised pre-training, fine-tuning, and linear evaluation in this section.

#### **📥Training Installation**
For training, it is better to install fairseq in editable mode:
```bash
conda create --prefix /path/to/sslam_env -y python=3.9.13 ## env used for training
/path/to/sslam_env/bin/python -m pip install pip==24.0 # downgrade pip
cd SSLAM/
git clone https://github.com/facebookresearch/fairseq.git

## IMPORTANT: Copy the Pre-Training/SSLAM_Stage2 directory to SSLAM/fairseq
## so that the resultant path is SSLAM/fairseq/SSLAM_Stage2/.
cd fairseq/

## install all requirements apart from fairseq
/path/to/sslam_env/bin/pip install -r SSLAM_Stage2/requirements_sslam.txt
## install fairseq in editable mode
/path/to/sslam_env/bin/pip install --editable ./
```
#### 🗄️Data Preparation
We utilised AudioSet-2M (full set) for pre-training. For this phase, only the `train.tsv` file is required. Refer to [train.tsv for AudioSet-20K](data_manifests/manifest_as20k/train.tsv) to prepare the `train.tsv` file for your downloaded copy of AudioSet-2M.
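If it helps, the sketch below shows one way to generate such a manifest. It assumes the common fairseq-style wav manifest layout (audio root directory on the first line, then one `<relative path>\t<number of samples>` entry per clip); verify the output against the provided AudioSet-20K `train.tsv` before using it.

```python
# Hypothetical manifest builder for a local AudioSet-2M copy.
# Assumes the fairseq-style wav manifest layout described above; check the
# result against data_manifests/manifest_as20k/train.tsv.
from pathlib import Path

import soundfile as sf  # pip install soundfile

audio_root = Path("/path/to/audioset_2m_wavs")  # your local audio root

with open("train.tsv", "w") as manifest:
    manifest.write(f"{audio_root}\n")  # first line: root directory
    for wav_path in sorted(audio_root.rglob("*.wav")):
        num_samples = sf.info(str(wav_path)).frames  # clip length in samples
        manifest.write(f"{wav_path.relative_to(audio_root)}\t{num_samples}\n")
```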
#### 🚀Pre-Training
**Note:** This repository focuses solely on Stage 2 pre-training, which introduces our novel SSLAM pre-training strategy.

To begin Stage 2, you'll need a Stage 1 checkpoint. In our complete pre-training process, Stage 1 mirrors the approach in [EAT](https://github.com/cwx-worst-one/EAT/tree/main) and achieves similar performance. For convenience, we use the EAT checkpoint as the Stage 1 checkpoint.

Download the epoch-10 checkpoint, [EAT-base_epoch10_pt.pt](https://drive.google.com/file/d/10pklbY_fKraQUIBizSg1kv4lJXNWxpxl/view?usp=sharing), via the link provided in the [EAT](https://github.com/cwx-worst-one/EAT/tree/main) repository.
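If you prefer to fetch it from the command line, one option is the third-party `gdown` package (not part of the SSLAM requirements, so this is only a suggestion); the output filename is just a convention.

```python
# Suggested download of the Stage 1 (EAT epoch-10) checkpoint from the Google
# Drive link above, using the third-party gdown package (pip install gdown).
import gdown

file_id = "10pklbY_fKraQUIBizSg1kv4lJXNWxpxl"  # taken from the Drive URL above

gdown.download(
    url=f"https://drive.google.com/uc?id={file_id}",
    output="EAT-base_epoch10_pt.pt",
    quiet=False,
)
```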
*Only the contents of the **models/** folder and a few parameters in the pre-training script differ between Stage 1 and Stage 2.*

```bash
cd SSLAM/fairseq/SSLAM_Stage2/scripts/
bash pretrain_stage2.sh
```
## 📌Checklist
- [x] Inference Mode
- [x] Pre-Training

---

## 🙏Acknowledgements