Update README.md
# 🔊 [ICLR 2025] SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

[](https://openreview.net/forum?id=odU59TxdiB)

🚀 **SSLAM** is a self-supervised learning framework designed to enhance audio representation quality for both **polyphonic (multiple overlapping sounds)** and monophonic soundscapes. Unlike traditional SSL models that focus on monophonic data, SSLAM introduces a novel **source retention loss** and **audio mixture training**, significantly improving performance on real-world polyphonic audio.

🔗 **[GitHub](https://github.com/ta012/SSLAM)** | **[Paper](https://openreview.net/pdf?id=odU59TxdiB)** | **[Open Review](https://openreview.net/forum?id=odU59TxdiB)** | **[Pre-trained Models](https://drive.google.com/drive/folders/1G0icv-hdqDEqnfP4EFszMXhFnWWM09gT?usp=sharing)**

---

# 📋 Table of Contents
- [Why SSLAM?](#why-sslam)
- [Key Features](#key-features)
- [Results](#results)
- [Inference Mode](#️inference-mode)
- [Inference Installation](#inference-installation)
- [Model Weights](#model-weights)
- [Using SSLAM](#using-sslam)
- [Training Mode](#training-mode)
- [Training Installation](#training-installation)
- [Data Preparation](#️data-preparation)
- [Pre-Training](#pre-training)
- [Checklist](#checklist)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)

---

## 🔍Why SSLAM?
🔊 **Real-world audio is polyphonic**: multiple overlapping sound sources are common in everyday environments.
❌ **Existing SSL models focus on monophonic audio**, limiting their ability to generalize to real-world scenarios. Their benchmarks are primarily monophonic, and their pre-training does not account for polyphonic environments.
💡 **SSLAM bridges this gap** by introducing **self-supervised learning from audio mixtures**, enabling robust learning across **both monophonic and polyphonic soundscapes**.

---

## 🎼Key Features
✅ **Self-Supervised Learning from Audio Mixtures (SSLAM)** – improves robustness to real-world polyphonic audio (multiple overlapping sounds).
✅ **Source Retention Loss** – ensures the integrity of each sound source even in complex mixtures (see the conceptual sketch below).
✅ **SOTA Performance** – achieves a **+3.9% mAP improvement** on AudioSet-2M and **+9.1% on polyphonic datasets**.
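
For intuition only, here is a minimal, hypothetical sketch of what training on audio mixtures with a source-retention term can look like. This is **not** the SSLAM implementation: the actual objective, masking, and EAT-style student/teacher setup are defined in the paper and in the `SSLAM_Stage2` pre-training code, and `student`, `teacher`, `alpha`, and the equal loss weighting below are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F

def mixture_step(student, teacher, wav_a, wav_b, alpha=0.5):
    """Purely illustrative: train on an additive mixture of two clips while
    keeping each individual source's representation recoverable."""
    mix = alpha * wav_a + (1.0 - alpha) * wav_b        # simple additive audio mixture
    with torch.no_grad():                              # targets come from the unmixed clips
        tgt_a, tgt_b = teacher(wav_a), teacher(wav_b)
    # mixture term: features of the mixture should reflect both sources
    loss_mix = F.mse_loss(student(mix), alpha * tgt_a + (1.0 - alpha) * tgt_b)
    # source-retention term: each unmixed source must still be well represented
    loss_ret = F.mse_loss(student(wav_a), tgt_a) + F.mse_loss(student(wav_b), tgt_b)
    return loss_mix + loss_ret
```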

---
## 📊Results

### 1. Standard Audio-SSL Benchmark Datasets


### 2. Polyphonic Datasets


---
## **🔍️Inference Mode**
> **Note**: If you are already using [EAT](https://github.com/cwx-worst-one/EAT/tree/main) in your evaluation/inference pipeline, you can simply replace the weights with SSLAM weights, as the inference and evaluation code is identical to that of EAT.

If not, follow the steps below for installation:
## 📥Inference Installation

```bash
conda create --prefix /path/to/sslam_eval_env -y python=3.9.13
/path/to/sslam_eval_env/bin/python -m pip install pip==24.0  # downgrade pip
/path/to/sslam_eval_env/bin/pip install -r SSLAM_Inference/requirements_sslam_eval.txt
```

---

## 📦Model Weights

| Model Type | Link |
|--------------------------|--------------------------------------------------------------------------------------------|
| **Pre-Trained** | [Download](https://drive.google.com/drive/folders/1aA65-qQCHSCrkiDeLGUtn1PiEjJi5HS8?usp=sharing) |
| **AS2M Fine-Tuned** (50.2 mAP) | [Download](https://drive.google.com/drive/folders/1Yy38IyksON5RJFNM7gzeQoAOSPnEIKp2?usp=sharing) |
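> **Note**: the checkpoints are standard PyTorch `.pt` files. As a quick, purely illustrative sanity check after downloading (the path below is a placeholder), you can inspect one with `torch.load`:

```python
import torch

# Placeholder path: point this at whichever checkpoint you downloaded.
ckpt = torch.load("checkpoints/SSLAM_AS2M_finetuned.pt", map_location="cpu")

print(list(ckpt.keys()))                  # fairseq checkpoints are plain dicts
state = ckpt.get("model", ckpt)           # model weights usually live under the 'model' key
print(len(state), "tensors, e.g.", next(iter(state)))
```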

---

#### 🚀**Using SSLAM**

We provide scripts to use SSLAM in the following ways:

##### 1. **Audio Feature (Representation) Extraction Using the SSLAM Encoder**

```bash
cd SSLAM_Inference/scripts
bash feature_extract.sh
```
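
If you need to adapt the script, note that EAT-style encoders consume log-mel filterbank (spectrogram) features rather than raw waveforms. The snippet below is a hedged illustration using `torchaudio`; the mel-bin count, frame settings, and normalisation actually used are defined in the inference code, so treat these values as placeholders:

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

wav, sr = torchaudio.load("example.wav")   # placeholder WAV file
wav = wav - wav.mean()                     # simple DC offset removal
fbank = kaldi.fbank(
    wav,
    sample_frequency=sr,
    htk_compat=True,
    num_mel_bins=128,    # placeholder: check the inference config for the real value
    frame_shift=10,      # milliseconds
    window_type="hanning",
    use_energy=False,
)
print(fbank.shape)       # (num_frames, num_mel_bins)
```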

##### 2. **Inference on a Single Audio WAV File**

```bash
cd SSLAM_Inference/scripts
bash inference.sh
```

##### 3. **Evaluation on the AudioSet-2M Evaluation Set**

```bash
cd SSLAM_Inference/scripts
bash evaluate_AS2M_finetuned.sh  # Reported mAP: 50.2
```
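
The reported number is mean average precision (mAP) over the 527 AudioSet classes: average precision is computed independently for each class on the evaluation set and then averaged. The snippet below is a minimal illustration of that metric on random stand-in data, not the repository's evaluation code:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Stand-in data: multi-hot labels and sigmoid scores for 527 AudioSet classes.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 527))
y_score = rng.random((1000, 527))

mAP = average_precision_score(y_true, y_score, average="macro")
print(f"mAP: {mAP:.3f}")
```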

---
## **📈Training Mode**
This section covers self-supervised pre-training, fine-tuning, and linear evaluation.

#### **📥Training Installation**

For training, it is better to install fairseq in editable mode:

```bash
conda create --prefix /path/to/sslam_env -y python=3.9.13  ## env used for training
/path/to/sslam_env/bin/python -m pip install pip==24.0  # downgrade pip
cd SSLAM/
git clone https://github.com/facebookresearch/fairseq.git

## IMPORTANT: Copy the Pre-Training/SSLAM_Stage2 directory to SSLAM/fairseq
## so that the resultant path is SSLAM/fairseq/SSLAM_Stage2/.
cd fairseq/

## install all requirements apart from fairseq
/path/to/sslam_env/bin/pip install -r SSLAM_Stage2/requirements_sslam.txt
## install fairseq in editable mode
/path/to/sslam_env/bin/pip install --editable ./
```
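
After the editable install completes, a quick illustrative check that the environment resolves the local fairseq clone:

```python
# Run with /path/to/sslam_env/bin/python
import torch
import fairseq

print("fairseq:", fairseq.__version__, "from", fairseq.__file__)  # should resolve inside SSLAM/fairseq
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```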

#### 🗄️Data Preparation
We utilised AudioSet-2M (the full set) for pre-training. For this phase, only the `train.tsv` file is required. Refer to [train.tsv for AudioSet-20K](data_manifests/manifest_as20k/train.tsv) to prepare the `train.tsv` file for your downloaded copy of AudioSet-2M.
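
The linked AudioSet-20K manifest is the reference for the exact layout. As a rough, hedged illustration, fairseq-style audio manifests typically contain the dataset root directory on the first line followed by one `relative_path<TAB>num_samples` line per clip; a sketch for generating such a file is below (verify the output against the provided `train.tsv` before training):

```python
import os
import soundfile as sf

root = "/path/to/audioset_2m_wavs"   # placeholder: directory containing your AudioSet-2M wav files

with open("train.tsv", "w") as tsv:
    tsv.write(root + "\n")                        # first line: dataset root directory
    for name in sorted(os.listdir(root)):
        if name.endswith(".wav"):
            n_samples = sf.info(os.path.join(root, name)).frames
            tsv.write(f"{name}\t{n_samples}\n")   # relative path <TAB> number of samples
```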

#### 🚀Pre-Training

**Note:** This repository focuses solely on Stage 2 pre-training, which introduces our novel SSLAM pre-training strategy.

To begin Stage 2, you'll need a Stage 1 checkpoint. In our complete pre-training process, Stage 1 mirrors the approach in [EAT](https://github.com/cwx-worst-one/EAT/tree/main) and achieves similar performance. For convenience, we use the EAT checkpoint as the Stage 1 checkpoint.

Download the epoch-10 Stage 1 checkpoint provided by the [EAT](https://github.com/cwx-worst-one/EAT/tree/main) repository: [EAT-base_epoch10_pt.pt](https://drive.google.com/file/d/10pklbY_fKraQUIBizSg1kv4lJXNWxpxl/view?usp=sharing).

*Only the contents of the **models/** folder and a few parameters in the pre-training script differ between Stage 1 and Stage 2.*

```bash
cd SSLAM/fairseq/SSLAM_Stage2/scripts/
bash pretrain_stage2.sh
```

## 📌Checklist
- [x] Inference Mode
- [x] Pre-Training

---

## 🙏Acknowledgements

Our code is primarily based on [EAT](https://github.com/cwx-worst-one/EAT/tree/main) and [data2vec 2.0](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec) with additional concepts and components adapted from [AudioMAE](https://github.com/facebookresearch/AudioMAE).

## 📜Citation

If you find our work useful, please cite it as:

```bibtex
@inproceedings{alex2025sslam,
  title={{SSLAM}: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes},
  author={Tony Alex and Sara Atito and Armin Mustafa and Muhammad Awais and Philip J B Jackson},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=odU59TxdiB}
}
```