---
license: mit
datasets:
- agkphysics/AudioSet
language:
- en
pipeline_tag: audio-classification
library_name: fairseq
tags:
- self-supervised-learning
- audio-self-supervised-learning
- SSL
- AudioSet
- AudioSSL
- AudioEncoder
---
# 🔊 [ICLR 2025] SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes
[OpenReview](https://openreview.net/forum?id=odU59TxdiB)
🚀 **SSLAM** is a self-supervised learning framework designed to enhance audio representation quality for both **polyphonic** (multiple overlapping sounds) and monophonic soundscapes. Unlike traditional SSL models that focus on monophonic data, SSLAM introduces a novel **source retention loss** and **training on audio mixtures**, significantly improving performance on real-world polyphonic audio.
🔗 **[Github](https://github.com/ta012/SSLAM) | [Paper](https://openreview.net/pdf?id=odU59TxdiB) | [Open Review](https://openreview.net/forum?id=odU59TxdiB) | [🤗 Models](https://huggingface.co/ta012/SSLAM) | [Models (Google Drive)](https://drive.google.com/drive/folders/1G0icv-hdqDEqnfP4EFszMXhFnWWM09gT?usp=sharing)**
# 📋 Table of Contents
- [Why SSLAM?](#why-sslam)
- [Key Features](#key-features)
- [Results](#results)
- [Inference Mode](#️inference-mode)
- [Inference Installation](#inference-installation)
- [Model Weights](#model-weights)
- [Using SSLAM](#using-sslam)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
## 🔍Why SSLAM?
🔊 **Real-world audio is polyphonic**—multiple overlapping sound sources are common in everyday environments.
❌ **Existing SSL models focus on monophonic audio,** limiting their ability to generalize to real-world scenarios. Their benchmarks are primarily monophonic, and their pre-training does not account for polyphonic environments.
💡 **SSLAM bridges this gap** by introducing **self-supervised learning from audio mixtures**, enabling robust learning across **both monophonic and polyphonic soundscapes**.
## 🎼Key Features
✅ **Self-Supervised Learning from Audio Mixtures (SSLAM)** – improves robustness to real-world polyphonic audio (multiple overlapping sounds).
✅ **Source Retention Loss** – ensures the representation of each sound source is preserved even in complex mixtures (see the conceptual sketch after this list).
✅ **SOTA Performance** – Achieves **+3.9% mAP improvement** on AudioSet-2M and **+9.1% on polyphonic datasets**.
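To make the idea concrete, here is a loose conceptual sketch of pre-training on audio mixtures with a retention-style term. It is **not** the authors' training code: the encoder names (`student`, `teacher`), the mixing weight `alpha`, and the exact loss terms are illustrative assumptions; the real objective follows the masked-prediction setup of EAT/data2vec 2.0 plus the source retention loss described in the paper.
```python
# Conceptual illustration only -- see the paper and repo for the actual SSLAM objective.
import torch
import torch.nn.functional as F

def mixture_training_step(student, teacher, wav_a, wav_b, alpha=0.5):
    """One hypothetical pre-training step on an audio mixture.

    `student` and `teacher` are assumed encoders mapping waveform -> (frames, dim);
    the teacher is an EMA copy of the student, as in data2vec-style SSL.
    """
    # Mix two clips to form a polyphonic input.
    mixture = alpha * wav_a + (1.0 - alpha) * wav_b

    z_mix = student(mixture)            # student encodes the mixture
    with torch.no_grad():               # teacher targets carry no gradients
        z_a = teacher(wav_a)
        z_b = teacher(wav_b)

    # Mixture-prediction term: the mixture embedding should match a blend of
    # the teacher's per-source targets.
    mix_loss = F.mse_loss(z_mix, alpha * z_a + (1.0 - alpha) * z_b)

    # Source-retention-style term: the mixture embedding should also stay close
    # to each individual source, so neither source is "forgotten".
    retention_loss = F.mse_loss(z_mix, z_a) + F.mse_loss(z_mix, z_b)

    return mix_loss + retention_loss
```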
## 📊Results
### 1. Standard Audio-SSL Benchmark Datasets

### 2. Polyphonic Datasets

## **🔍️Inference Mode**
> **Note**: If you are already using [EAT](https://github.com/cwx-worst-one/EAT/tree/main) in your evaluation/inference pipeline, you can simply swap in the SSLAM weights, since the inference and evaluation code is identical to EAT's.

If not, follow the installation steps below:
## 📥Inference Installation
```bash
conda create --prefix /path/to/sslam_eval_env -y python=3.9.13
/path/to/sslam_eval_env/bin/python -m pip install pip==24.0  # downgrade pip

# Clone SSLAM
git clone https://github.com/ta012/SSLAM.git
cd SSLAM/
/path/to/sslam_eval_env/bin/pip install -r SSLAM_Inference/requirements_sslam_eval.txt
```
## 🚀Using SSLAM
We provide scripts to use SSLAM in the following ways:
##### 1. **Audio Feature (Representation) Extraction Using SSLAM Encoder**
```bash
cd SSLAM_Inference/scripts
bash feature_extract.sh
```
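If you would rather call the encoder from Python directly, the sketch below shows one plausible route, assuming the checkpoint loads through fairseq's `checkpoint_utils` and that the model exposes an EAT-style `extract_features` call. The exact method names, input normalization, and fixed spectrogram length are handled by `feature_extract.sh`, which remains the authoritative reference.
```python
# Minimal Python-side sketch; the model API and preprocessing details may differ,
# so treat SSLAM_Inference/scripts/feature_extract.sh as the reference.
import torch
import torchaudio
from fairseq import checkpoint_utils

CKPT = "/path/to/SSLAM_checkpoint.pt"  # placeholder path to the downloaded weights

# Load the pre-trained encoder through fairseq. Depending on how the checkpoint
# was saved, the SSLAM/EAT model code may first need to be registered with
# fairseq (e.g. via a user directory), which the provided script takes care of.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([CKPT])
model = models[0].eval()

# EAT/AudioMAE-style input: 128-bin log-mel filterbank from 16 kHz audio.
wav, sr = torchaudio.load("example.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)
fbank = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=128, sample_frequency=16000, frame_shift=10
)
# NOTE: the reference script also pads/crops the spectrogram to a fixed length
# and normalizes it with dataset statistics; this sketch omits those steps.
source = fbank.unsqueeze(0).unsqueeze(0)  # (batch, channel, frames, mel_bins)

with torch.no_grad():
    # Assumed EAT-style call; check the repo for the exact method and arguments.
    out = model.extract_features(source, padding_mask=None, mask=False)
    frame_embeddings = out["x"]  # (batch, frames, dim) patch-level representations
```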
##### 2. **Inference on Single Audio WAV File**
```bash
cd SSLAM_Inference/scripts
bash inference.sh
```
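AudioSet tagging is multi-label, so the finetuned model's logits are typically converted to per-class scores with a sigmoid rather than a softmax. The snippet below is a generic post-processing sketch (the label CSV path and column names are assumptions based on the standard AudioSet `class_labels_indices.csv`); `inference.sh` performs the full pipeline end to end.
```python
# Generic multi-label post-processing sketch for AudioSet-style tagging.
import csv
import torch

def top_k_tags(logits: torch.Tensor, label_csv: str, k: int = 5):
    """Map raw logits over the 527 AudioSet classes to the k most likely tags.

    `label_csv` is assumed to be the standard class_labels_indices.csv with
    columns: index, mid, display_name.
    """
    with open(label_csv) as f:
        names = [row["display_name"] for row in csv.DictReader(f)]
    probs = torch.sigmoid(logits.squeeze())   # independent per-class probabilities
    scores, idx = probs.topk(k)
    return [(names[i], float(s)) for i, s in zip(idx.tolist(), scores)]
```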
##### 3. **Evaluation on AudioSet-2M Evaluation Set**
```bash
cd SSLAM_Inference/scripts
bash evaluate_AS2M_finetuned.sh # Reported mAP: 50.2
```
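The reported 50.2 mAP is the mean of per-class average precision over the AudioSet evaluation set. As a reference for what the script computes, here is a minimal sketch of that metric using scikit-learn (used here purely for illustration; the evaluation script may compute it differently internally):
```python
# Minimal mAP sketch for multi-label audio tagging: mean of per-class average precision.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(targets: np.ndarray, scores: np.ndarray) -> float:
    """targets: (clips, classes) binary matrix; scores: same shape, sigmoid outputs."""
    ap_per_class = [
        average_precision_score(targets[:, c], scores[:, c])
        for c in range(targets.shape[1])
        if targets[:, c].any()          # skip classes with no positive examples
    ]
    return float(np.mean(ap_per_class))
```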
## 🙏Acknowledgements
Our code is primarily based on [EAT](https://github.com/cwx-worst-one/EAT/tree/main) and [data2vec 2.0](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec) with additional concepts and components adapted from [AudioMAE](https://github.com/facebookresearch/AudioMAE).
## 📜Citation
If you find our work useful, please cite it as:
```bibtex
@inproceedings{alex2025sslam,
  title={{SSLAM}: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes},
  author={Tony Alex and Sara Atito and Armin Mustafa and Muhammad Awais and Philip J B Jackson},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=odU59TxdiB}
}
```