---
title: DeepSound-V1
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

# DeepSound-V1

Paper | Webpage | Huggingface Demo

DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos

## Highlight

DeepSound-V1 is a framework that enables step-by-step thinking in audio generation from videos, without requiring extra annotations, by leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM).

## Installation

```bash
conda create -n deepsound-v1 python=3.10.16 -y
conda activate deepsound-v1
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu120
pip install flash-attn==2.5.8 --no-build-isolation
pip install -e .
pip install -r requirements.txt
```
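After installation, a quick sanity check can confirm the core packages are importable before running the demo. This is a minimal sketch, not part of the project; the module list mirrors the install commands above:

```python
import importlib.util

def check_deps(mods=("torch", "torchvision", "torchaudio", "flash_attn")):
    """Map each module name to whether it can be found in this environment."""
    return {m: importlib.util.find_spec(m) is not None for m in mods}

if __name__ == "__main__":
    for mod, ok in check_deps().items():
        # MISSING usually means the matching `pip install` step above failed.
        print(f"{mod}: {'found' if ok else 'MISSING'}")
```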

## Demo

### Pretrained models

See MODELS.md.

### Command-line interface

With `demo.py`:

```bash
python demo.py -i <video_path>
```
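To apply the same command to a folder of videos, a small wrapper around the CLI can help. This is a sketch; only the `demo.py -i <video_path>` interface comes from this README, and the helper names and file pattern are hypothetical:

```python
import subprocess
from pathlib import Path

def build_demo_command(video_path, script="demo.py"):
    """Build the CLI invocation shown above for a single video."""
    return ["python", script, "-i", str(video_path)]

def run_on_directory(video_dir, pattern="*.mp4"):
    """Invoke demo.py once per matching video file, sequentially."""
    for video in sorted(Path(video_dir).glob(pattern)):
        subprocess.run(build_demo_command(video), check=True)
```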

All training parameters are here.

## Evaluation

Refer to av-benchmark for benchmarking results. See EVAL.md.

## Citation

## Relevant Repositories

## Acknowledgement

Many thanks to: