---
title: DeepSound-V1
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---
<div align="center">
<p align="center">
<h2>DeepSound-V1</h2>
<a href="https://github.com/lym0302/DeepSound-V1">Paper</a> | <a href="https://github.com/lym0302/DeepSound-V1">Webpage</a> | <a href="https://github.com/lym0302/DeepSound-V1"> Huggingface Demo</a>
</p>
</div>
## [DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos](https://github.com/lym0302/DeepSound-V1)
## Highlight
DeepSound-V1 is a framework for audio generation from videos that starts with step-by-step thinking, requires no extra annotations, and builds on the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM).
## Installation
```bash
conda create -n deepsound-v1 python=3.10.16 -y
conda activate deepsound-v1
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.5.8 --no-build-isolation
pip install -e .
pip install -r requirements.txt
```
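After installation, a quick sanity check can confirm that PyTorch sees the GPU and that flash-attn imports cleanly. This is only a suggested verification (assuming a CUDA-capable machine), not part of the official setup:
```bash
# Print the PyTorch version and whether CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Confirm flash-attn was built and imports without errors
python -c "import flash_attn; print('flash-attn imported OK')"
```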
## Demo
### Pretrained models
See [MODELS.md](docs/MODELS.md).
### Command-line interface
With `demo.py`:
```bash
python demo.py -i <video_path>
```
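For example, to run on a single local clip (the path below is only a placeholder, not a file shipped with the repository):
```bash
# Generate audio for one video; replace the path with your own clip
python demo.py -i examples/sample.mp4
```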
All training parameters are [here]().
## Evaluation
Refer to [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.
See [EVAL.md](docs/EVAL.md).
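To run the benchmark yourself, clone the toolkit as shown below; see that repository's own README for its evaluation commands (nothing beyond the clone step is assumed here):
```bash
# Fetch the av-benchmark evaluation toolkit
git clone https://github.com/hkchengrex/av-benchmark.git
```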
## Relevant Repositories
- [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.
## Acknowledgement
Many thanks to:
- [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
- [MMAudio](https://github.com/hkchengrex/MMAudio)
- [FoleyCrafter](https://github.com/open-mmlab/FoleyCrafter)
- [BS-RoFormer](https://github.com/ZFTurbo/Music-Source-Separation-Training)