---
title: DeepSound-V1
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---


<!-- # DeepSound-V1
Official code for DeepSound-V1 -->


<div align="center">
<p align="center">
  <h2>DeepSound-V1</h2>
  <!-- <a href="https://arxiv.org/abs/2412.15322">Paper</a> | <a href="https://hkchengrex.github.io/MMAudio">Webpage</a> | <a href="https://huggingface.co/hkchengrex/MMAudio/tree/main">Models</a> | <a href="https://huggingface.co/spaces/hkchengrex/MMAudio"> Huggingface Demo</a> | <a href="https://colab.research.google.com/drive/1TAaXCY2-kPk4xE4PwKB3EqFbSnkUuzZ8?usp=sharing">Colab Demo</a> | <a href="https://replicate.com/zsxkib/mmaudio">Replicate Demo</a> -->
  <a href="https://github.com/lym0302/DeepSound-V1">Paper</a> | <a href="https://github.com/lym0302/DeepSound-V1">Webpage</a> | <a href="https://github.com/lym0302/DeepSound-V1"> Huggingface Demo</a>
</p>
</div>

## [DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos](https://github.com/lym0302/DeepSound-V1)

<!-- [Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/) -->

<!-- University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation -->

<!-- ICCV 2025 -->

## Highlight

DeepSound-V1 is a framework that enables audio generation from videos with an initial step-by-step thinking process, requiring no extra annotations, by leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM).
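
As a rough, unofficial sketch of this idea (the function names below are hypothetical placeholders, not the DeepSound-V1 API): the MLLM first reasons step by step about the video, and that internal chain-of-thought then conditions the video-to-audio generator, so no extra annotations are needed.

```python
# Conceptual sketch only: every function here is a hypothetical placeholder
# used to illustrate the step-by-step (CoT) flow, not code from this repository.
from typing import List


def mllm_reason_about_video(video_path: str) -> List[str]:
    """Placeholder: the MLLM produces chain-of-thought steps about the video."""
    return [f"reasoning step about {video_path}"]


def video_to_audio(video_path: str, guidance: List[str]) -> str:
    """Placeholder: an audio generator conditioned on the video and the CoT."""
    return video_path.rsplit(".", 1)[0] + ".wav"


def generate_audio_with_cot(video_path: str) -> str:
    # 1. Think step by step: the MLLM reasons internally about the video,
    #    with no extra human annotations.
    cot_steps = mllm_reason_about_video(video_path)
    # 2. Generate audio conditioned on the video and the reasoning steps.
    return video_to_audio(video_path, guidance=cot_steps)


if __name__ == "__main__":
    print(generate_audio_with_cot("example.mp4"))
```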

<!-- ## Results

(All audio from our algorithm MMAudio)

Videos from Sora:

https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330

Videos from Veo 2:

https://github.com/user-attachments/assets/8a11419e-fee2-46e0-9e67-dfb03c48d00e

Videos from MovieGen/Hunyuan Video/VGGSound:

https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca

For more results, visit https://hkchengrex.com/MMAudio/video_main.html. -->


## Installation
```bash
conda create -n deepsound-v1 python=3.10.16 -y
conda activate deepsound-v1
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.5.8 --no-build-isolation
pip install -e .
pip install -r requirements.txt
```
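
After installation, an optional sanity check (not part of the repository; adjust to your setup) is to confirm that PyTorch sees a CUDA device and that `flash-attn` imports cleanly:

```python
# Optional, unofficial environment check.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as exc:
    print("flash-attn not importable:", exc)
```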


<!-- We have only tested this on Ubuntu.

### Prerequisites

We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.

- Python 3.9+
- PyTorch **2.5.1+** and corresponding torchvision/torchaudio (pick your CUDA version https://pytorch.org/, pip install recommended)
<!-- - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`) -->

<!-- **1. Install prerequisite if not yet met:**

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
```

(Or any other CUDA versions that your GPUs/driver support) -->

<!-- ```
conda install -c conda-forge 'ffmpeg<7
```
(Optional, if you use miniforge and don't already have the appropriate ffmpeg) -->

<!-- **2. Clone our repository:**

```bash
git clone https://github.com/lym0302/DeepSound-V1.git
```

**3. Install with pip (install pytorch first before attempting this!):**

```bash
cd DeepSound-V1
pip install -e .
```

(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip) --> 


<!-- The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
See [MODELS.md](docs/MODELS.md) for more details. -->

## Demo

### Pretrained models
See [MODELS.md](docs/MODELS.md).

### Command-line interface

With `demo.py`:

```bash
python demo.py -i <video_path>
```
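
To run on several videos, one simple (unofficial) approach is to loop over a folder and invoke `demo.py` once per file; only the `-i` flag is taken from the command above, and the folder layout is an assumption about your local setup:

```python
# Unofficial batch wrapper around demo.py; assumes input videos live in ./videos
# and that demo.py takes one video via -i, as shown above.
import subprocess
from pathlib import Path

for video in sorted(Path("videos").glob("*.mp4")):
    print(f"Generating audio for {video} ...")
    subprocess.run(["python", "demo.py", "-i", str(video)], check=True)
```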

All training parameters are [here]().

<!-- The output (audio in `.wav` format, and video in `.mp4` format) will be saved in `./output`.
See the file for more options.
Simply omit the `--video` option for text-to-audio synthesis.
The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in a lower quality. -->

<!-- ### Gradio interface

Supports video-to-audio and text-to-audio synthesis.
You can also try experimental image-to-audio synthesis which duplicates the input image to a video for processing. This might be interesting to some but it is not something MMAudio has been trained for.
Use [port forwarding](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) (e.g., `ssh -L 7860:localhost:7860 server`) if necessary. The default port is `7860` which you can specify with `--port`.

```bash
python gradio_demo.py
``` -->



## Evaluation
Refer to [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.
See [EVAL.md](docs/EVAL.md).


## Citation

<!-- ```bibtex
@inproceedings{cheng2025taming,
  title={Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
  author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
  booktitle={CVPR},
  year={2025}
}
``` -->

## Relevant Repositories

- [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.


## Acknowledgement

Many thanks to:
- [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2) 
- [MMAudio](https://github.com/hkchengrex/MMAudio) 
- [FoleyCrafter](https://github.com/open-mmlab/FoleyCrafter)
- [BS-RoFormer](https://github.com/ZFTurbo/Music-Source-Separation-Training)