diff --git a/README.md b/README.md
index 4fd9618dde607663fcd69b94b821a8603a243b8b..20cf8a7bc5b236d5e1064470df8d11d56bb8c752 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,5 @@
 ---
 title: DeepSound-V1
-emoji: ๐
 colorFrom: blue
 colorTo: indigo
 sdk: gradio
@@ -9,155 +8,160 @@ pinned: false
 ---
-# [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)
+
-[Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)
-University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation
+
+## [DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos](https://github.com/lym0302/DeepSound-V1)
-[[Paper (being prepared)]](https://hkchengrex.github.io/MMAudio) [[Project Page]](https://hkchengrex.github.io/MMAudio)
+
+
-**Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added. Code is subject to non-backward-compatible changes.**
+
 ## Highlight
-MMAudio generates synchronized audio given video and/or text inputs.
-Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.
-Moreover, a synchronization module aligns the generated audio with the video frames.
+DeepSound-V1 is a framework that enables audio generation from videos with initial step-by-step thinking, requiring no extra annotations, by leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM).
-
-## Results
+
+
 ## Installation
+```bash
+conda create -n deepsound-v1 python=3.10.16 -y
+conda activate deepsound-v1
+pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu120
+pip install flash-attn==2.5.8 --no-build-isolation
+pip install -e .
+pip install -r reqirments.txt
+```
+
-We have only tested this on Ubuntu.
+
-**Clone our repository:**
+
-```bash
-cd MMAudio
-pip install -e .
+
-(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
-
-**Pretrained models:**
-
-The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`
-
-| Model | Download link | File size |
-| -------- | ------- | ------- |
-| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
-| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
-| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
-| Flow prediction network, large 44.1kHz **(recommended)** | mmaudio_large_44k.pth | 3.9G |
-| 16kHz VAE | v1-16.pth | 655M |
-| 16kHz BigVGAN vocoder | best_netG.pt | 429M |
-| 44.1kHz VAE | v1-44.pth | 1.2G |
-| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
-
-The 44.1kHz vocoder will be downloaded automatically.
-
-The expected directory structure (full):
+
+
+
+
+
 ## Demo
-By default, these scripts use the `large_44k` model.
-In our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode) which should fit in most modern GPUs.
+### Pretrained models
+See [MODELS.md](docs/MODELS.md).
 ### Command-line interface
 With `demo.py`
+
 ```bash
-python demo.py --duration=8 --video=<path-to-video>
 ```
-- Example 1: Ice cracking with sharp snapping sound, and metal tool scraping against the ice surface.
-- Example 2: Rhythmic splashing and lapping of water.
-- Example 3: Shovel scrapes against dry earth.
-- (Failure case) Example 4: Creamy sound of mashed potatoes being scooped.
+
+
+ +
+
+
+> [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://github.com/DAMO-NLP-SG/Video-LLaMA)
+> Hang Zhang, Xin Li, Lidong Bing
+[[GitHub]](https://github.com/DAMO-NLP-SG/Video-LLaMA) [[Paper]](https://arxiv.org/abs/2306.02858)
+
+> [**VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding**](https://arxiv.org/abs/2311.16922)
+> Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
+[[GitHub]](https://github.com/DAMO-NLP-SG/VCD) [[Paper]](https://arxiv.org/abs/2311.16922)
+
+> [**The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio**](https://arxiv.org/abs/2410.12787)
+> Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
+[[GitHub]](https://github.com/DAMO-NLP-SG/CMM) [[Paper]](https://arxiv.org/abs/2410.12787)
+
+
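A note on the Installation block added in the diff above: it pins Python 3.10.16, a CUDA build of PyTorch, and flash-attn 2.5.8. The snippet below is an editor's sketch of a quick post-install sanity check, assuming only that those packages were installed as listed; it is not a script from the DeepSound-V1 repository.

```python
# Post-install sanity check (editor's sketch, not part of DeepSound-V1).
# Assumes only that torch and flash-attn were installed as in the README's
# Installation section.
import importlib.util

import torch


def check_environment() -> None:
    print(f"torch {torch.__version__} (CUDA build: {torch.version.cuda})")
    if torch.cuda.is_available():
        print(f"GPU detected: {torch.cuda.get_device_name(0)}")
    else:
        print("WARNING: CUDA not available; inference would fall back to CPU.")
    # flash-attn was installed with --no-build-isolation; confirm it is importable.
    if importlib.util.find_spec("flash_attn") is None:
        print("WARNING: flash_attn not importable; retry the flash-attn install step.")
    else:
        print("flash_attn import OK")


if __name__ == "__main__":
    check_environment()
```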
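The Highlight line added in the diff describes audio generation from videos with initial step-by-step thinking based on an MLLM's internal chain-of-thought. The sketch below only illustrates that kind of generate, judge, refine control flow; every function name and the sample path are hypothetical, assumed for illustration, and none of it reflects the actual DeepSound-V1 API or checkpoints.

```python
# Hypothetical illustration of a step-by-step (CoT-style) video-to-audio loop.
# generate_audio, judge_with_mllm, and refine_audio are stand-ins assumed for
# this sketch; DeepSound-V1's real components are not shown here.


def generate_audio(video_path: str) -> str:
    """Stand-in V2A call: pretend to produce a candidate audio file."""
    return video_path.rsplit(".", 1)[0] + ".flac"


def judge_with_mllm(video_path: str, audio_path: str) -> bool:
    """Stand-in MLLM chain-of-thought check on the candidate audio."""
    return True  # placeholder verdict


def refine_audio(audio_path: str) -> str:
    """Stand-in refinement step, e.g. regenerate or post-process the audio."""
    return audio_path


def step_by_step_v2a(video_path: str, max_rounds: int = 2) -> str:
    audio = generate_audio(video_path)          # step 1: draft audio from the video
    for _ in range(max_rounds):
        if judge_with_mllm(video_path, audio):  # step 2: MLLM verdict on the draft
            break                               # accepted, stop the loop
        audio = refine_audio(audio)             # step 3: refine and judge again
    return audio


if __name__ == "__main__":
    print(step_by_step_v2a("example.mp4"))
```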