nielsr (HF Staff) committed
Commit b649f60 · verified · 1 Parent(s): 7e65180

Improve model card: add pipeline tag, links, abstract, features, overview, and quick start


This PR significantly improves the model card for `ThinkSound`. It:

- Adds the `pipeline_tag: other` metadata for better discoverability of this multimodal audio generation and editing model.
- Adds comprehensive links to the Hugging Face paper page, the GitHub repository, and the Hugging Face Space demo.
- Includes the paper abstract for a quick overview of the work.
- Incorporates detailed "Features" and "Method Overview" sections to explain the model's capabilities and architecture.
- Provides a "Quick Start" guide for users to easily run the model locally, including environment setup and demo instructions.
- Enhances visual appeal by including key diagrams from the project.

This update makes the model card much more informative and user-friendly.

Files changed (1)
  1. README.md +80 -5
README.md CHANGED
@@ -1,21 +1,96 @@
  ---
  license: apache-2.0
  ---

- This repository contains the weights of [ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing](https://arxiv.org/abs/2506.21448).

- Project Page: https://thinksound-project.github.io/.

- If you find our work useful, please cite our paper:

  ```bibtex
  @misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
- title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
  author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
  year={2025},
  eprint={2506.21448},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
- url={https://arxiv.org/abs/2506.21448},
  }
  ```
 
  ---
  license: apache-2.0
+ pipeline_tag: other
  ---

+ # 🎶 ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
+
+ This repository contains the weights of the model presented in the paper [ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing](https://arxiv.org/abs/2506.21448).
+
+ **ThinkSound** is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning. It enables multimodal audio generation and editing: generate or edit audio from video, text, and audio, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).
+
+ [📚 Paper (arXiv)](https://arxiv.org/abs/2506.21448) | [📚 Paper (Hugging Face)](https://huggingface.co/papers/2506.21448) | [💻 Code](https://github.com/liuhuadai/ThinkSound) | [🌐 Project Page](https://thinksound-project.github.io/) | [🤗 Hugging Face Space Demo](https://huggingface.co/spaces/FunAudioLLM/ThinkSound)
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/liuhuadai/ThinkSound/main/assets/figs/fig1_teaser.png" alt="ThinkSound Teaser" width="100%"/>
+ </div>
+
+ ## Abstract
+
+ While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. Like professionals in the creative industries, such generation requires sophisticated reasoning about items such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels in out-of-distribution Movie Gen Audio benchmark.
+
+ ## 🚀 Features
+
+ - **Any2Audio**: Generate audio from arbitrary modalities — video, text, audio, or their combinations.
+ - **Video-to-Audio SOTA**: Achieves state-of-the-art results on multiple V2A benchmarks.
+ - **CoT-Driven Reasoning**: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.
+ - **Interactive Object-centric Editing**: Refine or edit specific sound events by clicking on visual objects or using text instructions.
+ - **Unified Framework**: One foundation model supports generation, editing, and interactive workflows.
+
+ ## ✨ Method Overview
+
+ ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning (an illustrative sketch follows the list):
+
+ 1. **Foley Generation:** Generate foundational, semantically and temporally aligned soundscapes from video.
+ 2. **Object-Centric Refinement:** Refine or add sounds for user-specified objects via clicks or regions in the video.
+ 3. **Targeted Audio Editing:** Modify generated audio using high-level natural language instructions.
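+
+ A minimal, self-contained sketch of this three-stage flow is shown below. It is illustrative only: the `AudioClip` dataclass and the three stub functions are hypothetical placeholders, not the ThinkSound API (the real pipeline is driven by `scripts/demo.sh` and `app.py`).
+
+ ```python
+ # Illustrative skeleton of the three stages described above; all names are
+ # hypothetical stubs, not part of the ThinkSound codebase.
+ from dataclasses import dataclass
+
+ @dataclass
+ class AudioClip:
+     samples: list[float]
+     sample_rate: int = 44_100
+
+ def foley_generation(video_path: str, cot: str) -> AudioClip:
+     """Stage 1: foundational, video-aligned soundscape (stub)."""
+     return AudioClip(samples=[0.0])
+
+ def object_centric_refinement(audio: AudioClip, video_path: str, click_xy: tuple[int, int], cot: str) -> AudioClip:
+     """Stage 2: refine the sound of a user-selected object (stub)."""
+     return audio
+
+ def targeted_editing(audio: AudioClip, instruction: str) -> AudioClip:
+     """Stage 3: edit the result from a natural-language instruction (stub)."""
+     return audio
+
+ clip = foley_generation("examples/street.mp4", cot="Footsteps on wet asphalt with light traffic behind.")
+ clip = object_centric_refinement(clip, "examples/street.mp4", click_xy=(480, 320), cot="Emphasize the passing bicycle bell.")
+ clip = targeted_editing(clip, instruction="Lower the traffic noise; keep the footsteps prominent.")
+ ```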
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/liuhuadai/ThinkSound/main/assets/figs/fig3_model.png" alt="ThinkSound Overview" width="100%"/>
+ </div>
+
+ ## ⚡ Quick Start
+
+ ### Environment Preparation
+
+ ```bash
+ git clone https://github.com/liuhuadai/ThinkSound.git
+ cd ThinkSound
+ pip install -r requirements.txt
+ conda install -y -c conda-forge 'ffmpeg<7'
+ # Download pretrained weights from Hugging Face:
+ # git lfs install
+ # git clone https://huggingface.co/liuhuadai/ThinkSound
+ # Move downloaded weights (e.g., "ThinkSound" folder containing model files) to ckpts/ directory
+ ```
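+
+ If you prefer not to use git LFS, the weights can also be fetched with the `huggingface_hub` Python package. This is a minimal sketch, assuming `huggingface_hub` is installed and that the demo scripts expect the weights under `ckpts/` (as noted in the comments above):
+
+ ```python
+ # Alternative to the git-lfs clone above: download the model repository into ckpts/.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="liuhuadai/ThinkSound",  # this Hugging Face model repository
+     local_dir="ckpts",               # target directory assumed by the demo scripts
+ )
+ ```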
+
+ ### Run the Script
+
+ ```bash
+ chmod +x scripts/demo.sh
+ ./scripts/demo.sh <video_path> <title> <CoT description>
+ ```
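+
+ A concrete invocation might look like the following; the video path, title, and CoT text are hypothetical placeholders, not files shipped with the repository:
+
+ ```bash
+ # Hypothetical example inputs; substitute your own video and descriptions.
+ ./scripts/demo.sh demo_videos/rainy_street.mp4 "Rainy street at night" "Soft rain falls on pavement while a car passes; distant thunder rolls near the end."
+ ```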
+
+ ### Web Interface Usage
+
+ For an interactive experience, launch the Gradio web interface:
+
+ ```bash
+ python app.py
+ ```
+
+ ## 📄 License
+
+ This project is released under the [Apache 2.0 License](https://github.com/liuhuadai/ThinkSound/blob/main/LICENSE).
+
+ > **Note:**
+ > The code, models, and dataset are **for research and educational purposes only**.
+ > **Commercial use is NOT permitted.**
+ >
+ > For commercial licensing, please contact the authors.
+
+ ## 📖 Citation
+
+ If you find ThinkSound useful in your research or work, please cite our paper:

  ```bibtex
  @misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
+ title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
  author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
  year={2025},
  eprint={2506.21448},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
+ url={https://arxiv.org/abs/2506.21448},
  }
  ```