lym0302 committed on
Commit
eedfa8e
·
1 Parent(s): bafca5a
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. LICENSE +21 -0
  2. README.md +185 -14
  3. app.py +343 -0
  4. batch_eval.py +110 -0
  5. config/__init__.py +0 -0
  6. config/base_config.yaml +62 -0
  7. config/data/base.yaml +70 -0
  8. config/eval_config.yaml +17 -0
  9. config/eval_data/base.yaml +22 -0
  10. config/hydra/job_logging/custom-eval.yaml +32 -0
  11. config/hydra/job_logging/custom-no-rank.yaml +32 -0
  12. config/hydra/job_logging/custom-simplest.yaml +26 -0
  13. config/hydra/job_logging/custom.yaml +33 -0
  14. config/train_config.yaml +41 -0
  15. demo.py +141 -0
  16. docs/EVAL.md +22 -0
  17. docs/MODELS.md +50 -0
  18. docs/TRAINING.md +184 -0
  19. docs/images/icon.png +0 -0
  20. docs/index.html +149 -0
  21. docs/style.css +78 -0
  22. docs/style_videos.css +52 -0
  23. docs/video_gen.html +254 -0
  24. docs/video_main.html +98 -0
  25. docs/video_vgg.html +452 -0
  26. gradio_demo.py +343 -0
  27. mmaudio/__init__.py +0 -0
  28. mmaudio/__pycache__/__init__.cpython-310.pyc +0 -0
  29. mmaudio/__pycache__/__init__.cpython-38.pyc +0 -0
  30. mmaudio/__pycache__/eval_utils.cpython-310.pyc +0 -0
  31. mmaudio/__pycache__/eval_utils.cpython-38.pyc +0 -0
  32. mmaudio/data/__init__.py +0 -0
  33. mmaudio/data/__pycache__/__init__.cpython-310.pyc +0 -0
  34. mmaudio/data/__pycache__/__init__.cpython-38.pyc +0 -0
  35. mmaudio/data/__pycache__/av_utils.cpython-310.pyc +0 -0
  36. mmaudio/data/__pycache__/av_utils.cpython-38.pyc +0 -0
  37. mmaudio/data/av_utils.py +162 -0
  38. mmaudio/data/data_setup.py +174 -0
  39. mmaudio/data/eval/__init__.py +0 -0
  40. mmaudio/data/eval/audiocaps.py +39 -0
  41. mmaudio/data/eval/moviegen.py +131 -0
  42. mmaudio/data/eval/video_dataset.py +197 -0
  43. mmaudio/data/extracted_audio.py +88 -0
  44. mmaudio/data/extracted_vgg.py +101 -0
  45. mmaudio/data/extraction/__init__.py +0 -0
  46. mmaudio/data/extraction/vgg_sound.py +193 -0
  47. mmaudio/data/extraction/wav_dataset.py +132 -0
  48. mmaudio/data/mm_dataset.py +45 -0
  49. mmaudio/data/utils.py +148 -0
  50. mmaudio/eval_utils.py +255 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Sony Research Inc.
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,14 +1,185 @@
1
- ---
2
- title: DeepSound V1
3
- emoji: 📚
4
- colorFrom: red
5
- colorTo: red
6
- sdk: gradio
7
- sdk_version: 5.22.0
8
- app_file: app.py
9
- pinned: false
10
- license: mit
11
- short_description: DeepSound-V1 demo
12
- ---
13
-
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ <div align="center">
2
+ <p align="center">
3
+ <h2>MMAudio</h2>
4
+ <a href="https://arxiv.org/abs/2412.15322">Paper</a> | <a href="https://hkchengrex.github.io/MMAudio">Webpage</a> | <a href="https://huggingface.co/hkchengrex/MMAudio/tree/main">Models</a> | <a href="https://huggingface.co/spaces/hkchengrex/MMAudio"> Huggingface Demo</a> | <a href="https://colab.research.google.com/drive/1TAaXCY2-kPk4xE4PwKB3EqFbSnkUuzZ8?usp=sharing">Colab Demo</a> | <a href="https://replicate.com/zsxkib/mmaudio">Replicate Demo</a>
5
+ </p>
6
+ </div>
7
+
8
+ ## [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)
9
+
10
+ [Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)
11
+
12
+ University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation
13
+
14
+ CVPR 2025
15
+
16
+ ## Highlight
17
+
18
+ MMAudio generates synchronized audio given video and/or text inputs.
19
+ Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.
20
+ Moreover, a synchronization module aligns the generated audio with the video frames.
21
+
22
+ ## Results
23
+
24
+ (All audio from our algorithm MMAudio)
25
+
26
+ Videos from Sora:
27
+
28
+ https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330
29
+
30
+ Videos from Veo 2:
31
+
32
+ https://github.com/user-attachments/assets/8a11419e-fee2-46e0-9e67-dfb03c48d00e
33
+
34
+ Videos from MovieGen/Hunyuan Video/VGGSound:
35
+
36
+ https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca
37
+
38
+ For more results, visit https://hkchengrex.com/MMAudio/video_main.html.
39
+
40
+
41
+ ## Installation
42
+
43
+ We have only tested this on Ubuntu.
44
+
45
+ ### Prerequisites
46
+
47
+ We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.
48
+
49
+ - Python 3.9+
50
+ - PyTorch **2.5.1+** and the corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/; pip install recommended)
51
+ <!-- - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`) -->
52
+
53
+ **1. Install the prerequisites if not yet met:**
54
+
55
+ ```bash
56
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
57
+ ```
58
+
59
+ (Or any other CUDA version that your GPU/driver supports.)
60
+
61
+ <!-- ```
62
+ conda install -c conda-forge 'ffmpeg<7'
63
+ ```
64
+ (Optional, if you use miniforge and don't already have the appropriate ffmpeg) -->
65
+
66
+ **2. Clone our repository:**
67
+
68
+ ```bash
69
+ git clone https://github.com/hkchengrex/MMAudio.git
70
+ ```
71
+
72
+ **3. Install with pip (install PyTorch first before attempting this!):**
73
+
74
+ ```bash
75
+ cd MMAudio
76
+ pip install -e .
77
+ ```
78
+
79
+ (If you encounter a `File "setup.py" not found` error, upgrade pip with `pip install --upgrade pip`.)
80
+
81
+
82
+ **Pretrained models:**
83
+
84
+ The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
85
+ The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
86
+ See [MODELS.md](docs/MODELS.md) for more details.
87
+
88
+ ## Demo
89
+
90
+ By default, these scripts use the `large_44k_v2` model.
91
+ In our experiments, inference only takes around 6 GB of GPU memory (in 16-bit mode), which should fit in most modern GPUs.
92
+
93
+ ### Command-line interface
94
+
95
+ With `demo.py`
96
+
97
+ ```bash
98
+ python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
99
+ ```
100
+
101
+ The output (audio in `.flac` format, and video in `.mp4` format) will be saved in `./output`.
102
+ See the script for more options.
103
+ Simply omit the `--video` option for text-to-audio synthesis.
104
+ The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may result in lower quality.
105
+
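+ For example, a text-to-audio run (no `--video`) with an explicit seed and output directory could look like the following; the flags mirror the argument list in `demo.py`, and the prompt and values are only illustrative:
+
+ ```bash
+ python demo.py --duration=8 --prompt "waves crashing on a beach" --negative_prompt "music" --seed 42 --output ./output
+ ```
+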
106
+ ### Gradio interface
107
+
108
+ Supports video-to-audio and text-to-audio synthesis.
109
+ You can also try experimental image-to-audio synthesis, which duplicates the input image into a video for processing. This might be interesting to some, but it is not something MMAudio has been trained for.
110
+ Use [port forwarding](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) (e.g., `ssh -L 7860:localhost:7860 server`) if necessary. The default port is `7860`, which you can change with `--port`.
111
+
112
+ ```bash
113
+ python gradio_demo.py
114
+ ```
115
+
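+ For remote servers, a typical combination of port forwarding and an explicit port would be (illustrative):
+
+ ```bash
+ # on your local machine: forward local port 7860 to the server
+ ssh -L 7860:localhost:7860 server
+ # on the server: launch the demo, optionally on a different port
+ python gradio_demo.py --port 7860
+ ```
+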
116
+ ### FAQ
117
+
118
+ 1. Video processing
119
+ - Processing higher-resolution videos takes longer due to encoding and decoding (which can take >95% of the processing time!), but it does not improve the quality of results.
120
+ - The CLIP encoder resizes input frames to 384×384 pixels.
121
+ - Synchformer resizes the shorter edge to 224 pixels and applies a center crop, focusing only on the central square of each frame.
122
+ 2. Frame rates
123
+ - The CLIP model operates at 8 FPS, while Synchformer works at 25 FPS.
124
+ - Frame rate conversion happens on-the-fly via the video reader.
125
+ - For input videos with a frame rate below 25 FPS, frames will be duplicated to match the required rate.
126
+ 3. Failure cases
127
+ As with most models of this type, failures can occur, and the reasons are not always clear. Below are some known failure modes. If you notice a failure mode or believe there’s a bug, feel free to open an issue in the repository.
128
+ 4. Performance variations
129
+ We have noticed subtle performance variations across different hardware and software environments. Contributing factors include using/not using `torch.compile`, the video reader library/backend, inference precision, batch sizes, random seeds, etc. We (will) provide pre-computed results on standard benchmarks for reference. Results obtained from this codebase should be similar but might not be exactly the same.
130
+
131
+ ### Known limitations
132
+
133
+ 1. The model sometimes generates unintelligible human speech-like sounds
134
+ 2. The model sometimes generates background music (without explicit training, it would not be high quality)
135
+ 3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfire" but not "RPG firing".
136
+
137
+ We believe all three of these limitations can be addressed with more high-quality training data.
138
+
139
+ ## Training
140
+
141
+ See [TRAINING.md](docs/TRAINING.md).
142
+
143
+ ## Evaluation
144
+
145
+ See [EVAL.md](docs/EVAL.md).
146
+
147
+ ## Training Datasets
148
+
149
+ MMAudio was trained on several datasets, including [AudioSet](https://research.google.com/audioset/), [Freesound](https://github.com/LAION-AI/audio-dataset/blob/main/laion-audio-630k/README.md), [VGGSound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [AudioCaps](https://audiocaps.github.io/), and [WavCaps](https://github.com/XinhaoMei/WavCaps). These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.
150
+
151
+ ## Update Logs
152
+
153
+ - 2025-03-09: Uploaded the corrected tsv files. See [TRAINING.md](docs/TRAINING.md).
154
+ - 2025-02-27: Disabled the GradScaler by default to improve training stability. See #49.
155
+ - 2024-12-23: Added training and batch evaluation scripts.
156
+ - 2024-12-14: Removed the `ffmpeg<7` requirement for the demos by replacing `torio.io.StreamingMediaDecoder` with `pyav` for reading frames. The read frames are also cached, so we are not reading the same frames again during reconstruction. This should speed things up and make installation less of a hassle.
157
+ - 2024-12-13: Improved for-loop processing in CLIP/Sync feature extraction by introducing a batch size multiplier. We can use approximately a 40x larger batch size for CLIP/Sync without using more memory, thereby speeding up processing. Removed the VAE encoder during inference -- we don't need it.
158
+ - 2024-12-11: Replaced `torio.io.StreamingMediaDecoder` with `pyav` for reading the frame rate when reconstructing the input video. `torio.io.StreamingMediaDecoder` does not work reliably in Hugging Face ZeroGPU's environment, and I suspect that it might not work in some other environments as well.
159
+
160
+ ## Citation
161
+
162
+ ```bibtex
163
+ @inproceedings{cheng2025taming,
164
+ title={Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
165
+ author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
166
+ booktitle={CVPR},
167
+ year={2025}
168
+ }
169
+ ```
170
+
171
+ ## Relevant Repositories
172
+
173
+ - [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.
174
+
175
+ ## Disclaimer
176
+
177
+ We have no affiliation with and have no knowledge of the party behind the domain "mmaudio.net".
178
+
179
+ ## Acknowledgement
180
+
181
+ Many thanks to:
182
+ - [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model and the VAE architecture
183
+ - [BigVGAN](https://github.com/NVIDIA/BigVGAN)
184
+ - [Synchformer](https://github.com/v-iashin/Synchformer)
185
+ - [EDM2](https://github.com/NVlabs/edm2) for the magnitude-preserving VAE network architecture
app.py ADDED
@@ -0,0 +1,343 @@
1
+ import gc
2
+ import logging
3
+ from argparse import ArgumentParser
4
+ from datetime import datetime
5
+ from fractions import Fraction
6
+ from pathlib import Path
7
+
8
+ import gradio as gr
9
+ import torch
10
+ import torchaudio
11
+
12
+ from mmaudio.eval_utils import (ModelConfig, VideoInfo, all_model_cfg, generate, load_image,
13
+ load_video, make_video, setup_eval_logging)
14
+ from mmaudio.model.flow_matching import FlowMatching
15
+ from mmaudio.model.networks import MMAudio, get_my_mmaudio
16
+ from mmaudio.model.sequence_config import SequenceConfig
17
+ from mmaudio.model.utils.features_utils import FeaturesUtils
18
+
19
+ torch.backends.cuda.matmul.allow_tf32 = True
20
+ torch.backends.cudnn.allow_tf32 = True
21
+
22
+ log = logging.getLogger()
23
+
24
+ device = 'cpu'
25
+ if torch.cuda.is_available():
26
+ device = 'cuda'
27
+ elif torch.backends.mps.is_available():
28
+ device = 'mps'
29
+ else:
30
+ log.warning('CUDA/MPS are not available, running on CPU')
31
+ dtype = torch.bfloat16
32
+
33
+ model: ModelConfig = all_model_cfg['large_44k_v2']
34
+ model.download_if_needed()
35
+ output_dir = Path('./output/gradio')
36
+
37
+ setup_eval_logging()
38
+
39
+
40
+ def get_model() -> tuple[MMAudio, FeaturesUtils, SequenceConfig]:
41
+ seq_cfg = model.seq_cfg
42
+
43
+ net: MMAudio = get_my_mmaudio(model.model_name).to(device, dtype).eval()
44
+ net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
45
+ log.info(f'Loaded weights from {model.model_path}')
46
+
47
+ feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
48
+ synchformer_ckpt=model.synchformer_ckpt,
49
+ enable_conditions=True,
50
+ mode=model.mode,
51
+ bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
52
+ need_vae_encoder=False)
53
+ feature_utils = feature_utils.to(device, dtype).eval()
54
+
55
+ return net, feature_utils, seq_cfg
56
+
57
+
58
+ net, feature_utils, seq_cfg = get_model()
59
+
60
+
61
+ @torch.inference_mode()
62
+ def video_to_audio(video: gr.Video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
63
+ cfg_strength: float, duration: float):
64
+
65
+ rng = torch.Generator(device=device)
66
+ if seed >= 0:
67
+ rng.manual_seed(seed)
68
+ else:
69
+ rng.seed()
70
+ fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
71
+
72
+ video_info = load_video(video, duration)
73
+ clip_frames = video_info.clip_frames
74
+ sync_frames = video_info.sync_frames
75
+ duration = video_info.duration_sec
76
+ clip_frames = clip_frames.unsqueeze(0)
77
+ sync_frames = sync_frames.unsqueeze(0)
78
+ seq_cfg.duration = duration
79
+ net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
80
+
81
+ audios = generate(clip_frames,
82
+ sync_frames, [prompt],
83
+ negative_text=[negative_prompt],
84
+ feature_utils=feature_utils,
85
+ net=net,
86
+ fm=fm,
87
+ rng=rng,
88
+ cfg_strength=cfg_strength)
89
+ audio = audios.float().cpu()[0]
90
+
91
+ current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
92
+ output_dir.mkdir(exist_ok=True, parents=True)
93
+ video_save_path = output_dir / f'{current_time_string}.mp4'
94
+ make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
95
+ gc.collect()
96
+ return video_save_path
97
+
98
+
99
+ @torch.inference_mode()
100
+ def image_to_audio(image: gr.Image, prompt: str, negative_prompt: str, seed: int, num_steps: int,
101
+ cfg_strength: float, duration: float):
102
+
103
+ rng = torch.Generator(device=device)
104
+ if seed >= 0:
105
+ rng.manual_seed(seed)
106
+ else:
107
+ rng.seed()
108
+ fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
109
+
110
+ image_info = load_image(image)
111
+ clip_frames = image_info.clip_frames
112
+ sync_frames = image_info.sync_frames
113
+ clip_frames = clip_frames.unsqueeze(0)
114
+ sync_frames = sync_frames.unsqueeze(0)
115
+ seq_cfg.duration = duration
116
+ net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
117
+
118
+ audios = generate(clip_frames,
119
+ sync_frames, [prompt],
120
+ negative_text=[negative_prompt],
121
+ feature_utils=feature_utils,
122
+ net=net,
123
+ fm=fm,
124
+ rng=rng,
125
+ cfg_strength=cfg_strength,
126
+ image_input=True)
127
+ audio = audios.float().cpu()[0]
128
+
129
+ current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
130
+ output_dir.mkdir(exist_ok=True, parents=True)
131
+ video_save_path = output_dir / f'{current_time_string}.mp4'
132
+ video_info = VideoInfo.from_image_info(image_info, duration, fps=Fraction(1))
133
+ make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
134
+ gc.collect()
135
+ return video_save_path
136
+
137
+
138
+ @torch.inference_mode()
139
+ def text_to_audio(prompt: str, negative_prompt: str, seed: int, num_steps: int, cfg_strength: float,
140
+ duration: float):
141
+
142
+ rng = torch.Generator(device=device)
143
+ if seed >= 0:
144
+ rng.manual_seed(seed)
145
+ else:
146
+ rng.seed()
147
+ fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
148
+
149
+ clip_frames = sync_frames = None
150
+ seq_cfg.duration = duration
151
+ net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
152
+
153
+ audios = generate(clip_frames,
154
+ sync_frames, [prompt],
155
+ negative_text=[negative_prompt],
156
+ feature_utils=feature_utils,
157
+ net=net,
158
+ fm=fm,
159
+ rng=rng,
160
+ cfg_strength=cfg_strength)
161
+ audio = audios.float().cpu()[0]
162
+
163
+ current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
164
+ output_dir.mkdir(exist_ok=True, parents=True)
165
+ audio_save_path = output_dir / f'{current_time_string}.flac'
166
+ torchaudio.save(audio_save_path, audio, seq_cfg.sampling_rate)
167
+ gc.collect()
168
+ return audio_save_path
169
+
170
+
171
+ video_to_audio_tab = gr.Interface(
172
+ fn=video_to_audio,
173
+ description="""
174
+ Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
175
+ Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
176
+
177
+ NOTE: It takes longer to process high-resolution videos (>384 px on the shorter side).
178
+ Doing so does not improve results.
179
+ """,
180
+ inputs=[
181
+ gr.Video(),
182
+ gr.Text(label='Prompt'),
183
+ gr.Text(label='Negative prompt', value='music'),
184
+ gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
185
+ gr.Number(label='Num steps', value=25, precision=0, minimum=1),
186
+ gr.Number(label='Guidance Strength', value=4.5, minimum=1),
187
+ gr.Number(label='Duration (sec)', value=8, minimum=1),
188
+ ],
189
+ outputs='playable_video',
190
+ cache_examples=False,
191
+ title='MMAudio — Video-to-Audio Synthesis',
192
+ examples=[
193
+ [
194
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_beach.mp4',
195
+ 'waves, seagulls',
196
+ '',
197
+ 0,
198
+ 25,
199
+ 4.5,
200
+ 10,
201
+ ],
202
+ [
203
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_serpent.mp4',
204
+ '',
205
+ 'music',
206
+ 0,
207
+ 25,
208
+ 4.5,
209
+ 10,
210
+ ],
211
+ [
212
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_seahorse.mp4',
213
+ 'bubbles',
214
+ '',
215
+ 0,
216
+ 25,
217
+ 4.5,
218
+ 10,
219
+ ],
220
+ [
221
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_india.mp4',
222
+ 'Indian holy music',
223
+ '',
224
+ 0,
225
+ 25,
226
+ 4.5,
227
+ 10,
228
+ ],
229
+ [
230
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_galloping.mp4',
231
+ 'galloping',
232
+ '',
233
+ 0,
234
+ 25,
235
+ 4.5,
236
+ 10,
237
+ ],
238
+ [
239
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_kraken.mp4',
240
+ 'waves, storm',
241
+ '',
242
+ 0,
243
+ 25,
244
+ 4.5,
245
+ 10,
246
+ ],
247
+ [
248
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/mochi_storm.mp4',
249
+ 'storm',
250
+ '',
251
+ 0,
252
+ 25,
253
+ 4.5,
254
+ 10,
255
+ ],
256
+ [
257
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_spring.mp4',
258
+ '',
259
+ '',
260
+ 0,
261
+ 25,
262
+ 4.5,
263
+ 10,
264
+ ],
265
+ [
266
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_typing.mp4',
267
+ 'typing',
268
+ '',
269
+ 0,
270
+ 25,
271
+ 4.5,
272
+ 10,
273
+ ],
274
+ [
275
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_wake_up.mp4',
276
+ '',
277
+ '',
278
+ 0,
279
+ 25,
280
+ 4.5,
281
+ 10,
282
+ ],
283
+ [
284
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_nyc.mp4',
285
+ '',
286
+ '',
287
+ 0,
288
+ 25,
289
+ 4.5,
290
+ 10,
291
+ ],
292
+ ])
293
+
294
+ text_to_audio_tab = gr.Interface(
295
+ fn=text_to_audio,
296
+ description="""
297
+ Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
298
+ Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
299
+ """,
300
+ inputs=[
301
+ gr.Text(label='Prompt'),
302
+ gr.Text(label='Negative prompt'),
303
+ gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
304
+ gr.Number(label='Num steps', value=25, precision=0, minimum=1),
305
+ gr.Number(label='Guidance Strength', value=4.5, minimum=1),
306
+ gr.Number(label='Duration (sec)', value=8, minimum=1),
307
+ ],
308
+ outputs='audio',
309
+ cache_examples=False,
310
+ title='MMAudio — Text-to-Audio Synthesis',
311
+ )
312
+
313
+ image_to_audio_tab = gr.Interface(
314
+ fn=image_to_audio,
315
+ description="""
316
+ Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
317
+ Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
318
+
319
+ NOTE: It takes longer to process high-resolution images (>384 px on the shorter side).
320
+ Doing so does not improve results.
321
+ """,
322
+ inputs=[
323
+ gr.Image(type='filepath'),
324
+ gr.Text(label='Prompt'),
325
+ gr.Text(label='Negative prompt'),
326
+ gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
327
+ gr.Number(label='Num steps', value=25, precision=0, minimum=1),
328
+ gr.Number(label='Guidance Strength', value=4.5, minimum=1),
329
+ gr.Number(label='Duration (sec)', value=8, minimum=1),
330
+ ],
331
+ outputs='playable_video',
332
+ cache_examples=False,
333
+ title='MMAudio — Image-to-Audio Synthesis (experimental)',
334
+ )
335
+
336
+ if __name__ == "__main__":
337
+ parser = ArgumentParser()
338
+ parser.add_argument('--port', type=int, default=7860)
339
+ args = parser.parse_args()
340
+
341
+ gr.TabbedInterface([video_to_audio_tab, text_to_audio_tab, image_to_audio_tab],
342
+ ['Video-to-Audio', 'Text-to-Audio', 'Image-to-Audio (experimental)']).launch(
343
+ server_port=args.port, allowed_paths=[output_dir])
batch_eval.py ADDED
@@ -0,0 +1,110 @@
1
+ import logging
2
+ import os
3
+ from pathlib import Path
4
+
5
+ import hydra
6
+ import torch
7
+ import torch.distributed as distributed
8
+ import torchaudio
9
+ from hydra.core.hydra_config import HydraConfig
10
+ from omegaconf import DictConfig
11
+ from tqdm import tqdm
12
+
13
+ from mmaudio.data.data_setup import setup_eval_dataset
14
+ from mmaudio.eval_utils import ModelConfig, all_model_cfg, generate
15
+ from mmaudio.model.flow_matching import FlowMatching
16
+ from mmaudio.model.networks import MMAudio, get_my_mmaudio
17
+ from mmaudio.model.utils.features_utils import FeaturesUtils
18
+
19
+ torch.backends.cuda.matmul.allow_tf32 = True
20
+ torch.backends.cudnn.allow_tf32 = True
21
+
22
+ local_rank = int(os.environ['LOCAL_RANK'])
23
+ world_size = int(os.environ['WORLD_SIZE'])
24
+ log = logging.getLogger()
25
+
26
+
27
+ @torch.inference_mode()
28
+ @hydra.main(version_base='1.3.2', config_path='config', config_name='eval_config.yaml')
29
+ def main(cfg: DictConfig):
30
+ device = 'cuda'
31
+ torch.cuda.set_device(local_rank)
32
+
33
+ if cfg.model not in all_model_cfg:
34
+ raise ValueError(f'Unknown model variant: {cfg.model}')
35
+ model: ModelConfig = all_model_cfg[cfg.model]
36
+ model.download_if_needed()
37
+ seq_cfg = model.seq_cfg
38
+
39
+ run_dir = Path(HydraConfig.get().run.dir)
40
+ if cfg.output_name is None:
41
+ output_dir = run_dir / cfg.dataset
42
+ else:
43
+ output_dir = run_dir / f'{cfg.dataset}-{cfg.output_name}'
44
+ output_dir.mkdir(parents=True, exist_ok=True)
45
+
46
+ # load a pretrained model
47
+ seq_cfg.duration = cfg.duration_s
48
+ net: MMAudio = get_my_mmaudio(cfg.model).to(device).eval()
49
+ net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
50
+ log.info(f'Loaded weights from {model.model_path}')
51
+ net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
52
+ log.info(f'Latent seq len: {seq_cfg.latent_seq_len}')
53
+ log.info(f'Clip seq len: {seq_cfg.clip_seq_len}')
54
+ log.info(f'Sync seq len: {seq_cfg.sync_seq_len}')
55
+
56
+ # misc setup
57
+ rng = torch.Generator(device=device)
58
+ rng.manual_seed(cfg.seed)
59
+ fm = FlowMatching(cfg.sampling.min_sigma,
60
+ inference_mode=cfg.sampling.method,
61
+ num_steps=cfg.sampling.num_steps)
62
+
63
+ feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
64
+ synchformer_ckpt=model.synchformer_ckpt,
65
+ enable_conditions=True,
66
+ mode=model.mode,
67
+ bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
68
+ need_vae_encoder=False)
69
+ feature_utils = feature_utils.to(device).eval()
70
+
71
+ if cfg.compile:
72
+ net.preprocess_conditions = torch.compile(net.preprocess_conditions)
73
+ net.predict_flow = torch.compile(net.predict_flow)
74
+ feature_utils.compile()
75
+
76
+ dataset, loader = setup_eval_dataset(cfg.dataset, cfg)
77
+
78
+ with torch.amp.autocast(enabled=cfg.amp, dtype=torch.bfloat16, device_type=device):
79
+ for batch in tqdm(loader):
80
+ audios = generate(batch.get('clip_video', None),
81
+ batch.get('sync_video', None),
82
+ batch.get('caption', None),
83
+ feature_utils=feature_utils,
84
+ net=net,
85
+ fm=fm,
86
+ rng=rng,
87
+ cfg_strength=cfg.cfg_strength,
88
+ clip_batch_size_multiplier=64,
89
+ sync_batch_size_multiplier=64)
90
+ audios = audios.float().cpu()
91
+ names = batch['name']
92
+ for audio, name in zip(audios, names):
93
+ torchaudio.save(output_dir / f'{name}.flac', audio, seq_cfg.sampling_rate)
94
+
95
+
96
+ def distributed_setup():
97
+ distributed.init_process_group(backend="nccl")
98
+ local_rank = distributed.get_rank()
99
+ world_size = distributed.get_world_size()
100
+ log.info(f'Initialized: local_rank={local_rank}, world_size={world_size}')
101
+ return local_rank, world_size
102
+
103
+
104
+ if __name__ == '__main__':
105
+ distributed_setup()
106
+
107
+ main()
108
+
109
+ # clean-up
110
+ distributed.destroy_process_group()
config/__init__.py ADDED
File without changes
config/base_config.yaml ADDED
@@ -0,0 +1,62 @@
1
+ defaults:
2
+ - data: base
3
+ - eval_data: base
4
+ - override hydra/job_logging: custom-simplest
5
+ - _self_
6
+
7
+ hydra:
8
+ run:
9
+ dir: ./output/${exp_id}
10
+ output_subdir: ${now:%Y-%m-%d_%H-%M-%S}-hydra
11
+
12
+ enable_email: False
13
+
14
+ model: small_16k
15
+
16
+ exp_id: default
17
+ debug: False
18
+ cudnn_benchmark: True
19
+ compile: True
20
+ amp: True
21
+ weights: null
22
+ checkpoint: null
23
+ seed: 14159265
24
+ num_workers: 10 # per-GPU
25
+ pin_memory: False # set to True if your system can handle it, i.e., have enough memory
26
+
27
+ # NOTE: This DOES NOT affect the model during inference in any way
28
+ # they are just for the dataloader to fill in the missing data in multi-modal loading
29
+ # to change the sequence length for the model, see networks.py
30
+ data_dim:
31
+ text_seq_len: 77
32
+ clip_dim: 1024
33
+ sync_dim: 768
34
+ text_dim: 1024
35
+
36
+ # ema configuration
37
+ ema:
38
+ enable: True
39
+ sigma_rels: [0.05, 0.1]
40
+ update_every: 1
41
+ checkpoint_every: 5_000
42
+ checkpoint_folder: ${hydra:run.dir}/ema_ckpts
43
+ default_output_sigma: 0.05
44
+
45
+
46
+ # sampling
47
+ sampling:
48
+ mean: 0.0
49
+ scale: 1.0
50
+ min_sigma: 0.0
51
+ method: euler
52
+ num_steps: 25
53
+
54
+ # classifier-free guidance
55
+ null_condition_probability: 0.1
56
+ cfg_strength: 4.5
57
+
58
+ # checkpoint paths to external modules
59
+ vae_16k_ckpt: ./ext_weights/v1-16.pth
60
+ vae_44k_ckpt: ./ext_weights/v1-44.pth
61
+ bigvgan_vocoder_ckpt: ./ext_weights/best_netG.pt
62
+ synchformer_ckpt: ./ext_weights/synchformer_state_dict.pth
config/data/base.yaml ADDED
@@ -0,0 +1,70 @@
1
+ VGGSound:
2
+ root: ../data/video
3
+ subset_name: sets/vgg3-train.tsv
4
+ fps: 8
5
+ height: 384
6
+ width: 384
7
+ sample_duration_sec: 8.0
8
+
9
+ VGGSound_test:
10
+ root: ../data/video
11
+ subset_name: sets/vgg3-test.tsv
12
+ fps: 8
13
+ height: 384
14
+ width: 384
15
+ sample_duration_sec: 8.0
16
+
17
+ VGGSound_val:
18
+ root: ../data/video
19
+ subset_name: sets/vgg3-val.tsv
20
+ fps: 8
21
+ height: 384
22
+ width: 384
23
+ sample_duration_sec: 8.0
24
+
25
+ ExtractedVGG:
26
+ tsv: ../data/v1-16-memmap/vgg-train.tsv
27
+ memmap_dir: ../data/v1-16-memmap/vgg-train
28
+
29
+ ExtractedVGG_test:
30
+ tag: test
31
+ gt_cache: ../data/eval-cache/vggsound-test
32
+ output_subdir: null
33
+ tsv: ../data/v1-16-memmap/vgg-test.tsv
34
+ memmap_dir: ../data/v1-16-memmap/vgg-test
35
+
36
+ ExtractedVGG_val:
37
+ tag: val
38
+ gt_cache: ../data/eval-cache/vggsound-val
39
+ output_subdir: val
40
+ tsv: ../data/v1-16-memmap/vgg-val.tsv
41
+ memmap_dir: ../data/v1-16-memmap/vgg-val
42
+
43
+ AudioCaps:
44
+ tsv: ../data/v1-16-memmap/audiocaps.tsv
45
+ memmap_dir: ../data/v1-16-memmap/audiocaps
46
+
47
+ AudioSetSL:
48
+ tsv: ../data/v1-16-memmap/audioset_sl.tsv
49
+ memmap_dir: ../data/v1-16-memmap/audioset_sl
50
+
51
+ BBCSound:
52
+ tsv: ../data/v1-16-memmap/bbcsound.tsv
53
+ memmap_dir: ../data/v1-16-memmap/bbcsound
54
+
55
+ FreeSound:
56
+ tsv: ../data/v1-16-memmap/freesound.tsv
57
+ memmap_dir: ../data/v1-16-memmap/freesound
58
+
59
+ Clotho:
60
+ tsv: ../data/v1-16-memmap/clotho.tsv
61
+ memmap_dir: ../data/v1-16-memmap/clotho
62
+
63
+ Example_video:
64
+ tsv: ./training/example_output/memmap/vgg-example.tsv
65
+ memmap_dir: ./training/example_output/memmap/vgg-example
66
+
67
+ Example_audio:
68
+ tsv: ./training/example_output/memmap/audio-example.tsv
69
+ memmap_dir: ./training/example_output/memmap/audio-example
70
+
config/eval_config.yaml ADDED
@@ -0,0 +1,17 @@
1
+ defaults:
2
+ - base_config
3
+ - override hydra/job_logging: custom-simplest
4
+ - _self_
5
+
6
+ hydra:
7
+ run:
8
+ dir: ./output/${exp_id}
9
+ output_subdir: eval-${now:%Y-%m-%d_%H-%M-%S}-hydra
10
+
11
+ exp_id: ${model}
12
+ dataset: audiocaps
13
+ duration_s: 8.0
14
+
15
+ # for inference, this is the per-GPU batch size
16
+ batch_size: 16
17
+ output_name: null
config/eval_data/base.yaml ADDED
@@ -0,0 +1,22 @@
1
+ AudioCaps:
2
+ audio_path: ../data/AudioCaps-test-audioldm-ver
3
+ # a csv file, with a header row of 'name' and 'caption'
4
+ # name should match the audio file name without extension
5
+ # Can be downloaded here: https://github.com/hkchengrex/MMAudio/releases/download/v0.1/AudioCaps_audioldm_data.csv
6
+ csv_path: ../data/AudioCaps-test-audioldm-ver/data.csv
7
+
8
+ AudioCaps_full:
9
+ audio_path: ../data/AudioCaps-test-full-ver
10
+ # a csv file, with a header row of 'name' and 'caption'
11
+ # name should match the audio file name without extension
12
+ # Can be downloaded here: https://github.com/hkchengrex/MMAudio/releases/download/v0.1/AudioCaps_full_data.csv
13
+ csv_path: ../data/AudioCaps-test-full-ver/data.csv
14
+
15
+ MovieGen:
16
+ video_path: ../data/MovieGen/MovieGenAudioBenchSfx/video_with_audio
17
+ jsonl_path: ../data/MovieGen/MovieGenAudioBenchSfx/metadata
18
+
19
+ VGGSound:
20
+ video_path: ../data/test-videos
21
+ # from the officially released csv file
22
+ csv_path: ../data/vggsound.csv
config/hydra/job_logging/custom-eval.yaml ADDED
@@ -0,0 +1,32 @@
1
+ # python logging configuration for tasks
2
+ version: 1
3
+ formatters:
4
+ simple:
5
+ format: '[%(asctime)s][%(levelname)s][r${oc.env:LOCAL_RANK}] - %(message)s'
6
+ datefmt: '%Y-%m-%d %H:%M:%S'
7
+ colorlog:
8
+ '()': 'colorlog.ColoredFormatter'
9
+ format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
+ datefmt: '%Y-%m-%d %H:%M:%S'
11
+ log_colors:
12
+ DEBUG: purple
13
+ INFO: green
14
+ WARNING: yellow
15
+ ERROR: red
16
+ CRITICAL: red
17
+ handlers:
18
+ console:
19
+ class: logging.StreamHandler
20
+ formatter: colorlog
21
+ stream: ext://sys.stdout
22
+ file:
23
+ class: logging.FileHandler
24
+ formatter: simple
25
+ # absolute file path
26
+ filename: ${hydra.runtime.output_dir}/eval-${now:%Y-%m-%d_%H-%M-%S}-rank${oc.env:LOCAL_RANK}.log
27
+ mode: w
28
+ root:
29
+ level: INFO
30
+ handlers: [console, file]
31
+
32
+ disable_existing_loggers: false
config/hydra/job_logging/custom-no-rank.yaml ADDED
@@ -0,0 +1,32 @@
1
+ # python logging configuration for tasks
2
+ version: 1
3
+ formatters:
4
+ simple:
5
+ format: '[%(asctime)s][%(levelname)s] - %(message)s'
6
+ datefmt: '%Y-%m-%d %H:%M:%S'
7
+ colorlog:
8
+ '()': 'colorlog.ColoredFormatter'
9
+ format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
+ datefmt: '%Y-%m-%d %H:%M:%S'
11
+ log_colors:
12
+ DEBUG: purple
13
+ INFO: green
14
+ WARNING: yellow
15
+ ERROR: red
16
+ CRITICAL: red
17
+ handlers:
18
+ console:
19
+ class: logging.StreamHandler
20
+ formatter: colorlog
21
+ stream: ext://sys.stdout
22
+ file:
23
+ class: logging.FileHandler
24
+ formatter: simple
25
+ # absolute file path
26
+ filename: ${hydra.runtime.output_dir}/${now:%Y-%m-%d_%H-%M-%S}-eval.log
27
+ mode: w
28
+ root:
29
+ level: INFO
30
+ handlers: [console, file]
31
+
32
+ disable_existing_loggers: false
config/hydra/job_logging/custom-simplest.yaml ADDED
@@ -0,0 +1,26 @@
1
+ # python logging configuration for tasks
2
+ version: 1
3
+ formatters:
4
+ simple:
5
+ format: '[%(asctime)s][%(levelname)s] - %(message)s'
6
+ datefmt: '%Y-%m-%d %H:%M:%S'
7
+ colorlog:
8
+ '()': 'colorlog.ColoredFormatter'
9
+ format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
+ datefmt: '%Y-%m-%d %H:%M:%S'
11
+ log_colors:
12
+ DEBUG: purple
13
+ INFO: green
14
+ WARNING: yellow
15
+ ERROR: red
16
+ CRITICAL: red
17
+ handlers:
18
+ console:
19
+ class: logging.StreamHandler
20
+ formatter: colorlog
21
+ stream: ext://sys.stdout
22
+ root:
23
+ level: INFO
24
+ handlers: [console]
25
+
26
+ disable_existing_loggers: false
config/hydra/job_logging/custom.yaml ADDED
@@ -0,0 +1,33 @@
1
+ # @package hydra.job_logging
2
+ # python logging configuration for tasks
3
+ version: 1
4
+ formatters:
5
+ simple:
6
+ format: '[%(asctime)s][%(levelname)s][r${oc.env:LOCAL_RANK}] - %(message)s'
7
+ datefmt: '%Y-%m-%d %H:%M:%S'
8
+ colorlog:
9
+ '()': 'colorlog.ColoredFormatter'
10
+ format: '[%(cyan)s%(asctime)s%(reset)s][%(blue)sr${oc.env:LOCAL_RANK}%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
11
+ datefmt: '%Y-%m-%d %H:%M:%S'
12
+ log_colors:
13
+ DEBUG: purple
14
+ INFO: green
15
+ WARNING: yellow
16
+ ERROR: red
17
+ CRITICAL: red
18
+ handlers:
19
+ console:
20
+ class: logging.StreamHandler
21
+ formatter: colorlog
22
+ stream: ext://sys.stdout
23
+ file:
24
+ class: logging.FileHandler
25
+ formatter: simple
26
+ # absolute file path
27
+ filename: ${hydra.runtime.output_dir}/train-${now:%Y-%m-%d_%H-%M-%S}-rank${oc.env:LOCAL_RANK}.log
28
+ mode: w
29
+ root:
30
+ level: INFO
31
+ handlers: [console, file]
32
+
33
+ disable_existing_loggers: false
config/train_config.yaml ADDED
@@ -0,0 +1,41 @@
1
+ defaults:
2
+ - base_config
3
+ - override data: base
4
+ - override hydra/job_logging: custom
5
+ - _self_
6
+
7
+ hydra:
8
+ run:
9
+ dir: ./output/${exp_id}
10
+ output_subdir: train-${now:%Y-%m-%d_%H-%M-%S}-hydra
11
+
12
+ ema:
13
+ start: 0
14
+
15
+ mini_train: False
16
+ example_train: False
17
+ enable_grad_scaler: False
18
+ vgg_oversample_rate: 5
19
+
20
+ log_text_interval: 200
21
+ log_extra_interval: 20_000
22
+ val_interval: 5_000
23
+ eval_interval: 20_000
24
+ save_eval_interval: 40_000
25
+ save_weights_interval: 10_000
26
+ save_checkpoint_interval: 10_000
27
+ save_copy_iterations: []
28
+
29
+ batch_size: 512
30
+ eval_batch_size: 256 # per-GPU
31
+
32
+ num_iterations: 300_000
33
+ learning_rate: 1.0e-4
34
+ linear_warmup_steps: 1_000
35
+
36
+ lr_schedule: step
37
+ lr_schedule_steps: [240_000, 270_000]
38
+ lr_schedule_gamma: 0.1
39
+
40
+ clip_grad_norm: 1.0
41
+ weight_decay: 1.0e-6
demo.py ADDED
@@ -0,0 +1,141 @@
1
+ import logging
2
+ from argparse import ArgumentParser
3
+ from pathlib import Path
4
+
5
+ import torch
6
+ import torchaudio
7
+
8
+ from mmaudio.eval_utils import (ModelConfig, all_model_cfg, generate, load_video, make_video,
9
+ setup_eval_logging)
10
+ from mmaudio.model.flow_matching import FlowMatching
11
+ from mmaudio.model.networks import MMAudio, get_my_mmaudio
12
+ from mmaudio.model.utils.features_utils import FeaturesUtils
13
+
14
+ torch.backends.cuda.matmul.allow_tf32 = True
15
+ torch.backends.cudnn.allow_tf32 = True
16
+
17
+ log = logging.getLogger()
18
+
19
+
20
+ @torch.inference_mode()
21
+ def main():
22
+ setup_eval_logging()
23
+
24
+ parser = ArgumentParser()
25
+ parser.add_argument('--variant',
26
+ type=str,
27
+ default='large_44k_v2',
28
+ help='small_16k, small_44k, medium_44k, large_44k, large_44k_v2')
29
+ parser.add_argument('--video', type=Path, help='Path to the video file')
30
+ parser.add_argument('--prompt', type=str, help='Input prompt', default='')
31
+ parser.add_argument('--negative_prompt', type=str, help='Negative prompt', default='')
32
+ parser.add_argument('--duration', type=float, default=8.0)
33
+ parser.add_argument('--cfg_strength', type=float, default=4.5)
34
+ parser.add_argument('--num_steps', type=int, default=25)
35
+
36
+ parser.add_argument('--mask_away_clip', action='store_true')
37
+
38
+ parser.add_argument('--output', type=Path, help='Output directory', default='./output')
39
+ parser.add_argument('--seed', type=int, help='Random seed', default=42)
40
+ parser.add_argument('--skip_video_composite', action='store_true')
41
+ parser.add_argument('--full_precision', action='store_true')
42
+
43
+ args = parser.parse_args()
44
+
45
+ if args.variant not in all_model_cfg:
46
+ raise ValueError(f'Unknown model variant: {args.variant}')
47
+ model: ModelConfig = all_model_cfg[args.variant]
48
+ model.download_if_needed()
49
+ seq_cfg = model.seq_cfg
50
+
51
+ if args.video:
52
+ video_path: Path = Path(args.video).expanduser()
53
+ else:
54
+ video_path = None
55
+ prompt: str = args.prompt
56
+ negative_prompt: str = args.negative_prompt
57
+ output_dir: str = args.output.expanduser()
58
+ seed: int = args.seed
59
+ num_steps: int = args.num_steps
60
+ duration: float = args.duration
61
+ cfg_strength: float = args.cfg_strength
62
+ skip_video_composite: bool = args.skip_video_composite
63
+ mask_away_clip: bool = args.mask_away_clip
64
+
65
+ device = 'cpu'
66
+ if torch.cuda.is_available():
67
+ device = 'cuda'
68
+ elif torch.backends.mps.is_available():
69
+ device = 'mps'
70
+ else:
71
+ log.warning('CUDA/MPS are not available, running on CPU')
72
+ dtype = torch.float32 if args.full_precision else torch.bfloat16
73
+
74
+ output_dir.mkdir(parents=True, exist_ok=True)
75
+
76
+ # load a pretrained model
77
+ net: MMAudio = get_my_mmaudio(model.model_name).to(device, dtype).eval()
78
+ net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
79
+ log.info(f'Loaded weights from {model.model_path}')
80
+
81
+ # misc setup
82
+ rng = torch.Generator(device=device)
83
+ rng.manual_seed(seed)
84
+ fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
85
+
86
+ feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
87
+ synchformer_ckpt=model.synchformer_ckpt,
88
+ enable_conditions=True,
89
+ mode=model.mode,
90
+ bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
91
+ need_vae_encoder=False)
92
+ feature_utils = feature_utils.to(device, dtype).eval()
93
+
94
+ if video_path is not None:
95
+ log.info(f'Using video {video_path}')
96
+ video_info = load_video(video_path, duration)
97
+ clip_frames = video_info.clip_frames
98
+ sync_frames = video_info.sync_frames
99
+ duration = video_info.duration_sec
100
+ if mask_away_clip:
101
+ clip_frames = None
102
+ else:
103
+ clip_frames = clip_frames.unsqueeze(0)
104
+ sync_frames = sync_frames.unsqueeze(0)
105
+ else:
106
+ log.info('No video provided -- text-to-audio mode')
107
+ clip_frames = sync_frames = None
108
+
109
+ seq_cfg.duration = duration
110
+ net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
111
+
112
+ log.info(f'Prompt: {prompt}')
113
+ log.info(f'Negative prompt: {negative_prompt}')
114
+
115
+ audios = generate(clip_frames,
116
+ sync_frames, [prompt],
117
+ negative_text=[negative_prompt],
118
+ feature_utils=feature_utils,
119
+ net=net,
120
+ fm=fm,
121
+ rng=rng,
122
+ cfg_strength=cfg_strength)
123
+ audio = audios.float().cpu()[0]
124
+ if video_path is not None:
125
+ save_path = output_dir / f'{video_path.stem}.flac'
126
+ else:
127
+ safe_filename = prompt.replace(' ', '_').replace('/', '_').replace('.', '')
128
+ save_path = output_dir / f'{safe_filename}.flac'
129
+ torchaudio.save(save_path, audio, seq_cfg.sampling_rate)
130
+
131
+ log.info(f'Audio saved to {save_path}')
132
+ if video_path is not None and not skip_video_composite:
133
+ video_save_path = output_dir / f'{video_path.stem}.mp4'
134
+ make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
135
+ log.info(f'Video saved to {video_save_path}')
136
+
137
+ log.info('Memory usage: %.2f GB', torch.cuda.max_memory_allocated() / (2**30))
138
+
139
+
140
+ if __name__ == '__main__':
141
+ main()
docs/EVAL.md ADDED
@@ -0,0 +1,22 @@
1
+ # Evaluation
2
+
3
+ ## Batch Evaluation
4
+
5
+ To evaluate the model on a dataset, use the `batch_eval.py` script. It is significantly more efficient than `demo.py` for large-scale evaluation, supporting batched inference, multi-GPU inference, torch compilation, and skipping video composition.
6
+
7
+ An example of running this script with four GPUs is as follows:
8
+
9
+ ```bash
10
+ OMP_NUM_THREADS=4 torchrun --standalone --nproc_per_node=4 batch_eval.py duration_s=8 dataset=vggsound model=small_16k num_workers=8
11
+ ```
12
+
13
+ You may need to update the data paths in `config/eval_data/base.yaml`.
14
+ More configuration options can be found in `config/base_config.yaml` and `config/eval_config.yaml`.
15
+
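+ Since evaluation is configured with Hydra, the options in those files can also be overridden on the command line. A sketch (the override values are illustrative; the keys come from `config/base_config.yaml` and `config/eval_config.yaml`):
+
+ ```bash
+ OMP_NUM_THREADS=4 torchrun --standalone --nproc_per_node=4 batch_eval.py \
+     duration_s=8 dataset=vggsound model=large_44k_v2 \
+     sampling.num_steps=50 cfg_strength=4.5 batch_size=8 output_name=steps50
+ ```
+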
16
+ ## Precomputed Results
17
+
18
+ Precomputed results for VGGSound, AudioCaps, and MovieGen are available here: https://huggingface.co/datasets/hkchengrex/MMAudio-precomputed-results
19
+
20
+ ## Obtaining Quantitative Metrics
21
+
22
+ Our evaluation code is available here: https://github.com/hkchengrex/av-benchmark
docs/MODELS.md ADDED
@@ -0,0 +1,50 @@
1
+ # Pretrained models
2
+
3
+ The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
4
+ The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
5
+
6
+ | Model | Download link | File size |
7
+ | -------- | ------- | ------- |
8
+ | Flow prediction network, small 16kHz | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_small_16k.pth" download="mmaudio_small_16k.pth">mmaudio_small_16k.pth</a> | 601M |
9
+ | Flow prediction network, small 44.1kHz | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_small_44k.pth" download="mmaudio_small_44k.pth">mmaudio_small_44k.pth</a> | 601M |
10
+ | Flow prediction network, medium 44.1kHz | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_medium_44k.pth" download="mmaudio_medium_44k.pth">mmaudio_medium_44k.pth</a> | 2.4G |
11
+ | Flow prediction network, large 44.1kHz | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_large_44k.pth" download="mmaudio_large_44k.pth">mmaudio_large_44k.pth</a> | 3.9G |
12
+ | Flow prediction network, large 44.1kHz, v2 **(recommended)** | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_large_44k_v2.pth" download="mmaudio_large_44k_v2.pth">mmaudio_large_44k_v2.pth</a> | 3.9G |
13
+ | 16kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-16.pth">v1-16.pth</a> | 655M |
14
+ | 16kHz BigVGAN vocoder (from Make-An-Audio 2) |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/best_netG.pt">best_netG.pt</a> | 429M |
15
+ | 44.1kHz VAE |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth">v1-44.pth</a> | 1.2G |
16
+ | Synchformer visual encoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth">synchformer_state_dict.pth</a> | 907M |
17
+
18
+ To run the model, you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP is downloaded automatically), a VAE, and a vocoder. VAEs and vocoders are specific to the sampling rate (16kHz or 44.1kHz), not to the model size.
19
+ The 44.1kHz vocoder will be downloaded automatically.
20
+ The `_v2` model performs worse in benchmarking (e.g., in Fréchet distance), but, in my experience, generalizes better to new data.
21
+
22
+ The expected directory structure (full):
23
+
24
+ ```bash
25
+ MMAudio
26
+ ├── ext_weights
27
+ │ ├── best_netG.pt
28
+ │ ├── synchformer_state_dict.pth
29
+ │ ├── v1-16.pth
30
+ │ └── v1-44.pth
31
+ ├── weights
32
+ │ ├── mmaudio_small_16k.pth
33
+ │ ├── mmaudio_small_44k.pth
34
+ │ ├── mmaudio_medium_44k.pth
35
+ │ ├── mmaudio_large_44k.pth
36
+ │ └── mmaudio_large_44k_v2.pth
37
+ └── ...
38
+ ```
39
+
40
+ The expected directory structure (minimal, for the recommended model only):
41
+
42
+ ```bash
43
+ MMAudio
44
+ ├── ext_weights
45
+ │ ├── synchformer_state_dict.pth
46
+ │ └── v1-44.pth
47
+ ├── weights
48
+ │ └── mmaudio_large_44k_v2.pth
49
+ └── ...
50
+ ```
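+
+ Although the demo scripts download these weights automatically, the minimal layout above can also be reproduced manually. A sketch using the download links from the table (assuming `wget` is available):
+
+ ```bash
+ mkdir -p weights ext_weights
+ wget -P weights https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_large_44k_v2.pth
+ wget -P ext_weights https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth
+ wget -P ext_weights https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth
+ ```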
docs/TRAINING.md ADDED
@@ -0,0 +1,184 @@
1
+ # Training
2
+
3
+ ## Overview
4
+
5
+ We have put a large emphasis on making training as fast as possible.
6
+ Consequently, some pre-processing steps are required.
7
+
8
+ Namely, before starting any training, we
9
+
10
+ 1. Obtain training data as videos, audios, and captions.
11
+ 2. Encode the training audio into spectrograms and then, with the VAE, into latent mean/std
12
+ 3. Extract CLIP and synchronization features from videos
13
+ 4. Extract CLIP features from text (captions)
14
+ 5. Encode all extracted features into [MemoryMappedTensors](https://pytorch.org/tensordict/main/reference/generated/tensordict.MemoryMappedTensor.html) with [TensorDict](https://pytorch.org/tensordict/main/reference/tensordict.html)
15
+
16
+ **NOTE:** for maximum training speed (e.g., when training the base model with 2×H100s), you would need around 3~5 GB/s of random read speed. Spinning disks would not be able to keep up, and most consumer-grade SSDs would struggle. In my experience, the best bet is to have enough system memory that the OS can cache the data. This way, the data is read from RAM instead of disk.
17
+
18
+ The current training script does not support `_v2` training.
19
+
20
+ ## Recommended Hardware Configuration
21
+
22
+ These are what I recommend for a smooth and efficient training experience. These are not minimum requirements.
23
+
24
+ - Single-node machine. We did not implement multi-node training
25
+ - GPUs: for the small model, two 80G-H100s or above; for the large model, eight 80G-H100s or above
26
+ - System memory: for 16kHz training, 600GB+; for 44kHz training, 700GB+
27
+ - Storage: >2TB of fast NVMe storage. If you have enough system memory, OS caching will help and the storage does not need to be as fast.
28
+
29
+ ## Prerequisites
30
+
31
+ 1. Install [av-benchmark](https://github.com/hkchengrex/av-benchmark). We use this library to automatically evaluate on the validation set during training, and on the test set after training.
32
+ 2. Extract features for evaluation using [av-benchmark](https://github.com/hkchengrex/av-benchmark) for the validation and test set as a [validation cache](https://github.com/hkchengrex/MMAudio/blob/34bf089fdd2e457cd5ef33be96c0e1c8a0412476/config/data/base.yaml#L38) and a [test cache](https://github.com/hkchengrex/MMAudio/blob/34bf089fdd2e457cd5ef33be96c0e1c8a0412476/config/data/base.yaml#L31). You can also download the precomputed evaluation cache [here](https://huggingface.co/datasets/hkchengrex/MMAudio-precomputed-results/tree/main).
33
+
34
+ 3. You will need ffmpeg to extract frames from videos. Note that `torchaudio` imposes a maximum version limit (`ffmpeg<7`). You can install it as follows:
35
+
36
+ ```bash
37
+ conda install -c conda-forge 'ffmpeg<7'
38
+ ```
39
+
40
+ 4. Download the training datasets. We used [VGGSound](https://arxiv.org/abs/2004.14368), [AudioCaps](https://audiocaps.github.io/), [WavCaps](https://arxiv.org/abs/2303.17395), and [Clotho](https://arxiv.org/abs/1910.09387) (paper to be updated). Note that the audio files in the Hugging Face release of WavCaps have been downsampled to 32kHz. To the best of our ability, we located the original (high-sampling-rate) audio files and used them instead to prevent artifacts during 44.1kHz training. We did not use the "SoundBible" portion of WavCaps, since it is a small set with many short audio clips unsuitable for our training.
41
+
42
+ 5. Download the corresponding VAE (`v1-16.pth` for 16kHz training, and `v1-44.pth` for 44.1kHz training), vocoder models (`best_netG.pt` for 16kHz training; the vocoder for 44.1kHz training will be downloaded automatically), the [empty string encoding](https://github.com/hkchengrex/MMAudio/releases/download/v0.1/empty_string.pth), and the Synchformer weights from [MODELS.md](https://github.com/hkchengrex/MMAudio/blob/main/docs/MODELS.md), and place them in `ext_weights/`.
43
+
44
+ ### Helpful links for downloading the datasets
45
+
46
+ We cannot redistribute the datasets for copyright reasons, but we found some links helpful, and they might be helpful to you as well.
47
+
48
+ - https://huggingface.co/datasets/Meranti/CLAP_freesound
49
+ - https://huggingface.co/datasets/agkphysics/AudioSet
50
+ - https://sound-effects.bbcrewind.co.uk/
51
+
52
+ For certain sources of VGGSound, you might notice desynchronization between the audio and the video. This happens because the video keyframes do not always align with the start of the audio, and playback behavior is player-dependent. We used PyTorch's decoder, which handles these cases correctly.
53
+
54
+ ## Preparing Audio-Video-Text Features
55
+
56
+ We have prepared some example data in `training/example_videos`.
57
+ `training/extract_video_training_latents.py` extracts audio, video, and text features and saves them as a `TensorDict`, along with a `.tsv` file containing metadata, to `output_dir`.
58
+
59
+ To run this script, use the `torchrun` utility:
60
+
61
+ ```bash
62
+ torchrun --standalone training/extract_video_training_latents.py
63
+ ```
64
+
65
+ You can run this script with multiple GPUs (with `--nproc_per_node=<n>` after `--standalone` and before the script name) to speed up extraction.
66
+ Modify the definitions near the top of the script to switch between 16kHz/44.1kHz extraction.
67
+ Change the data path definitions in `data_cfg` if necessary.
68
+
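+ For example, a four-GPU extraction run (assuming four GPUs are available) looks like:
+
+ ```bash
+ torchrun --standalone --nproc_per_node=4 training/extract_video_training_latents.py
+ ```
+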
69
+ Arguments:
70
+
71
+ - `latent_dir` -- where intermediate latent outputs are saved. It is safe to delete this directory afterwards.
72
+ - `output_dir` -- where TensorDict and the metadata file are saved.
73
+
74
+ Outputs produced in `output_dir`:
75
+
76
+ 1. A directory named `vgg-{split}` (i.e., in the TensorDict format), containing
77
+ a. `mean.memmap` mean values predicted by the VAE encoder (number of videos X sequence length X channel size)
78
+ b. `std.memmap` standard deviation values predicted by the VAE encoder (number of videos X sequence length X channel size)
79
+ c. `text_features.memmap` text features extracted from CLIP (number of videos X 77 (sequence length) X 1024)
80
+ d. `clip_features.memmap` clip features extracted from CLIP (number of videos X 64 (8 fps) X 1024)
81
+ e. `sync_features.memmap` synchronization features extracted from Synchformer (number of videos X 192 (24 fps) X 768)
82
+ f. `meta.json` that contains the metadata for the above memory mappings
83
+ 2. A tab-separated values file named `vgg-{split}.tsv` that contains two columns: `id` containing video file names without extension, and `label` containing corresponding text labels (i.e., captions)
84
+
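+ A quick way to sanity-check the extraction output (a sketch; `output/vgg-train` stands in for your actual `output_dir` and split name):
+
+ ```bash
+ ls output/vgg-train              # mean.memmap, std.memmap, text_features.memmap, clip_features.memmap, sync_features.memmap, meta.json
+ head -n 3 output/vgg-train.tsv   # id<TAB>label rows, one per video
+ ```
+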
85
+ ## Preparing Audio-Text Features
86
+
87
+ We have prepared some example data in `training/example_audios`.
88
+
89
+ 1. Run `training/partition_clips.py` to partition each audio file into clips (by finding start and end points; we do not write the partitioned audio to disk, to save space)
90
+ 2. Run `training/extract_audio_training_latents.py` to extract each clip's audio and text features and save them as a `TensorDict` with a `.tsv` file containing metadata to `output_dir`.
91
+
92
+ ### Partitioning the audio files
93
+
94
+ Run
95
+
96
+ ```bash
97
+ python training/partition_clips.py
98
+ ```
99
+
100
+ Arguments:
101
+
102
+ - `data_dir` -- path to a directory containing the audio files (`.flac` or `.wav`)
103
+ - `output_dir` -- path to the output `.csv` file
104
+ - `start` -- optional; useful when you need to run multiple processes to speed up processing -- this defines the beginning of the chunk to be processed
105
+ - `end` -- optional; useful when you need to run multiple processes to speed up processing -- this defines the end of the chunk to be processed
106
+
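+ For example (a sketch; the exact flag spelling depends on the script's argument parser, so adjust if it differs):
+
+ ```bash
+ python training/partition_clips.py \
+     --data_dir=/path/to/audio_files \
+     --output_dir=./output/clips.csv \
+     --start=0 --end=10000   # start/end are optional; use them to shard the work across processes
+ ```
+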
107
+ ### Extracting audio and text features
108
+
109
+ Run
110
+
111
+ ```bash
112
+ torchrun --standalone training/extract_audio_training_latents.py
113
+ ```
114
+
115
+ You can run this with multiple GPUs (with `--nproc_per_node=<n>`) to speed up extraction.
116
+ Modify the definitions near the top of the script to switch between 16kHz/44.1kHz extraction.
117
+
118
+ Arguments:
119
+
120
+ - `data_dir` -- path to a directory containing the audio files (`.flac` or `.wav`), same as the previous step
121
+ - `captions_tsv` -- path to the captions file, a tab-separated values (tsv) file with at least the columns `id` and `caption`
122
+ - `clips_tsv` -- path to the clips file, generated in the last step
123
+ - `latent_dir` -- where intermediate latent outputs are saved. It is safe to delete this directory afterwards.
124
+ - `output_dir` -- where TensorDict and the metadata file are saved.
125
+
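+ For example, a two-GPU run might look like this (a sketch; the GPU count is arbitrary and the flag spelling is assumed -- check the script's argument parser for the exact names):
+
+ ```bash
+ torchrun --standalone --nproc_per_node=2 training/extract_audio_training_latents.py \
+     --data_dir=/path/to/audio_files \
+     --captions_tsv=/path/to/captions.tsv \
+     --clips_tsv=./output/clips.csv \
+     --latent_dir=./output/audio_latents \
+     --output_dir=./output/audiocaps
+ ```
+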
126
+ Outputs produced in `output_dir`:
127
+
128
+ 1. A directory named `{basename(output_dir)}` (i.e., in the TensorDict format), containing
129
+ a. `mean.memmap` mean values predicted by the VAE encoder (number of audios X sequence length X channel size)
130
+ b. `std.memmap` standard deviation values predicted by the VAE encoder (number of audios X sequence length X channel size)
131
+ c. `text_features.memmap` text features extracted from CLIP (number of audios X 77 (sequence length) X 1024)
132
+ d. `meta.json` that contains the metadata for the above memory mappings
133
+ 2. A tab-separated values file named `{basename(output_dir)}.tsv` that contains two columns: `id` containing audio file names without extension, and `label` containing corresponding text labels (i.e., captions)
134
+
135
+ ### Reference tsv files (with overlaps removed as mentioned in the paper)
136
+
137
+ The reference tsv files can be found [here](https://github.com/hkchengrex/MMAudio/releases/tag/v0.1).
138
+
139
+ Note that these reference tsv files are the **outputs** of `extract_audio_training_latents.py`, which means the `id` column might contain duplicate entries (one per clip). You can still use them as the `captions_tsv` input, though -- the script will handle duplicates gracefully.
140
+ Among these reference tsv files, `audioset_sl.tsv`, `bbcsound.tsv`, and `freesound.tsv` are subsets that are part of WavCaps. These subsets might be smaller than the original datasets.
141
+ The Clotho data contains both the development set and the validation set.
142
+
143
+ **Update (Mar 9, 2025)**:
144
+ We have uploaded a corrected set of reference tsv files. The previous tsv files contained some (<1%) corrupted captions (i.e., mismatches between audio and caption; see https://github.com/hkchengrex/MMAudio/issues/56). The tsv files for VGGSound are unaffected. The reason for this error is unknown, and I cannot reproduce it in the latest version of the code. Our pre-trained models were trained with the **uncorrected** tsv files. For future training, I recommend using the corrected tsv files.
145
+
146
+ The error statistics are as follows:
147
+
148
+ - AudioCaps: (170/43824), 0.39%
149
+ - Freesound: (1670/180636), 0.92%
150
+ - AudioSet: (290/100776), 0.29%
151
+ - BBCSound: (3/29975), 0.01%
152
+ - Clotho: (8/24332), 0.03%
153
+
154
+ ## Training on Extracted Features
155
+
156
+ We use Distributed Data Parallel (DDP) for training.
157
+ First, specify the data path in `config/data/base.yaml`. If you used the default parameters in the scripts above to extract features for the example data, the `Example_video` and `Example_audio` items should already be correct.
158
+
159
+ To run training on the example data, use the following command:
160
+
161
+ ```bash
162
+ OMP_NUM_THREADS=4 torchrun --standalone --nproc_per_node=1 train.py exp_id=debug compile=False debug=True example_train=True batch_size=1
163
+ ```
164
+
165
+ This will not train a useful model, but it will check if everything is set up correctly.
166
+
167
+ For full training on the base model with two GPUs, use the following command:
168
+
169
+ ```bash
170
+ OMP_NUM_THREADS=4 torchrun --standalone --nproc_per_node=2 train.py exp_id=exp_1 model=small_16k
171
+ ```
172
+
173
+ Any outputs from training will be stored in `output/<exp_id>`.
174
+
175
+ More configuration options can be found in `config/base_config.yaml` and `config/train_config.yaml`.
176
+ For the medium and large models, specify `vgg_oversample_rate` to be `3` to reduce overfitting.
177
+
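+ For example (a sketch; `medium_44k` is assumed to be the config name of the medium model and the GPU count is arbitrary -- check `config/` for the exact model names):
+
+ ```bash
+ OMP_NUM_THREADS=4 torchrun --standalone --nproc_per_node=8 train.py \
+     exp_id=exp_2 model=medium_44k vgg_oversample_rate=3
+ ```
+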
178
+ ## Checkpoints
179
+
180
+ Model checkpoints, including optimizer states and the latest EMA weights, are available here: https://huggingface.co/hkchengrex/MMAudio
181
+
182
+ ---
183
+
184
+ Godspeed!
docs/images/icon.png ADDED
docs/index.html ADDED
@@ -0,0 +1,149 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <!-- Google tag (gtag.js) -->
5
+ <script async src="https://www.googletagmanager.com/gtag/js?id=G-0JKBJ3WRJZ"></script>
6
+ <script>
7
+ window.dataLayer = window.dataLayer || [];
8
+ function gtag(){dataLayer.push(arguments);}
9
+ gtag('js', new Date());
10
+ gtag('config', 'G-0JKBJ3WRJZ');
11
+ </script>
12
+
13
+ <link rel="preconnect" href="https://fonts.googleapis.com">
14
+ <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
15
+ <link href="https://fonts.googleapis.com/css2?family=Source+Sans+3&display=swap" rel="stylesheet">
16
+ <meta charset="UTF-8">
17
+ <title>MMAudio</title>
18
+
19
+ <link rel="icon" type="image/png" href="images/icon.png">
20
+
21
+ <meta name="viewport" content="width=device-width, initial-scale=1">
22
+ <!-- CSS only -->
23
+ <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
24
+ integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
25
+ <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
26
+
27
+ <link rel="stylesheet" href="style.css">
28
+ </head>
29
+ <body>
30
+
31
+
32
+ <br><br><br><br>
33
+ <div class="container">
34
+ <div class="row text-center" style="font-size:38px">
35
+ <div class="col strong">
36
+ Taming Multimodal Joint Training for High-Quality <br>Video-to-Audio Synthesis
37
+ </div>
38
+ </div>
39
+
40
+ <br>
41
+ <div class="row text-center" style="font-size:28px">
42
+ <div class="col">
43
+ CVPR 2025
44
+ </div>
45
+ </div>
46
+ <br>
47
+
48
+ <div class="h-100 row text-center heavy justify-content-md-center" style="font-size:22px;">
49
+ <div class="col-sm-auto px-lg-2">
50
+ <a href="https://hkchengrex.github.io/">Ho Kei Cheng<sup>1</sup></a>
51
+ </div>
52
+ <div class="col-sm-auto px-lg-2">
53
+ <nobr><a href="https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ">Masato Ishii<sup>2</sup></a></nobr>
54
+ </div>
55
+ <div class="col-sm-auto px-lg-2">
56
+ <nobr><a href="https://scholar.google.com/citations?user=sXAjHFIAAAAJ">Akio Hayakawa<sup>2</sup></a></nobr>
57
+ </div>
58
+ <div class="col-sm-auto px-lg-2">
59
+ <nobr><a href="https://scholar.google.com/citations?user=XCRO260AAAAJ">Takashi Shibuya<sup>2</sup></a></nobr>
60
+ </div>
61
+ <div class="col-sm-auto px-lg-2">
62
+ <nobr><a href="https://www.alexander-schwing.de/">Alexander Schwing<sup>1</sup></a></nobr>
63
+ </div>
64
+ <div class="col-sm-auto px-lg-2" >
65
+ <nobr><a href="https://www.yukimitsufuji.com/">Yuki Mitsufuji<sup>2,3</sup></a></nobr>
66
+ </div>
67
+ </div>
68
+
69
+ <div class="h-100 row text-center heavy justify-content-md-center" style="font-size:22px;">
70
+ <div class="col-sm-auto px-lg-2">
71
+ <sup>1</sup>University of Illinois Urbana-Champaign
72
+ </div>
73
+ <div class="col-sm-auto px-lg-2">
74
+ <sup>2</sup>Sony AI
75
+ </div>
76
+ <div class="col-sm-auto px-lg-2">
77
+ <sup>3</sup>Sony Group Corporation
78
+ </div>
79
+ </div>
80
+
81
+ <br>
82
+
83
+ <br>
84
+
85
+ <div class="h-100 row text-center justify-content-md-center" style="font-size:20px;">
86
+ <div class="col-sm-2">
87
+ <a href="https://arxiv.org/abs/2412.15322">[Paper]</a>
88
+ </div>
89
+ <div class="col-sm-2">
90
+ <a href="https://github.com/hkchengrex/MMAudio">[Code]</a>
91
+ </div>
92
+ <div class="col-sm-3">
93
+ <a href="https://huggingface.co/spaces/hkchengrex/MMAudio">[Huggingface Demo]</a>
94
+ </div>
95
+ <div class="col-sm-2">
96
+ <a href="https://colab.research.google.com/drive/1TAaXCY2-kPk4xE4PwKB3EqFbSnkUuzZ8?usp=sharing">[Colab Demo]</a>
97
+ </div>
98
+ <div class="col-sm-3">
99
+ <a href="https://replicate.com/zsxkib/mmaudio">[Replicate Demo]</a>
100
+ </div>
101
+ </div>
102
+
103
+ <br>
104
+
105
+ <hr>
106
+
107
+ <div class="row" style="font-size:32px">
108
+ <div class="col strong">
109
+ TL;DR
110
+ </div>
111
+ </div>
112
+ <br>
113
+ <div class="row">
114
+ <div class="col">
115
+ <p class="light" style="text-align: left;">
116
+ MMAudio generates synchronized audio given video and/or text inputs.
117
+ </p>
118
+ </div>
119
+ </div>
120
+
121
+ <br>
122
+ <hr>
123
+ <br>
124
+
125
+ <div class="row" style="font-size:32px">
126
+ <div class="col strong">
127
+ Demo
128
+ </div>
129
+ </div>
130
+ <br>
131
+ <div class="row" style="font-size:48px">
132
+ <div class="col strong text-center">
133
+ <a href="video_main.html" style="text-decoration: underline;">&lt;More results&gt;</a>
134
+ </div>
135
+ </div>
136
+ <br>
137
+ <div class="video-container" style="text-align: center;">
138
+ <iframe src="https://youtube.com/embed/YElewUT2M4M"></iframe>
139
+ </div>
140
+
141
+ <br>
142
+
143
+ <br><br>
144
+ <br><br>
145
+
146
+ </div>
147
+
148
+ </body>
149
+ </html>
docs/style.css ADDED
@@ -0,0 +1,78 @@
1
+ body {
2
+ font-family: 'Source Sans 3', sans-serif;
3
+ font-size: 18px;
4
+ margin-left: auto;
5
+ margin-right: auto;
6
+ font-weight: 400;
7
+ height: 100%;
8
+ max-width: 1000px;
9
+ }
10
+
11
+ table {
12
+ width: 100%;
13
+ border-collapse: collapse;
14
+ }
15
+ th, td {
16
+ border: 1px solid #ddd;
17
+ padding: 8px;
18
+ text-align: center;
19
+ }
20
+ th {
21
+ background-color: #f2f2f2;
22
+ }
23
+ video {
24
+ width: 100%;
25
+ height: auto;
26
+ }
27
+ p {
28
+ font-size: 28px;
29
+ }
30
+ h2 {
31
+ font-size: 36px;
32
+ }
33
+
34
+ .strong {
35
+ font-weight: 700;
36
+ }
37
+
38
+ .light {
39
+ font-weight: 100;
40
+ }
41
+
42
+ .heavy {
43
+ font-weight: 900;
44
+ }
45
+
46
+ .column {
47
+ float: left;
48
+ }
49
+
50
+ a:link,
51
+ a:visited {
52
+ color: #05538f;
53
+ text-decoration: none;
54
+ }
55
+
56
+ a:hover {
57
+ color: #63cbdd;
58
+ }
59
+
60
+ hr {
61
+ border: 0;
62
+ height: 1px;
63
+ background-image: linear-gradient(to right, rgba(0, 0, 0, 0), rgba(0, 0, 0, 0.75), rgba(0, 0, 0, 0));
64
+ }
65
+
66
+ .video-container {
67
+ position: relative;
68
+ padding-bottom: 56.25%; /* 16:9 */
69
+ height: 0;
70
+ }
71
+
72
+ .video-container iframe {
73
+ position: absolute;
74
+ top: 0;
75
+ left: 0;
76
+ width: 100%;
77
+ height: 100%;
78
+ }
docs/style_videos.css ADDED
@@ -0,0 +1,52 @@
1
+ body {
2
+ font-family: 'Source Sans 3', sans-serif;
3
+ font-size: 1.5vh;
4
+ font-weight: 400;
5
+ }
6
+
7
+ table {
8
+ width: 100%;
9
+ border-collapse: collapse;
10
+ }
11
+ th, td {
12
+ border: 1px solid #ddd;
13
+ padding: 8px;
14
+ text-align: center;
15
+ }
16
+ th {
17
+ background-color: #f2f2f2;
18
+ }
19
+ video {
20
+ width: 100%;
21
+ height: auto;
22
+ }
23
+ p {
24
+ font-size: 1.5vh;
25
+ font-weight: bold;
26
+ }
27
+ h2 {
28
+ font-size: 2vh;
29
+ font-weight: bold;
30
+ }
31
+
32
+ .video-container {
33
+ position: relative;
34
+ padding-bottom: 56.25%; /* 16:9 */
35
+ height: 0;
36
+ }
37
+
38
+ .video-container iframe {
39
+ position: absolute;
40
+ top: 0;
41
+ left: 0;
42
+ width: 100%;
43
+ height: 100%;
44
+ }
45
+
46
+ .video-header {
47
+ background-color: #f2f2f2;
48
+ text-align: center;
49
+ font-size: 1.5vh;
50
+ font-weight: bold;
51
+ padding: 8px;
52
+ }
docs/video_gen.html ADDED
@@ -0,0 +1,254 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <!-- Google tag (gtag.js) -->
5
+ <script async src="https://www.googletagmanager.com/gtag/js?id=G-0JKBJ3WRJZ"></script>
6
+ <script>
7
+ window.dataLayer = window.dataLayer || [];
8
+ function gtag(){dataLayer.push(arguments);}
9
+ gtag('js', new Date());
10
+ gtag('config', 'G-0JKBJ3WRJZ');
11
+ </script>
12
+
13
+ <link href='https://fonts.googleapis.com/css?family=Source+Sans+Pro' rel='stylesheet' type='text/css'>
14
+ <meta charset="UTF-8">
15
+ <title>MMAudio</title>
16
+
17
+ <link rel="icon" type="image/png" href="images/icon.png">
18
+
19
+ <meta name="viewport" content="width=device-width, initial-scale=1">
20
+ <!-- CSS only -->
21
+ <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
22
+ integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
23
+ <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.7.1/jquery.min.js"></script>
24
+
25
+ <link rel="stylesheet" href="style_videos.css">
26
+ </head>
27
+ <body>
28
+
29
+ <div id="moviegen_all">
30
+ <h2 id="moviegen" style="text-align: center;">Comparisons with Movie Gen Audio on Videos Generated by MovieGen</h2>
31
+ <p id="moviegen1" style="overflow: hidden;">
32
+ Example 1: Ice cracking with sharp snapping sound, and metal tool scraping against the ice surface.
33
+ <span style="float: right;"><a href="#index">Back to index</a></span>
34
+ </p>
35
+
36
+ <div class="row g-1">
37
+ <div class="col-sm-6">
38
+ <div class="video-header">Movie Gen Audio</div>
39
+ <div class="video-container">
40
+ <iframe src="https://youtube.com/embed/d7Lb0ihtGcE"></iframe>
41
+ </div>
42
+ </div>
43
+ <div class="col-sm-6">
44
+ <div class="video-header">Ours</div>
45
+ <div class="video-container">
46
+ <iframe src="https://youtube.com/embed/F4JoJ2r2m8U"></iframe>
47
+ </div>
48
+ </div>
49
+ </div>
50
+ <br>
51
+
52
+ <!-- <p id="moviegen2">Example 2: Rhythmic splashing and lapping of water. <span style="float:right;"><a href="#index">Back to index</a></span> </p>
53
+
54
+ <table>
55
+ <thead>
56
+ <tr>
57
+ <th>Movie Gen Audio</th>
58
+ <th>Ours</th>
59
+ </tr>
60
+ </thead>
61
+ <tbody>
62
+ <tr>
63
+ <td width="50%">
64
+ <div class="video-container">
65
+ <iframe src="https://youtube.com/embed/5gQNPK99CIk"></iframe>
66
+ </div>
67
+ </td>
68
+ <td width="50%">
69
+ <div class="video-container">
70
+ <iframe src="https://youtube.com/embed/AbwnTzG-BpA"></iframe>
71
+ </div>
72
+ </td>
73
+ </tr>
74
+ </tbody>
75
+ </table> -->
76
+
77
+ <p id="moviegen2" style="overflow: hidden;">
78
+ Example 2: Rhythmic splashing and lapping of water.
79
+ <span style="float:right;"><a href="#index">Back to index</a></span>
80
+ </p>
81
+ <div class="row g-1">
82
+ <div class="col-sm-6">
83
+ <div class="video-header">Movie Gen Audio</div>
84
+ <div class="video-container">
85
+ <iframe src="https://youtube.com/embed/5gQNPK99CIk"></iframe>
86
+ </div>
87
+ </div>
88
+ <div class="col-sm-6">
89
+ <div class="video-header">Ours</div>
90
+ <div class="video-container">
91
+ <iframe src="https://youtube.com/embed/AbwnTzG-BpA"></iframe>
92
+ </div>
93
+ </div>
94
+ </div>
95
+ <br>
96
+
97
+ <p id="moviegen3" style="overflow: hidden;">
98
+ Example 3: Shovel scrapes against dry earth.
99
+ <span style="float:right;"><a href="#index">Back to index</a></span>
100
+ </p>
101
+ <div class="row g-1">
102
+ <div class="col-sm-6">
103
+ <div class="video-header">Movie Gen Audio</div>
104
+ <div class="video-container">
105
+ <iframe src="https://youtube.com/embed/PUKGyEve7XQ"></iframe>
106
+ </div>
107
+ </div>
108
+ <div class="col-sm-6">
109
+ <div class="video-header">Ours</div>
110
+ <div class="video-container">
111
+ <iframe src="https://youtube.com/embed/CNn7i8VNkdc"></iframe>
112
+ </div>
113
+ </div>
114
+ </div>
115
+ <br>
116
+
117
+
118
+ <p id="moviegen4" style="overflow: hidden;">
119
+ (Failure case) Example 4: Creamy sound of mashed potatoes being scooped.
120
+ <span style="float:right;"><a href="#index">Back to index</a></span>
121
+ </p>
122
+ <div class="row g-1">
123
+ <div class="col-sm-6">
124
+ <div class="video-header">Movie Gen Audio</div>
125
+ <div class="video-container">
126
+ <iframe src="https://youtube.com/embed/PJv1zxR9JjQ"></iframe>
127
+ </div>
128
+ </div>
129
+ <div class="col-sm-6">
130
+ <div class="video-header">Ours</div>
131
+ <div class="video-container">
132
+ <iframe src="https://youtube.com/embed/c3-LJ1lNsPQ"></iframe>
133
+ </div>
134
+ </div>
135
+ </div>
136
+ <br>
137
+
138
+ </div>
139
+
140
+ <div id="hunyuan_sora_all">
141
+
142
+ <h2 id="hunyuan" style="text-align: center;">Results on Videos Generated by Hunyuan</h2>
143
+ <p style="overflow: hidden;">
144
+ <span style="float:right;"><a href="#index">Back to index</a></span>
145
+ </p>
146
+ <div class="row g-1">
147
+ <div class="col-sm-6">
148
+ <div class="video-header">Typing</div>
149
+ <div class="video-container">
150
+ <iframe src="https://youtube.com/embed/8ln_9hhH_nk"></iframe>
151
+ </div>
152
+ </div>
153
+ <div class="col-sm-6">
154
+ <div class="video-header">Water is rushing down a stream and pouring</div>
155
+ <div class="video-container">
156
+ <iframe src="https://youtube.com/embed/5df1FZFQj30"></iframe>
157
+ </div>
158
+ </div>
159
+ </div>
160
+ <div class="row g-1">
161
+ <div class="col-sm-6">
162
+ <div class="video-header">Waves on beach</div>
163
+ <div class="video-container">
164
+ <iframe src="https://youtube.com/embed/7wQ9D5WgpFc"></iframe>
165
+ </div>
166
+ </div>
167
+ <div class="col-sm-6">
168
+ <div class="video-header">Water droplet</div>
169
+ <div class="video-container">
170
+ <iframe src="https://youtube.com/embed/q7M2nsalGjM"></iframe>
171
+ </div>
172
+ </div>
173
+ </div>
174
+ <br>
175
+
176
+ <h2 id="sora" style="text-align: center;">Results on Videos Generated by Sora</h2>
177
+ <p style="overflow: hidden;">
178
+ <span style="float:right;"><a href="#index">Back to index</a></span>
179
+ </p>
180
+ <div class="row g-1">
181
+ <div class="col-sm-6">
182
+ <div class="video-header">Ships riding waves</div>
183
+ <div class="video-container">
184
+ <iframe src="https://youtube.com/embed/JbgQzHHytk8"></iframe>
185
+ </div>
186
+ </div>
187
+ <div class="col-sm-6">
188
+ <div class="video-header">Train (no text prompt given)</div>
189
+ <div class="video-container">
190
+ <iframe src="https://youtube.com/embed/xOW7zrjpWC8"></iframe>
191
+ </div>
192
+ </div>
193
+ </div>
194
+ <div class="row g-1">
195
+ <div class="col-sm-6">
196
+ <div class="video-header">Seashore (no text prompt given)</div>
197
+ <div class="video-container">
198
+ <iframe src="https://youtube.com/embed/fIuw5Y8ZZ9E"></iframe>
199
+ </div>
200
+ </div>
201
+ <div class="col-sm-6">
202
+ <div class="video-header">Surfing (failure: unprompted music)</div>
203
+ <div class="video-container">
204
+ <iframe src="https://youtube.com/embed/UcSTk-v0M_s"></iframe>
205
+ </div>
206
+ </div>
207
+ </div>
208
+ <br>
209
+
210
+ <div id="mochi_ltx_all">
211
+ <h2 id="mochi" style="text-align: center;">Results on Videos Generated by Mochi 1</h2>
212
+ <p style="overflow: hidden;">
213
+ <span style="float:right;"><a href="#index">Back to index</a></span>
214
+ </p>
215
+ <div class="row g-1">
216
+ <div class="col-sm-6">
217
+ <div class="video-header">Magical fire and lightning (no text prompt given)</div>
218
+ <div class="video-container">
219
+ <iframe src="https://youtube.com/embed/tTlRZaSMNwY"></iframe>
220
+ </div>
221
+ </div>
222
+ <div class="col-sm-6">
223
+ <div class="video-header">Storm (no text prompt given)</div>
224
+ <div class="video-container">
225
+ <iframe src="https://youtube.com/embed/4hrZTMJUy3w"></iframe>
226
+ </div>
227
+ </div>
228
+ </div>
229
+ <br>
230
+
231
+ <h2 id="ltx" style="text-align: center;">Results on Videos Generated by LTX-Video</h2>
232
+ <p style="overflow: hidden;">
233
+ <span style="float:right;"><a href="#index">Back to index</a></span>
234
+ </p>
235
+ <div class="row g-1">
236
+ <div class="col-sm-6">
237
+ <div class="video-header">Firewood burning and cracking</div>
238
+ <div class="video-container">
239
+ <iframe src="https://youtube.com/embed/P7_DDpgev0g"></iframe>
240
+ </div>
241
+ </div>
242
+ <div class="col-sm-6">
243
+ <div class="video-header">Waterfall, water splashing</div>
244
+ <div class="video-container">
245
+ <iframe src="https://youtube.com/embed/4MvjceYnIO0"></iframe>
246
+ </div>
247
+ </div>
248
+ </div>
249
+ <br>
250
+
251
+ </div>
252
+
253
+ </body>
254
+ </html>
docs/video_main.html ADDED
@@ -0,0 +1,98 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <!-- Google tag (gtag.js) -->
5
+ <script async src="https://www.googletagmanager.com/gtag/js?id=G-0JKBJ3WRJZ"></script>
6
+ <script>
7
+ window.dataLayer = window.dataLayer || [];
8
+ function gtag(){dataLayer.push(arguments);}
9
+ gtag('js', new Date());
10
+ gtag('config', 'G-0JKBJ3WRJZ');
11
+ </script>
12
+
13
+ <link href='https://fonts.googleapis.com/css?family=Source+Sans+Pro' rel='stylesheet' type='text/css'>
14
+ <meta charset="UTF-8">
15
+ <title>MMAudio</title>
16
+
17
+ <link rel="icon" type="image/png" href="images/icon.png">
18
+
19
+ <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
20
+ <!-- CSS only -->
21
+ <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
22
+ integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
23
+ <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.7.1/jquery.min.js"></script>
24
+
25
+ <link rel="stylesheet" href="style_videos.css">
26
+
27
+ <script type="text/javascript">
28
+ $(document).ready(function(){
29
+ $("#content").load("video_gen.html #moviegen_all");
30
+ $("#load_moveigen").click(function(){
31
+ $("#content").load("video_gen.html #moviegen_all");
32
+ });
33
+ $("#load_hunyuan_sora").click(function(){
34
+ $("#content").load("video_gen.html #hunyuan_sora_all");
35
+ });
36
+ $("#load_mochi_ltx").click(function(){
37
+ $("#content").load("video_gen.html #mochi_ltx_all");
38
+ });
39
+ $("#load_vgg1").click(function(){
40
+ $("#content").load("video_vgg.html #vgg1");
41
+ });
42
+ $("#load_vgg2").click(function(){
43
+ $("#content").load("video_vgg.html #vgg2");
44
+ });
45
+ $("#load_vgg3").click(function(){
46
+ $("#content").load("video_vgg.html #vgg3");
47
+ });
48
+ $("#load_vgg4").click(function(){
49
+ $("#content").load("video_vgg.html #vgg4");
50
+ });
51
+ $("#load_vgg5").click(function(){
52
+ $("#content").load("video_vgg.html #vgg5");
53
+ });
54
+ $("#load_vgg6").click(function(){
55
+ $("#content").load("video_vgg.html #vgg6");
56
+ });
57
+ $("#load_vgg_extra").click(function(){
58
+ $("#content").load("video_vgg.html #vgg_extra");
59
+ });
60
+ });
61
+ </script>
62
+ </head>
63
+ <body>
64
+ <h1 id="index" style="text-align: center;">Index</h1>
65
+ <p><b>(Click on the links to load the corresponding videos)</b> <span style="float:right;"><a href="index.html">Back to project page</a></span></p>
66
+
67
+ <ol>
68
+ <li>
69
+ <a href="#" id="load_moveigen">Comparisons with Movie Gen Audio on Videos Generated by MovieGen</a>
70
+ </li>
71
+ <li>
72
+ <a href="#" id="load_hunyuan_sora">Results on Videos Generated by Hunyuan and Sora</a>
73
+ </li>
74
+ <li>
75
+ <a href="#" id="load_mochi_ltx">Results on Videos Generated by Mochi 1 and LTX-Video</a>
76
+ </li>
77
+ <li>
78
+ On VGGSound
79
+ <ol>
80
+ <li><a id='load_vgg1' href="#">Example 1: Wolf howling</a></li>
81
+ <li><a id='load_vgg2' href="#">Example 2: Striking a golf ball</a></li>
82
+ <li><a id='load_vgg3' href="#">Example 3: Hitting a drum</a></li>
83
+ <li><a id='load_vgg4' href="#">Example 4: Dog barking</a></li>
84
+ <li><a id='load_vgg5' href="#">Example 5: Playing a string instrument</a></li>
85
+ <li><a id='load_vgg6' href="#">Example 6: A group of people playing tambourines</a></li>
86
+ <li><a id='load_vgg_extra' href="#">Extra results & failure cases</a></li>
87
+ </ol>
88
+ </li>
89
+ </ol>
90
+
91
+ <div id="content" class="container-fluid">
92
+
93
+ </div>
94
+ <br>
95
+ <br>
96
+
97
+ </body>
98
+ </html>
docs/video_vgg.html ADDED
@@ -0,0 +1,452 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <!-- Google tag (gtag.js) -->
5
+ <script async src="https://www.googletagmanager.com/gtag/js?id=G-0JKBJ3WRJZ"></script>
6
+ <script>
7
+ window.dataLayer = window.dataLayer || [];
8
+ function gtag(){dataLayer.push(arguments);}
9
+ gtag('js', new Date());
10
+ gtag('config', 'G-0JKBJ3WRJZ');
11
+ </script>
12
+
13
+ <link href='https://fonts.googleapis.com/css?family=Source+Sans+Pro' rel='stylesheet' type='text/css'>
14
+ <meta charset="UTF-8">
15
+ <title>MMAudio</title>
16
+
17
+ <meta name="viewport" content="width=device-width, initial-scale=1">
18
+ <!-- CSS only -->
19
+ <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
20
+ integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
21
+ <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
22
+
23
+ <link rel="stylesheet" href="style_videos.css">
24
+ </head>
25
+ <body>
26
+
27
+ <div id="vgg1">
28
+ <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
29
+ <p style="overflow: hidden;">
30
+ Example 1: Wolf howling.
31
+ <span style="float:right;"><a href="#index">Back to index</a></span>
32
+ </p>
33
+ <div class="row g-1">
34
+ <div class="col-sm-3">
35
+ <div class="video-header">Ground-truth</div>
36
+ <div class="video-container">
37
+ <iframe src="https://youtube.com/embed/9J_V74gqMUA"></iframe>
38
+ </div>
39
+ </div>
40
+ <div class="col-sm-3">
41
+ <div class="video-header">Ours</div>
42
+ <div class="video-container">
43
+ <iframe src="https://youtube.com/embed/P6O8IpjErPc"></iframe>
44
+ </div>
45
+ </div>
46
+ <div class="col-sm-3">
47
+ <div class="video-header">V2A-Mapper</div>
48
+ <div class="video-container">
49
+ <iframe src="https://youtube.com/embed/w-5eyqepvTk"></iframe>
50
+ </div>
51
+ </div>
52
+ <div class="col-sm-3">
53
+ <div class="video-header">FoleyCrafter</div>
54
+ <div class="video-container">
55
+ <iframe src="https://youtube.com/embed/VOLfoZlRkzo"></iframe>
56
+ </div>
57
+ </div>
58
+ </div>
59
+ <div class="row g-1">
60
+ <div class="col-sm-3">
61
+ <div class="video-header">Frieren</div>
62
+ <div class="video-container">
63
+ <iframe src="https://youtube.com/embed/49owKyA5Pa8"></iframe>
64
+ </div>
65
+ </div>
66
+ <div class="col-sm-3">
67
+ <div class="video-header">VATT</div>
68
+ <div class="video-container">
69
+ <iframe src="https://youtube.com/embed/QVtrFgbeGDM"></iframe>
70
+ </div>
71
+ </div>
72
+ <div class="col-sm-3">
73
+ <div class="video-header">V-AURA</div>
74
+ <div class="video-container">
75
+ <iframe src="https://youtube.com/embed/8r0uEfSNjvI"></iframe>
76
+ </div>
77
+ </div>
78
+ <div class="col-sm-3">
79
+ <div class="video-header">Seeing and Hearing</div>
80
+ <div class="video-container">
81
+ <iframe src="https://youtube.com/embed/bn-sLg2qulk"></iframe>
82
+ </div>
83
+ </div>
84
+ </div>
85
+ </div>
86
+
87
+ <div id="vgg2">
88
+ <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
89
+ <p style="overflow: hidden;">
90
+ Example 2: Striking a golf ball.
91
+ <span style="float:right;"><a href="#index">Back to index</a></span>
92
+ </p>
93
+
94
+ <div class="row g-1">
95
+ <div class="col-sm-3">
96
+ <div class="video-header">Ground-truth</div>
97
+ <div class="video-container">
98
+ <iframe src="https://youtube.com/embed/1hwSu42kkho"></iframe>
99
+ </div>
100
+ </div>
101
+ <div class="col-sm-3">
102
+ <div class="video-header">Ours</div>
103
+ <div class="video-container">
104
+ <iframe src="https://youtube.com/embed/kZibDoDCNxI"></iframe>
105
+ </div>
106
+ </div>
107
+ <div class="col-sm-3">
108
+ <div class="video-header">V2A-Mapper</div>
109
+ <div class="video-container">
110
+ <iframe src="https://youtube.com/embed/jgKfLBLhh7Y"></iframe>
111
+ </div>
112
+ </div>
113
+ <div class="col-sm-3">
114
+ <div class="video-header">FoleyCrafter</div>
115
+ <div class="video-container">
116
+ <iframe src="https://youtube.com/embed/Lfsx8mOPcJo"></iframe>
117
+ </div>
118
+ </div>
119
+ </div>
120
+ <div class="row g-1">
121
+ <div class="col-sm-3">
122
+ <div class="video-header">Frieren</div>
123
+ <div class="video-container">
124
+ <iframe src="https://youtube.com/embed/tz-LpbB0MBc"></iframe>
125
+ </div>
126
+ </div>
127
+ <div class="col-sm-3">
128
+ <div class="video-header">VATT</div>
129
+ <div class="video-container">
130
+ <iframe src="https://youtube.com/embed/RTDUHMi08n4"></iframe>
131
+ </div>
132
+ </div>
133
+ <div class="col-sm-3">
134
+ <div class="video-header">V-AURA</div>
135
+ <div class="video-container">
136
+ <iframe src="https://youtube.com/embed/N-3TDOsPnZQ"></iframe>
137
+ </div>
138
+ </div>
139
+ <div class="col-sm-3">
140
+ <div class="video-header">Seeing and Hearing</div>
141
+ <div class="video-container">
142
+ <iframe src="https://youtube.com/embed/QnsHnLn4gB0"></iframe>
143
+ </div>
144
+ </div>
145
+ </div>
146
+ </div>
147
+
148
+ <div id="vgg3">
149
+ <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
150
+ <p style="overflow: hidden;">
151
+ Example 3: Hitting a drum.
152
+ <span style="float:right;"><a href="#index">Back to index</a></span>
153
+ </p>
154
+
155
+ <div class="row g-1">
156
+ <div class="col-sm-3">
157
+ <div class="video-header">Ground-truth</div>
158
+ <div class="video-container">
159
+ <iframe src="https://youtube.com/embed/0oeIwq77w0Q"></iframe>
160
+ </div>
161
+ </div>
162
+ <div class="col-sm-3">
163
+ <div class="video-header">Ours</div>
164
+ <div class="video-container">
165
+ <iframe src="https://youtube.com/embed/-UtPV9ohuIM"></iframe>
166
+ </div>
167
+ </div>
168
+ <div class="col-sm-3">
169
+ <div class="video-header">V2A-Mapper</div>
170
+ <div class="video-container">
171
+ <iframe src="https://youtube.com/embed/9yivkgN-zwc"></iframe>
172
+ </div>
173
+ </div>
174
+ <div class="col-sm-3">
175
+ <div class="video-header">FoleyCrafter</div>
176
+ <div class="video-container">
177
+ <iframe src="https://youtube.com/embed/kkCsXPOlBvY"></iframe>
178
+ </div>
179
+ </div>
180
+ </div>
181
+ <div class="row g-1">
182
+ <div class="col-sm-3">
183
+ <div class="video-header">Frieren</div>
184
+ <div class="video-container">
185
+ <iframe src="https://youtube.com/embed/MbNKsVsuvig"></iframe>
186
+ </div>
187
+ </div>
188
+ <div class="col-sm-3">
189
+ <div class="video-header">VATT</div>
190
+ <div class="video-container">
191
+ <iframe src="https://youtube.com/embed/2yYviBjrpBw"></iframe>
192
+ </div>
193
+ </div>
194
+ <div class="col-sm-3">
195
+ <div class="video-header">V-AURA</div>
196
+ <div class="video-container">
197
+ <iframe src="https://youtube.com/embed/9yivkgN-zwc"></iframe>
198
+ </div>
199
+ </div>
200
+ <div class="col-sm-3">
201
+ <div class="video-header">Seeing and Hearing</div>
202
+ <div class="video-container">
203
+ <iframe src="https://youtube.com/embed/6dnyQt4Fuhs"></iframe>
204
+ </div>
205
+ </div>
206
+ </div>
207
+ </div>
208
+ </div>
209
+
210
+ <div id="vgg4">
211
+ <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
212
+ <p style="overflow: hidden;">
213
+ Example 4: Dog barking.
214
+ <span style="float:right;"><a href="#index">Back to index</a></span>
215
+ </p>
216
+
217
+ <div class="row g-1">
218
+ <div class="col-sm-3">
219
+ <div class="video-header">Ground-truth</div>
220
+ <div class="video-container">
221
+ <iframe src="https://youtube.com/embed/ckaqvTyMYAw"></iframe>
222
+ </div>
223
+ </div>
224
+ <div class="col-sm-3">
225
+ <div class="video-header">Ours</div>
226
+ <div class="video-container">
227
+ <iframe src="https://youtube.com/embed/_aRndFZzZ-I"></iframe>
228
+ </div>
229
+ </div>
230
+ <div class="col-sm-3">
231
+ <div class="video-header">V2A-Mapper</div>
232
+ <div class="video-container">
233
+ <iframe src="https://youtube.com/embed/mNCISP3LBl0"></iframe>
234
+ </div>
235
+ </div>
236
+ <div class="col-sm-3">
237
+ <div class="video-header">FoleyCrafter</div>
238
+ <div class="video-container">
239
+ <iframe src="https://youtube.com/embed/phZBQ3L7foE"></iframe>
240
+ </div>
241
+ </div>
242
+ </div>
243
+ <div class="row g-1">
244
+ <div class="col-sm-3">
245
+ <div class="video-header">Frieren</div>
246
+ <div class="video-container">
247
+ <iframe src="https://youtube.com/embed/Sb5Mg1-ORao"></iframe>
248
+ </div>
249
+ </div>
250
+ <div class="col-sm-3">
251
+ <div class="video-header">VATT</div>
252
+ <div class="video-container">
253
+ <iframe src="https://youtube.com/embed/eHmAGOmtDDg"></iframe>
254
+ </div>
255
+ </div>
256
+ <div class="col-sm-3">
257
+ <div class="video-header">V-AURA</div>
258
+ <div class="video-container">
259
+ <iframe src="https://youtube.com/embed/NEGa3krBrm0"></iframe>
260
+ </div>
261
+ </div>
262
+ <div class="col-sm-3">
263
+ <div class="video-header">Seeing and Hearing</div>
264
+ <div class="video-container">
265
+ <iframe src="https://youtube.com/embed/aO0EAXlwE7A"></iframe>
266
+ </div>
267
+ </div>
268
+ </div>
269
+ </div>
270
+
271
+ <div id="vgg5">
272
+ <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
273
+ <p style="overflow: hidden;">
274
+ Example 5: Playing a string instrument.
275
+ <span style="float:right;"><a href="#index">Back to index</a></span>
276
+ </p>
277
+
278
+ <div class="row g-1">
279
+ <div class="col-sm-3">
280
+ <div class="video-header">Ground-truth</div>
281
+ <div class="video-container">
282
+ <iframe src="https://youtube.com/embed/KP1QhWauIOc"></iframe>
283
+ </div>
284
+ </div>
285
+ <div class="col-sm-3">
286
+ <div class="video-header">Ours</div>
287
+ <div class="video-container">
288
+ <iframe src="https://youtube.com/embed/ovaJhWSquYE"></iframe>
289
+ </div>
290
+ </div>
291
+ <div class="col-sm-3">
292
+ <div class="video-header">V2A-Mapper</div>
293
+ <div class="video-container">
294
+ <iframe src="https://youtube.com/embed/N723FS9lcy8"></iframe>
295
+ </div>
296
+ </div>
297
+ <div class="col-sm-3">
298
+ <div class="video-header">FoleyCrafter</div>
299
+ <div class="video-container">
300
+ <iframe src="https://youtube.com/embed/t0N4ZAAXo58"></iframe>
301
+ </div>
302
+ </div>
303
+ </div>
304
+ <div class="row g-1">
305
+ <div class="col-sm-3">
306
+ <div class="video-header">Frieren</div>
307
+ <div class="video-container">
308
+ <iframe src="https://youtube.com/embed/8YSRs03QNNA"></iframe>
309
+ </div>
310
+ </div>
311
+ <div class="col-sm-3">
312
+ <div class="video-header">VATT</div>
313
+ <div class="video-container">
314
+ <iframe src="https://youtube.com/embed/vOpMz55J1kY"></iframe>
315
+ </div>
316
+ </div>
317
+ <div class="col-sm-3">
318
+ <div class="video-header">V-AURA</div>
319
+ <div class="video-container">
320
+ <iframe src="https://youtube.com/embed/9JHC75vr9h0"></iframe>
321
+ </div>
322
+ </div>
323
+ <div class="col-sm-3">
324
+ <div class="video-header">Seeing and Hearing</div>
325
+ <div class="video-container">
326
+ <iframe src="https://youtube.com/embed/9w0JckNzXmY"></iframe>
327
+ </div>
328
+ </div>
329
+ </div>
330
+ </div>
331
+
332
+ <div id="vgg6">
333
+ <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
334
+ <p style="overflow: hidden;">
335
+ Example 6: A group of people playing tambourines.
336
+ <span style="float:right;"><a href="#index">Back to index</a></span>
337
+ </p>
338
+
339
+ <div class="row g-1">
340
+ <div class="col-sm-3">
341
+ <div class="video-header">Ground-truth</div>
342
+ <div class="video-container">
343
+ <iframe src="https://youtube.com/embed/mx6JLxzUkRc"></iframe>
344
+ </div>
345
+ </div>
346
+ <div class="col-sm-3">
347
+ <div class="video-header">Ours</div>
348
+ <div class="video-container">
349
+ <iframe src="https://youtube.com/embed/oLirHhP9Su8"></iframe>
350
+ </div>
351
+ </div>
352
+ <div class="col-sm-3">
353
+ <div class="video-header">V2A-Mapper</div>
354
+ <div class="video-container">
355
+ <iframe src="https://youtube.com/embed/HkLkHMqptv0"></iframe>
356
+ </div>
357
+ </div>
358
+ <div class="col-sm-3">
359
+ <div class="video-header">FoleyCrafter</div>
360
+ <div class="video-container">
361
+ <iframe src="https://youtube.com/embed/rpHiiODjmNU"></iframe>
362
+ </div>
363
+ </div>
364
+ </div>
365
+ <div class="row g-1">
366
+ <div class="col-sm-3">
367
+ <div class="video-header">Frieren</div>
368
+ <div class="video-container">
369
+ <iframe src="https://youtube.com/embed/1mVD3fJ0LpM"></iframe>
370
+ </div>
371
+ </div>
372
+ <div class="col-sm-3">
373
+ <div class="video-header">VATT</div>
374
+ <div class="video-container">
375
+ <iframe src="https://youtube.com/embed/yjVFnJiEJlw"></iframe>
376
+ </div>
377
+ </div>
378
+ <div class="col-sm-3">
379
+ <div class="video-header">V-AURA</div>
380
+ <div class="video-container">
381
+ <iframe src="https://youtube.com/embed/neVeMSWtRkU"></iframe>
382
+ </div>
383
+ </div>
384
+ <div class="col-sm-3">
385
+ <div class="video-header">Seeing and Hearing</div>
386
+ <div class="video-container">
387
+ <iframe src="https://youtube.com/embed/EUE7YwyVWz8"></iframe>
388
+ </div>
389
+ </div>
390
+ </div>
391
+ </div>
392
+
393
+ <div id="vgg_extra">
394
+ <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
395
+ <p style="overflow: hidden;">
396
+ <span style="float:right;"><a href="#index">Back to index</a></span>
397
+ </p>
398
+
399
+ <div class="row g-1">
400
+ <div class="col-sm-3">
401
+ <div class="video-header">Moving train</div>
402
+ <div class="video-container">
403
+ <iframe src="https://youtube.com/embed/Ta6H45rBzJc"></iframe>
404
+ </div>
405
+ </div>
406
+ <div class="col-sm-3">
407
+ <div class="video-header">Water splashing</div>
408
+ <div class="video-container">
409
+ <iframe src="https://youtube.com/embed/hl6AtgHXpb4"></iframe>
410
+ </div>
411
+ </div>
412
+ <div class="col-sm-3">
413
+ <div class="video-header">Skateboarding</div>
414
+ <div class="video-container">
415
+ <iframe src="https://youtube.com/embed/n4sCNi_9buI"></iframe>
416
+ </div>
417
+ </div>
418
+ <div class="col-sm-3">
419
+ <div class="video-header">Synchronized clapping</div>
420
+ <div class="video-container">
421
+ <iframe src="https://youtube.com/embed/oxexfpLn7FE"></iframe>
422
+ </div>
423
+ </div>
424
+ </div>
425
+
426
+ <br><br>
427
+
428
+ <div id="extra-failure">
429
+ <h2 style="text-align: center;">Failure cases</h2>
430
+ <p style="overflow: hidden;">
431
+ <span style="float:right;"><a href="#index">Back to index</a></span>
432
+ </p>
433
+
434
+ <div class="row g-1">
435
+ <div class="col-sm-6">
436
+ <div class="video-header">Human speech</div>
437
+ <div class="video-container">
438
+ <iframe src="https://youtube.com/embed/nx0CyrDu70Y"></iframe>
439
+ </div>
440
+ </div>
441
+ <div class="col-sm-6">
442
+ <div class="video-header">Unfamiliar vision input</div>
443
+ <div class="video-container">
444
+ <iframe src="https://youtube.com/embed/hfnAqmK3X7w"></iframe>
445
+ </div>
446
+ </div>
447
+ </div>
448
+ </div>
449
+ </div>
450
+
451
+ </body>
452
+ </html>
gradio_demo.py ADDED
@@ -0,0 +1,343 @@
1
+ import gc
2
+ import logging
3
+ from argparse import ArgumentParser
4
+ from datetime import datetime
5
+ from fractions import Fraction
6
+ from pathlib import Path
7
+
8
+ import gradio as gr
9
+ import torch
10
+ import torchaudio
11
+
12
+ from mmaudio.eval_utils import (ModelConfig, VideoInfo, all_model_cfg, generate, load_image,
13
+ load_video, make_video, setup_eval_logging)
14
+ from mmaudio.model.flow_matching import FlowMatching
15
+ from mmaudio.model.networks import MMAudio, get_my_mmaudio
16
+ from mmaudio.model.sequence_config import SequenceConfig
17
+ from mmaudio.model.utils.features_utils import FeaturesUtils
18
+
19
+ torch.backends.cuda.matmul.allow_tf32 = True
20
+ torch.backends.cudnn.allow_tf32 = True
21
+
22
+ log = logging.getLogger()
23
+
24
+ device = 'cpu'
25
+ if torch.cuda.is_available():
26
+ device = 'cuda'
27
+ elif torch.backends.mps.is_available():
28
+ device = 'mps'
29
+ else:
30
+ log.warning('CUDA/MPS are not available, running on CPU')
31
+ dtype = torch.bfloat16
32
+
33
+ model: ModelConfig = all_model_cfg['large_44k_v2']
34
+ model.download_if_needed()
35
+ output_dir = Path('./output/gradio')
36
+
37
+ setup_eval_logging()
38
+
39
+
40
+ def get_model() -> tuple[MMAudio, FeaturesUtils, SequenceConfig]:
41
+ seq_cfg = model.seq_cfg
42
+
43
+ net: MMAudio = get_my_mmaudio(model.model_name).to(device, dtype).eval()
44
+ net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
45
+ log.info(f'Loaded weights from {model.model_path}')
46
+
47
+ feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
48
+ synchformer_ckpt=model.synchformer_ckpt,
49
+ enable_conditions=True,
50
+ mode=model.mode,
51
+ bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
52
+ need_vae_encoder=False)
53
+ feature_utils = feature_utils.to(device, dtype).eval()
54
+
55
+ return net, feature_utils, seq_cfg
56
+
57
+
58
+ net, feature_utils, seq_cfg = get_model()
59
+
60
+
61
+ @torch.inference_mode()
62
+ def video_to_audio(video: gr.Video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
63
+ cfg_strength: float, duration: float):
64
+
65
+ rng = torch.Generator(device=device)
66
+ if seed >= 0:
67
+ rng.manual_seed(seed)
68
+ else:
69
+ rng.seed()
70
+ fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
71
+
72
+ video_info = load_video(video, duration)
73
+ clip_frames = video_info.clip_frames
74
+ sync_frames = video_info.sync_frames
75
+ duration = video_info.duration_sec
76
+ clip_frames = clip_frames.unsqueeze(0)
77
+ sync_frames = sync_frames.unsqueeze(0)
78
+ seq_cfg.duration = duration
79
+ net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
80
+
81
+ audios = generate(clip_frames,
82
+ sync_frames, [prompt],
83
+ negative_text=[negative_prompt],
84
+ feature_utils=feature_utils,
85
+ net=net,
86
+ fm=fm,
87
+ rng=rng,
88
+ cfg_strength=cfg_strength)
89
+ audio = audios.float().cpu()[0]
90
+
91
+ current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
92
+ output_dir.mkdir(exist_ok=True, parents=True)
93
+ video_save_path = output_dir / f'{current_time_string}.mp4'
94
+ make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
95
+ gc.collect()
96
+ return video_save_path
97
+
98
+
99
+ @torch.inference_mode()
100
+ def image_to_audio(image: gr.Image, prompt: str, negative_prompt: str, seed: int, num_steps: int,
101
+ cfg_strength: float, duration: float):
102
+
103
+ rng = torch.Generator(device=device)
104
+ if seed >= 0:
105
+ rng.manual_seed(seed)
106
+ else:
107
+ rng.seed()
108
+ fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
109
+
110
+ image_info = load_image(image)
111
+ clip_frames = image_info.clip_frames
112
+ sync_frames = image_info.sync_frames
113
+ clip_frames = clip_frames.unsqueeze(0)
114
+ sync_frames = sync_frames.unsqueeze(0)
115
+ seq_cfg.duration = duration
116
+ net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
117
+
118
+ audios = generate(clip_frames,
119
+ sync_frames, [prompt],
120
+ negative_text=[negative_prompt],
121
+ feature_utils=feature_utils,
122
+ net=net,
123
+ fm=fm,
124
+ rng=rng,
125
+ cfg_strength=cfg_strength,
126
+ image_input=True)
127
+ audio = audios.float().cpu()[0]
128
+
129
+ current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
130
+ output_dir.mkdir(exist_ok=True, parents=True)
131
+ video_save_path = output_dir / f'{current_time_string}.mp4'
132
+ video_info = VideoInfo.from_image_info(image_info, duration, fps=Fraction(1))
133
+ make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
134
+ gc.collect()
135
+ return video_save_path
136
+
137
+
138
+ @torch.inference_mode()
139
+ def text_to_audio(prompt: str, negative_prompt: str, seed: int, num_steps: int, cfg_strength: float,
140
+ duration: float):
141
+
142
+ rng = torch.Generator(device=device)
143
+ if seed >= 0:
144
+ rng.manual_seed(seed)
145
+ else:
146
+ rng.seed()
147
+ fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
148
+
149
+ clip_frames = sync_frames = None
150
+ seq_cfg.duration = duration
151
+ net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
152
+
153
+ audios = generate(clip_frames,
154
+ sync_frames, [prompt],
155
+ negative_text=[negative_prompt],
156
+ feature_utils=feature_utils,
157
+ net=net,
158
+ fm=fm,
159
+ rng=rng,
160
+ cfg_strength=cfg_strength)
161
+ audio = audios.float().cpu()[0]
162
+
163
+ current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
164
+ output_dir.mkdir(exist_ok=True, parents=True)
165
+ audio_save_path = output_dir / f'{current_time_string}.flac'
166
+ torchaudio.save(audio_save_path, audio, seq_cfg.sampling_rate)
167
+ gc.collect()
168
+ return audio_save_path
169
+
170
+
171
+ video_to_audio_tab = gr.Interface(
172
+ fn=video_to_audio,
173
+ description="""
174
+ Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
175
+ Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
176
+
177
+ NOTE: It takes longer to process high-resolution videos (>384 px on the shorter side).
178
+ Doing so does not improve results.
179
+ """,
180
+ inputs=[
181
+ gr.Video(),
182
+ gr.Text(label='Prompt'),
183
+ gr.Text(label='Negative prompt', value='music'),
184
+ gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
185
+ gr.Number(label='Num steps', value=25, precision=0, minimum=1),
186
+ gr.Number(label='Guidance Strength', value=4.5, minimum=1),
187
+ gr.Number(label='Duration (sec)', value=8, minimum=1),
188
+ ],
189
+ outputs='playable_video',
190
+ cache_examples=False,
191
+ title='MMAudio — Video-to-Audio Synthesis',
192
+ examples=[
193
+ [
194
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_beach.mp4',
195
+ 'waves, seagulls',
196
+ '',
197
+ 0,
198
+ 25,
199
+ 4.5,
200
+ 10,
201
+ ],
202
+ [
203
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_serpent.mp4',
204
+ '',
205
+ 'music',
206
+ 0,
207
+ 25,
208
+ 4.5,
209
+ 10,
210
+ ],
211
+ [
212
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_seahorse.mp4',
213
+ 'bubbles',
214
+ '',
215
+ 0,
216
+ 25,
217
+ 4.5,
218
+ 10,
219
+ ],
220
+ [
221
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_india.mp4',
222
+ 'Indian holy music',
223
+ '',
224
+ 0,
225
+ 25,
226
+ 4.5,
227
+ 10,
228
+ ],
229
+ [
230
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_galloping.mp4',
231
+ 'galloping',
232
+ '',
233
+ 0,
234
+ 25,
235
+ 4.5,
236
+ 10,
237
+ ],
238
+ [
239
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_kraken.mp4',
240
+ 'waves, storm',
241
+ '',
242
+ 0,
243
+ 25,
244
+ 4.5,
245
+ 10,
246
+ ],
247
+ [
248
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/mochi_storm.mp4',
249
+ 'storm',
250
+ '',
251
+ 0,
252
+ 25,
253
+ 4.5,
254
+ 10,
255
+ ],
256
+ [
257
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_spring.mp4',
258
+ '',
259
+ '',
260
+ 0,
261
+ 25,
262
+ 4.5,
263
+ 10,
264
+ ],
265
+ [
266
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_typing.mp4',
267
+ 'typing',
268
+ '',
269
+ 0,
270
+ 25,
271
+ 4.5,
272
+ 10,
273
+ ],
274
+ [
275
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_wake_up.mp4',
276
+ '',
277
+ '',
278
+ 0,
279
+ 25,
280
+ 4.5,
281
+ 10,
282
+ ],
283
+ [
284
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_nyc.mp4',
285
+ '',
286
+ '',
287
+ 0,
288
+ 25,
289
+ 4.5,
290
+ 10,
291
+ ],
292
+ ])
293
+
294
+ text_to_audio_tab = gr.Interface(
295
+ fn=text_to_audio,
296
+ description="""
297
+ Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
298
+ Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
299
+ """,
300
+ inputs=[
301
+ gr.Text(label='Prompt'),
302
+ gr.Text(label='Negative prompt'),
303
+ gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
304
+ gr.Number(label='Num steps', value=25, precision=0, minimum=1),
305
+ gr.Number(label='Guidance Strength', value=4.5, minimum=1),
306
+ gr.Number(label='Duration (sec)', value=8, minimum=1),
307
+ ],
308
+ outputs='audio',
309
+ cache_examples=False,
310
+ title='MMAudio — Text-to-Audio Synthesis',
311
+ )
312
+
313
+ image_to_audio_tab = gr.Interface(
314
+ fn=image_to_audio,
315
+ description="""
316
+ Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
317
+ Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
318
+
319
+ NOTE: It takes longer to process high-resolution images (>384 px on the shorter side).
320
+ Doing so does not improve results.
321
+ """,
322
+ inputs=[
323
+ gr.Image(type='filepath'),
324
+ gr.Text(label='Prompt'),
325
+ gr.Text(label='Negative prompt'),
326
+ gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
327
+ gr.Number(label='Num steps', value=25, precision=0, minimum=1),
328
+ gr.Number(label='Guidance Strength', value=4.5, minimum=1),
329
+ gr.Number(label='Duration (sec)', value=8, minimum=1),
330
+ ],
331
+ outputs='playable_video',
332
+ cache_examples=False,
333
+ title='MMAudio — Image-to-Audio Synthesis (experimental)',
334
+ )
335
+
336
+ if __name__ == "__main__":
337
+ parser = ArgumentParser()
338
+ parser.add_argument('--port', type=int, default=7860)
339
+ args = parser.parse_args()
340
+
341
+ gr.TabbedInterface([video_to_audio_tab, text_to_audio_tab, image_to_audio_tab],
342
+ ['Video-to-Audio', 'Text-to-Audio', 'Image-to-Audio (experimental)']).launch(
343
+ server_port=args.port, allowed_paths=[output_dir])
mmaudio/__init__.py ADDED
File without changes
mmaudio/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (187 Bytes). View file
 
mmaudio/__pycache__/__init__.cpython-38.pyc ADDED
Binary file (185 Bytes). View file
 
mmaudio/__pycache__/eval_utils.cpython-310.pyc ADDED
Binary file (7.07 kB). View file
 
mmaudio/__pycache__/eval_utils.cpython-38.pyc ADDED
Binary file (7.03 kB). View file
 
mmaudio/data/__init__.py ADDED
File without changes
mmaudio/data/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (192 Bytes). View file
 
mmaudio/data/__pycache__/__init__.cpython-38.pyc ADDED
Binary file (190 Bytes). View file
 
mmaudio/data/__pycache__/av_utils.cpython-310.pyc ADDED
Binary file (4.91 kB). View file
 
mmaudio/data/__pycache__/av_utils.cpython-38.pyc ADDED
Binary file (4.89 kB). View file
 
mmaudio/data/av_utils.py ADDED
@@ -0,0 +1,162 @@
1
+ from dataclasses import dataclass
2
+ from fractions import Fraction
3
+ from pathlib import Path
4
+ from typing import Optional, List, Tuple
5
+
6
+ import av
7
+ import numpy as np
8
+ import torch
9
+ from av import AudioFrame
10
+
11
+
12
+ @dataclass
13
+ class VideoInfo:
14
+ duration_sec: float
15
+ fps: Fraction
16
+ clip_frames: torch.Tensor
17
+ sync_frames: torch.Tensor
18
+ all_frames: Optional[List[np.ndarray]]
19
+
20
+ @property
21
+ def height(self):
22
+ return self.all_frames[0].shape[0]
23
+
24
+ @property
25
+ def width(self):
26
+ return self.all_frames[0].shape[1]
27
+
28
+ @classmethod
29
+ def from_image_info(cls, image_info: 'ImageInfo', duration_sec: float,
30
+ fps: Fraction) -> 'VideoInfo':
31
+ num_frames = int(duration_sec * fps)
32
+ all_frames = [image_info.original_frame] * num_frames
33
+ return cls(duration_sec=duration_sec,
34
+ fps=fps,
35
+ clip_frames=image_info.clip_frames,
36
+ sync_frames=image_info.sync_frames,
37
+ all_frames=all_frames)
38
+
39
+
40
+ @dataclass
41
+ class ImageInfo:
42
+ clip_frames: torch.Tensor
43
+ sync_frames: torch.Tensor
44
+ original_frame: Optional[np.ndarray]
45
+
46
+ @property
47
+ def height(self):
48
+ return self.original_frame.shape[0]
49
+
50
+ @property
51
+ def width(self):
52
+ return self.original_frame.shape[1]
53
+
54
+
55
+ def read_frames(video_path: Path, list_of_fps: List[float], start_sec: float, end_sec: float,
56
+ need_all_frames: bool) -> Tuple[List[np.ndarray], List[np.ndarray], Fraction]:
57
+ output_frames = [[] for _ in list_of_fps]
58
+ next_frame_time_for_each_fps = [0.0 for _ in list_of_fps]
59
+ time_delta_for_each_fps = [1 / fps for fps in list_of_fps]
60
+ all_frames = []
61
+
62
+ # container = av.open(video_path)
63
+ with av.open(video_path) as container:
64
+ stream = container.streams.video[0]
65
+ fps = stream.guessed_rate
66
+ stream.thread_type = 'AUTO'
67
+ for packet in container.demux(stream):
68
+ for frame in packet.decode():
69
+ frame_time = frame.time
70
+ if frame_time < start_sec:
71
+ continue
72
+ if frame_time > end_sec:
73
+ break
74
+
75
+ frame_np = None
76
+ if need_all_frames:
77
+ frame_np = frame.to_ndarray(format='rgb24')
78
+ all_frames.append(frame_np)
79
+
80
+ for i, _ in enumerate(list_of_fps):
81
+ this_time = frame_time
82
+ while this_time >= next_frame_time_for_each_fps[i]:
83
+ if frame_np is None:
84
+ frame_np = frame.to_ndarray(format='rgb24')
85
+
86
+ output_frames[i].append(frame_np)
87
+ next_frame_time_for_each_fps[i] += time_delta_for_each_fps[i]
88
+
89
+ output_frames = [np.stack(frames) for frames in output_frames]
90
+ return output_frames, all_frames, fps
91
+
92
+
93
+ def reencode_with_audio(video_info: VideoInfo, output_path: Path, audio: torch.Tensor,
94
+ sampling_rate: int):
95
+ container = av.open(output_path, 'w')
96
+ output_video_stream = container.add_stream('h264', video_info.fps)
97
+ output_video_stream.codec_context.bit_rate = 10 * 1e6 # 10 Mbps
98
+ output_video_stream.width = video_info.width
99
+ output_video_stream.height = video_info.height
100
+ output_video_stream.pix_fmt = 'yuv420p'
101
+
102
+ output_audio_stream = container.add_stream('aac', sampling_rate)
103
+
104
+ # encode video
105
+ for image in video_info.all_frames:
106
+ image = av.VideoFrame.from_ndarray(image)
107
+ packet = output_video_stream.encode(image)
108
+ container.mux(packet)
109
+
110
+ for packet in output_video_stream.encode():
111
+ container.mux(packet)
112
+
113
+ # convert float tensor audio to numpy array
114
+ audio_np = audio.numpy().astype(np.float32)
115
+ audio_frame = AudioFrame.from_ndarray(audio_np, format='flt', layout='mono')
116
+ audio_frame.sample_rate = sampling_rate
117
+
118
+ for packet in output_audio_stream.encode(audio_frame):
119
+ container.mux(packet)
120
+
121
+ for packet in output_audio_stream.encode():
122
+ container.mux(packet)
123
+
124
+ container.close()
125
+
126
+
127
+ def remux_with_audio(video_path: Path, audio: torch.Tensor, output_path: Path, sampling_rate: int):
128
+ """
129
+ NOTE: we cannot get the exact video duration right without re-encoding,
130
+ so this function is unused; it is kept here for reference.
131
+ """
132
+ video = av.open(video_path)
133
+ output = av.open(output_path, 'w')
134
+ input_video_stream = video.streams.video[0]
135
+ output_video_stream = output.add_stream(template=input_video_stream)
136
+ output_audio_stream = output.add_stream('aac', sampling_rate)
137
+
138
+ duration_sec = audio.shape[-1] / sampling_rate
139
+
140
+ for packet in video.demux(input_video_stream):
141
+ # We need to skip the "flushing" packets that `demux` generates.
142
+ if packet.dts is None:
143
+ continue
144
+ # We need to assign the packet to the new stream.
145
+ packet.stream = output_video_stream
146
+ output.mux(packet)
147
+
148
+ # convert float tensor audio to numpy array
149
+ audio_np = audio.numpy().astype(np.float32)
150
+ audio_frame = av.AudioFrame.from_ndarray(audio_np, format='flt', layout='mono')
151
+ audio_frame.sample_rate = sampling_rate
152
+
153
+ for packet in output_audio_stream.encode(audio_frame):
154
+ output.mux(packet)
155
+
156
+ for packet in output_audio_stream.encode():
157
+ output.mux(packet)
158
+
159
+ video.close()
160
+ output.close()
161
+
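For reference, a sketch (not part of the commit) of how the helpers above fit together: decode frames at the two frame rates used elsewhere in this commit, then mux a waveform back into a new video. The file names and the silent placeholder waveform are illustrative.

# Sketch: read_frames() at CLIP (8 fps) and Synchformer (25 fps) rates, then
# reencode_with_audio() to write a video with a placeholder audio track.
from pathlib import Path

import torch

from mmaudio.data.av_utils import VideoInfo, read_frames, reencode_with_audio

video_path = Path('video.mp4')  # assumed to exist
(clip_np, sync_np), all_frames, fps = read_frames(video_path,
                                                  list_of_fps=[8.0, 25.0],
                                                  start_sec=0,
                                                  end_sec=8,
                                                  need_all_frames=True)
info = VideoInfo(duration_sec=8.0,
                 fps=fps,
                 clip_frames=torch.from_numpy(clip_np),
                 sync_frames=torch.from_numpy(sync_np),
                 all_frames=all_frames)
waveform = torch.zeros(1, 16_000 * 8)  # 8 s of silence at 16 kHz, (channels, samples)
reencode_with_audio(info, Path('with_audio.mp4'), waveform, sampling_rate=16_000)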
mmaudio/data/data_setup.py ADDED
@@ -0,0 +1,174 @@
1
+ import logging
2
+ import random
3
+
4
+ import numpy as np
5
+ import torch
6
+ from omegaconf import DictConfig
7
+ from torch.utils.data import DataLoader, Dataset
8
+ from torch.utils.data.dataloader import default_collate
9
+ from torch.utils.data.distributed import DistributedSampler
10
+
11
+ from mmaudio.data.eval.audiocaps import AudioCapsData
12
+ from mmaudio.data.eval.video_dataset import MovieGen, VGGSound
13
+ from mmaudio.data.extracted_audio import ExtractedAudio
14
+ from mmaudio.data.extracted_vgg import ExtractedVGG
15
+ from mmaudio.data.mm_dataset import MultiModalDataset
16
+ from mmaudio.utils.dist_utils import local_rank
17
+
18
+ log = logging.getLogger()
19
+
20
+
21
+ # Re-seed randomness every time we start a worker
22
+ def worker_init_fn(worker_id: int):
23
+ worker_seed = torch.initial_seed() % (2**31) + worker_id + local_rank * 1000
24
+ np.random.seed(worker_seed)
25
+ random.seed(worker_seed)
26
+ log.debug(f'Worker {worker_id} re-seeded with seed {worker_seed} in rank {local_rank}')
27
+
28
+
29
+ def load_vgg_data(cfg: DictConfig, data_cfg: DictConfig) -> Dataset:
30
+ dataset = ExtractedVGG(tsv_path=data_cfg.tsv,
31
+ data_dim=cfg.data_dim,
32
+ premade_mmap_dir=data_cfg.memmap_dir)
33
+
34
+ return dataset
35
+
36
+
37
+ def load_audio_data(cfg: DictConfig, data_cfg: DictConfig) -> Dataset:
38
+ dataset = ExtractedAudio(tsv_path=data_cfg.tsv,
39
+ data_dim=cfg.data_dim,
40
+ premade_mmap_dir=data_cfg.memmap_dir)
41
+
42
+ return dataset
43
+
44
+
45
+ def setup_training_datasets(cfg: DictConfig) -> tuple[Dataset, DistributedSampler, DataLoader]:
46
+ if cfg.mini_train:
47
+ vgg = load_vgg_data(cfg, cfg.data.ExtractedVGG_val)
48
+ audiocaps = load_audio_data(cfg, cfg.data.AudioCaps)
49
+ dataset = MultiModalDataset([vgg], [audiocaps])
50
+ if cfg.example_train:
51
+ video = load_vgg_data(cfg, cfg.data.Example_video)
52
+ audio = load_audio_data(cfg, cfg.data.Example_audio)
53
+ dataset = MultiModalDataset([video], [audio])
54
+ else:
55
+ # load the largest one first
56
+ freesound = load_audio_data(cfg, cfg.data.FreeSound)
57
+ vgg = load_vgg_data(cfg, cfg.data.ExtractedVGG)
58
+ audiocaps = load_audio_data(cfg, cfg.data.AudioCaps)
59
+ audioset_sl = load_audio_data(cfg, cfg.data.AudioSetSL)
60
+ bbcsound = load_audio_data(cfg, cfg.data.BBCSound)
61
+ clotho = load_audio_data(cfg, cfg.data.Clotho)
62
+ dataset = MultiModalDataset([vgg] * cfg.vgg_oversample_rate,
63
+ [audiocaps, audioset_sl, bbcsound, freesound, clotho])
64
+
65
+ batch_size = cfg.batch_size
66
+ num_workers = cfg.num_workers
67
+ pin_memory = cfg.pin_memory
68
+ sampler, loader = construct_loader(dataset,
69
+ batch_size,
70
+ num_workers,
71
+ shuffle=True,
72
+ drop_last=True,
73
+ pin_memory=pin_memory)
74
+
75
+ return dataset, sampler, loader
76
+
77
+
78
+ def setup_test_datasets(cfg):
79
+ dataset = load_vgg_data(cfg, cfg.data.ExtractedVGG_test)
80
+
81
+ batch_size = cfg.batch_size
82
+ num_workers = cfg.num_workers
83
+ pin_memory = cfg.pin_memory
84
+ sampler, loader = construct_loader(dataset,
85
+ batch_size,
86
+ num_workers,
87
+ shuffle=False,
88
+ drop_last=False,
89
+ pin_memory=pin_memory)
90
+
91
+ return dataset, sampler, loader
92
+
93
+
94
+ def setup_val_datasets(cfg: DictConfig) -> tuple[Dataset, DataLoader, DataLoader]:
95
+ if cfg.example_train:
96
+ dataset = load_vgg_data(cfg, cfg.data.Example_video)
97
+ else:
98
+ dataset = load_vgg_data(cfg, cfg.data.ExtractedVGG_val)
99
+
100
+ val_batch_size = cfg.batch_size
101
+ val_eval_batch_size = cfg.eval_batch_size
102
+ num_workers = cfg.num_workers
103
+ pin_memory = cfg.pin_memory
104
+ _, val_loader = construct_loader(dataset,
105
+ val_batch_size,
106
+ num_workers,
107
+ shuffle=False,
108
+ drop_last=False,
109
+ pin_memory=pin_memory)
110
+ _, eval_loader = construct_loader(dataset,
111
+ val_eval_batch_size,
112
+ num_workers,
113
+ shuffle=False,
114
+ drop_last=False,
115
+ pin_memory=pin_memory)
116
+
117
+ return dataset, val_loader, eval_loader
118
+
119
+
120
+ def setup_eval_dataset(dataset_name: str, cfg: DictConfig) -> tuple[Dataset, DataLoader]:
121
+ if dataset_name.startswith('audiocaps_full'):
122
+ dataset = AudioCapsData(cfg.eval_data.AudioCaps_full.audio_path,
123
+ cfg.eval_data.AudioCaps_full.csv_path)
124
+ elif dataset_name.startswith('audiocaps'):
125
+ dataset = AudioCapsData(cfg.eval_data.AudioCaps.audio_path,
126
+ cfg.eval_data.AudioCaps.csv_path)
127
+ elif dataset_name.startswith('moviegen'):
128
+ dataset = MovieGen(cfg.eval_data.MovieGen.video_path,
129
+ cfg.eval_data.MovieGen.jsonl_path,
130
+ duration_sec=cfg.duration_s)
131
+ elif dataset_name.startswith('vggsound'):
132
+ dataset = VGGSound(cfg.eval_data.VGGSound.video_path,
133
+ cfg.eval_data.VGGSound.csv_path,
134
+ duration_sec=cfg.duration_s)
135
+ else:
136
+ raise ValueError(f'Invalid dataset name: {dataset_name}')
137
+
138
+ batch_size = cfg.batch_size
139
+ num_workers = cfg.num_workers
140
+ pin_memory = cfg.pin_memory
141
+ _, loader = construct_loader(dataset,
142
+ batch_size,
143
+ num_workers,
144
+ shuffle=False,
145
+ drop_last=False,
146
+ pin_memory=pin_memory,
147
+ error_avoidance=True)
148
+ return dataset, loader
149
+
150
+
151
+ def error_avoidance_collate(batch):
152
+ batch = list(filter(lambda x: x is not None, batch))
153
+ return default_collate(batch)
154
+
155
+
156
+ def construct_loader(dataset: Dataset,
157
+ batch_size: int,
158
+ num_workers: int,
159
+ *,
160
+ shuffle: bool = True,
161
+ drop_last: bool = True,
162
+ pin_memory: bool = False,
163
+ error_avoidance: bool = False) -> tuple[DistributedSampler, DataLoader]:
164
+ train_sampler = DistributedSampler(dataset, rank=local_rank, shuffle=shuffle)
165
+ train_loader = DataLoader(dataset,
166
+ batch_size,
167
+ sampler=train_sampler,
168
+ num_workers=num_workers,
169
+ worker_init_fn=worker_init_fn,
170
+ drop_last=drop_last,
171
+ persistent_workers=num_workers > 0,
172
+ pin_memory=pin_memory,
173
+ collate_fn=error_avoidance_collate if error_avoidance else None)
174
+ return train_sampler, train_loader
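As a self-contained illustration (not part of the commit) of why error_avoidance_collate exists: several datasets later in this commit return None when decoding fails, and the collate function simply drops those entries so a standard DataLoader keeps running, as in this sketch with a toy dataset:

# Sketch: dropping failed samples (None) before collation, mirroring
# error_avoidance_collate above; ToyDataset stands in for the video datasets.
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.dataloader import default_collate


class ToyDataset(Dataset):

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        if idx % 3 == 0:  # pretend every third sample fails to decode
            return None
        return {'x': torch.tensor(idx)}


def drop_none_collate(batch):
    return default_collate([b for b in batch if b is not None])


loader = DataLoader(ToyDataset(), batch_size=4, collate_fn=drop_none_collate)
for batch in loader:
    print(batch['x'])  # batches are simply smaller whenever samples failed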
mmaudio/data/eval/__init__.py ADDED
File without changes
mmaudio/data/eval/audiocaps.py ADDED
@@ -0,0 +1,39 @@
1
+ import logging
2
+ import os
3
+ from collections import defaultdict
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import pandas as pd
8
+ import torch
9
+ from torch.utils.data.dataset import Dataset
10
+
11
+ log = logging.getLogger()
12
+
13
+
14
+ class AudioCapsData(Dataset):
15
+
16
+ def __init__(self, audio_path: Union[str, Path], csv_path: Union[str, Path]):
17
+ df = pd.read_csv(csv_path).to_dict(orient='records')
18
+
19
+ audio_files = sorted(os.listdir(audio_path))
20
+ audio_files = set(
21
+ [Path(f).stem for f in audio_files if f.endswith('.wav') or f.endswith('.flac')])
22
+
23
+ self.data = []
24
+ for row in df:
25
+ self.data.append({
26
+ 'name': row['name'],
27
+ 'caption': row['caption'],
28
+ })
29
+
30
+ self.audio_path = Path(audio_path)
31
+ self.csv_path = Path(csv_path)
32
+
33
+ log.info(f'Found {len(self.data)} matching audio files in {self.audio_path}')
34
+
35
+ def __getitem__(self, idx: int) -> dict[str, str]:
36
+ return self.data[idx]
37
+
38
+ def __len__(self):
39
+ return len(self.data)
mmaudio/data/eval/moviegen.py ADDED
@@ -0,0 +1,131 @@
1
+ import json
2
+ import logging
3
+ import os
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import torch
8
+ from torch.utils.data.dataset import Dataset
9
+ from torchvision.transforms import v2
10
+ from torio.io import StreamingMediaDecoder
11
+
12
+ from mmaudio.utils.dist_utils import local_rank
13
+
14
+ log = logging.getLogger()
15
+
16
+ _CLIP_SIZE = 384
17
+ _CLIP_FPS = 8.0
18
+
19
+ _SYNC_SIZE = 224
20
+ _SYNC_FPS = 25.0
21
+
22
+
23
+ class MovieGenData(Dataset):
24
+
25
+ def __init__(
26
+ self,
27
+ video_root: Union[str, Path],
28
+ sync_root: Union[str, Path],
29
+ jsonl_root: Union[str, Path],
30
+ *,
31
+ duration_sec: float = 10.0,
32
+ read_clip: bool = True,
33
+ ):
34
+ self.video_root = Path(video_root)
35
+ self.sync_root = Path(sync_root)
36
+ self.jsonl_root = Path(jsonl_root)
37
+ self.read_clip = read_clip
38
+
39
+ videos = sorted(os.listdir(self.video_root))
40
+ videos = [v[:-4] for v in videos] # remove extensions
41
+ self.captions = {}
42
+
43
+ for v in videos:
44
+ with open(self.jsonl_root / (v + '.jsonl')) as f:
45
+ data = json.load(f)
46
+ self.captions[v] = data['audio_prompt']
47
+
48
+ if local_rank == 0:
49
+ log.info(f'{len(videos)} videos found in {video_root}')
50
+
51
+ self.duration_sec = duration_sec
52
+
53
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
54
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
55
+
56
+ self.clip_augment = v2.Compose([
57
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
58
+ v2.ToImage(),
59
+ v2.ToDtype(torch.float32, scale=True),
60
+ ])
61
+
62
+ self.sync_augment = v2.Compose([
63
+ v2.Resize((_SYNC_SIZE, _SYNC_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
64
+ v2.CenterCrop(_SYNC_SIZE),
65
+ v2.ToImage(),
66
+ v2.ToDtype(torch.float32, scale=True),
67
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
68
+ ])
69
+
70
+ self.videos = videos
71
+
72
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
73
+ video_id = self.videos[idx]
74
+ caption = self.captions[video_id]
75
+
76
+ reader = StreamingMediaDecoder(self.video_root / (video_id + '.mp4'))
77
+ reader.add_basic_video_stream(
78
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
79
+ frame_rate=_CLIP_FPS,
80
+ format='rgb24',
81
+ )
82
+ reader.add_basic_video_stream(
83
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
84
+ frame_rate=_SYNC_FPS,
85
+ format='rgb24',
86
+ )
87
+
88
+ reader.fill_buffer()
89
+ data_chunk = reader.pop_chunks()
90
+
91
+ clip_chunk = data_chunk[0]
92
+ sync_chunk = data_chunk[1]
93
+ if clip_chunk is None:
94
+ raise RuntimeError(f'CLIP video returned None {video_id}')
95
+ if clip_chunk.shape[0] < self.clip_expected_length:
96
+ raise RuntimeError(f'CLIP video too short {video_id}')
97
+
98
+ if sync_chunk is None:
99
+ raise RuntimeError(f'Sync video returned None {video_id}')
100
+ if sync_chunk.shape[0] < self.sync_expected_length:
101
+ raise RuntimeError(f'Sync video too short {video_id}')
102
+
103
+ # truncate the video
104
+ clip_chunk = clip_chunk[:self.clip_expected_length]
105
+ if clip_chunk.shape[0] != self.clip_expected_length:
106
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
107
+ f'expected {self.clip_expected_length}, '
108
+ f'got {clip_chunk.shape[0]}')
109
+ clip_chunk = self.clip_augment(clip_chunk)
110
+
111
+ sync_chunk = sync_chunk[:self.sync_expected_length]
112
+ if sync_chunk.shape[0] != self.sync_expected_length:
113
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
114
+ f'expected {self.sync_expected_length}, '
115
+ f'got {sync_chunk.shape[0]}')
116
+ sync_chunk = self.sync_augment(sync_chunk)
117
+
118
+ data = {
119
+ 'name': video_id,
120
+ 'caption': caption,
121
+ 'clip_video': clip_chunk,
122
+ 'sync_video': sync_chunk,
123
+ }
124
+
125
+ return data
126
+
127
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
128
+ return self.sample(idx)
129
+
130
+ def __len__(self):
131
+ return len(self.captions)
mmaudio/data/eval/video_dataset.py ADDED
@@ -0,0 +1,197 @@
1
+ import json
2
+ import logging
3
+ import os
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import pandas as pd
8
+ import torch
9
+ from torch.utils.data.dataset import Dataset
10
+ from torchvision.transforms import v2
11
+ from torio.io import StreamingMediaDecoder
12
+
13
+ from mmaudio.utils.dist_utils import local_rank
14
+
15
+ log = logging.getLogger()
16
+
17
+ _CLIP_SIZE = 384
18
+ _CLIP_FPS = 8.0
19
+
20
+ _SYNC_SIZE = 224
21
+ _SYNC_FPS = 25.0
22
+
23
+
24
+ class VideoDataset(Dataset):
25
+
26
+ def __init__(
27
+ self,
28
+ video_root: Union[str, Path],
29
+ *,
30
+ duration_sec: float = 8.0,
31
+ ):
32
+ self.video_root = Path(video_root)
33
+
34
+ self.duration_sec = duration_sec
35
+
36
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
37
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
38
+
39
+ self.clip_transform = v2.Compose([
40
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
41
+ v2.ToImage(),
42
+ v2.ToDtype(torch.float32, scale=True),
43
+ ])
44
+
45
+ self.sync_transform = v2.Compose([
46
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
47
+ v2.CenterCrop(_SYNC_SIZE),
48
+ v2.ToImage(),
49
+ v2.ToDtype(torch.float32, scale=True),
50
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
51
+ ])
52
+
53
+ # to be implemented by subclasses
54
+ self.captions = {}
55
+ self.videos = sorted(list(self.captions.keys()))
56
+
57
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
58
+ video_id = self.videos[idx]
59
+ caption = self.captions[video_id]
60
+
61
+ reader = StreamingMediaDecoder(self.video_root / (video_id + '.mp4'))
62
+ reader.add_basic_video_stream(
63
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
64
+ frame_rate=_CLIP_FPS,
65
+ format='rgb24',
66
+ )
67
+ reader.add_basic_video_stream(
68
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
69
+ frame_rate=_SYNC_FPS,
70
+ format='rgb24',
71
+ )
72
+
73
+ reader.fill_buffer()
74
+ data_chunk = reader.pop_chunks()
75
+
76
+ clip_chunk = data_chunk[0]
77
+ sync_chunk = data_chunk[1]
78
+ if clip_chunk is None:
79
+ raise RuntimeError(f'CLIP video returned None {video_id}')
80
+ if clip_chunk.shape[0] < self.clip_expected_length:
81
+ raise RuntimeError(
82
+ f'CLIP video too short {video_id}, expected {self.clip_expected_length}, got {clip_chunk.shape[0]}'
83
+ )
84
+
85
+ if sync_chunk is None:
86
+ raise RuntimeError(f'Sync video returned None {video_id}')
87
+ if sync_chunk.shape[0] < self.sync_expected_length:
88
+ raise RuntimeError(
89
+ f'Sync video too short {video_id}, expected {self.sync_expected_length}, got {sync_chunk.shape[0]}'
90
+ )
91
+
92
+ # truncate the video
93
+ clip_chunk = clip_chunk[:self.clip_expected_length]
94
+ if clip_chunk.shape[0] != self.clip_expected_length:
95
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
96
+ f'expected {self.clip_expected_length}, '
97
+ f'got {clip_chunk.shape[0]}')
98
+ clip_chunk = self.clip_transform(clip_chunk)
99
+
100
+ sync_chunk = sync_chunk[:self.sync_expected_length]
101
+ if sync_chunk.shape[0] != self.sync_expected_length:
102
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
103
+ f'expected {self.sync_expected_length}, '
104
+ f'got {sync_chunk.shape[0]}')
105
+ sync_chunk = self.sync_transform(sync_chunk)
106
+
107
+ data = {
108
+ 'name': video_id,
109
+ 'caption': caption,
110
+ 'clip_video': clip_chunk,
111
+ 'sync_video': sync_chunk,
112
+ }
113
+
114
+ return data
115
+
116
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
117
+ try:
118
+ return self.sample(idx)
119
+ except Exception as e:
120
+ log.error(f'Error loading video {self.videos[idx]}: {e}')
121
+ return None
122
+
123
+ def __len__(self):
124
+ return len(self.captions)
125
+
126
+
127
+ class VGGSound(VideoDataset):
128
+
129
+ def __init__(
130
+ self,
131
+ video_root: Union[str, Path],
132
+ csv_path: Union[str, Path],
133
+ *,
134
+ duration_sec: float = 8.0,
135
+ ):
136
+ super().__init__(video_root, duration_sec=duration_sec)
137
+ self.video_root = Path(video_root)
138
+ self.csv_path = Path(csv_path)
139
+
140
+ videos = sorted(os.listdir(self.video_root))
141
+ if local_rank == 0:
142
+ log.info(f'{len(videos)} videos found in {video_root}')
143
+ self.captions = {}
144
+
145
+ df = pd.read_csv(csv_path, header=None, names=['id', 'sec', 'caption',
146
+ 'split']).to_dict(orient='records')
147
+
148
+ videos_no_found = []
149
+ for row in df:
150
+ if row['split'] == 'test':
151
+ start_sec = int(row['sec'])
152
+ video_id = str(row['id'])
153
+ # this is how our videos are named
154
+ video_name = f'{video_id}_{start_sec:06d}'
155
+ if video_name + '.mp4' not in videos:
156
+ videos_no_found.append(video_name)
157
+ continue
158
+
159
+ self.captions[video_name] = row['caption']
160
+
161
+ if local_rank == 0:
162
+ log.info(f'{len(videos)} videos found in {video_root}')
163
+ log.info(f'{len(self.captions)} useable videos found')
164
+ if videos_no_found:
165
+ log.info(f'{len(videos_no_found)} found in {csv_path} but not in {video_root}')
166
+ log.info(
167
+ 'A small amount is expected, as not all videos are still available on YouTube')
168
+
169
+ self.videos = sorted(list(self.captions.keys()))
170
+
171
+
172
+ class MovieGen(VideoDataset):
173
+
174
+ def __init__(
175
+ self,
176
+ video_root: Union[str, Path],
177
+ jsonl_root: Union[str, Path],
178
+ *,
179
+ duration_sec: float = 10.0,
180
+ ):
181
+ super().__init__(video_root, duration_sec=duration_sec)
182
+ self.video_root = Path(video_root)
183
+ self.jsonl_root = Path(jsonl_root)
184
+
185
+ videos = sorted(os.listdir(self.video_root))
186
+ videos = [v[:-4] for v in videos] # remove extensions
187
+ self.captions = {}
188
+
189
+ for v in videos:
190
+ with open(self.jsonl_root / (v + '.jsonl')) as f:
191
+ data = json.load(f)
192
+ self.captions[v] = data['audio_prompt']
193
+
194
+ if local_rank == 0:
195
+ log.info(f'{len(videos)} videos found in {video_root}')
196
+
197
+ self.videos = videos
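For reference, a sketch (not part of the commit) of the inputs the VGGSound evaluation dataset above expects: a header-less CSV with id, sec, caption, split columns and videos named `{id}_{start_sec:06d}.mp4` under video_root. All values below are illustrative.

# Sketch: a tiny VGGSound-style CSV plus the matching video file name,
# following the column order and naming convention used by the class above.
import csv

rows = [
    # id             sec  caption               split
    ('abc123xyz00',   30, 'dog barking',        'test'),
    ('def456uvw11',  120, 'playing accordion',  'train'),  # skipped: not 'test'
]
with open('vggsound_test.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

video_id, start_sec = rows[0][0], rows[0][1]
print(f'{video_id}_{start_sec:06d}.mp4')  # abc123xyz00_000030.mp4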
mmaudio/data/extracted_audio.py ADDED
@@ -0,0 +1,88 @@
1
+ import logging
2
+ from pathlib import Path
3
+ from typing import Union
4
+
5
+ import pandas as pd
6
+ import torch
7
+ from tensordict import TensorDict
8
+ from torch.utils.data.dataset import Dataset
9
+
10
+ from mmaudio.utils.dist_utils import local_rank
11
+
12
+ log = logging.getLogger()
13
+
14
+
15
+ class ExtractedAudio(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ tsv_path: Union[str, Path],
20
+ *,
21
+ premade_mmap_dir: Union[str, Path],
22
+ data_dim: dict[str, int],
23
+ ):
24
+ super().__init__()
25
+
26
+ self.data_dim = data_dim
27
+ self.df_list = pd.read_csv(tsv_path, sep='\t').to_dict('records')
28
+ self.ids = [str(d['id']) for d in self.df_list]
29
+
30
+ log.info(f'Loading precomputed mmap from {premade_mmap_dir}')
31
+ # load precomputed memory mapped tensors
32
+ premade_mmap_dir = Path(premade_mmap_dir)
33
+ td = TensorDict.load_memmap(premade_mmap_dir)
34
+ log.info(f'Loaded precomputed mmap from {premade_mmap_dir}')
35
+ self.mean = td['mean']
36
+ self.std = td['std']
37
+ self.text_features = td['text_features']
38
+
39
+ log.info(f'Loaded {len(self)} samples from {premade_mmap_dir}.')
40
+ log.info(f'Loaded mean: {self.mean.shape}.')
41
+ log.info(f'Loaded std: {self.std.shape}.')
42
+ log.info(f'Loaded text features: {self.text_features.shape}.')
43
+
44
+ assert self.mean.shape[1] == self.data_dim['latent_seq_len'], \
45
+ f'{self.mean.shape[1]} != {self.data_dim["latent_seq_len"]}'
46
+ assert self.std.shape[1] == self.data_dim['latent_seq_len'], \
47
+ f'{self.std.shape[1]} != {self.data_dim["latent_seq_len"]}'
48
+
49
+ assert self.text_features.shape[1] == self.data_dim['text_seq_len'], \
50
+ f'{self.text_features.shape[1]} != {self.data_dim["text_seq_len"]}'
51
+ assert self.text_features.shape[-1] == self.data_dim['text_dim'], \
52
+ f'{self.text_features.shape[-1]} != {self.data_dim["text_dim"]}'
53
+
54
+ self.fake_clip_features = torch.zeros(self.data_dim['clip_seq_len'],
55
+ self.data_dim['clip_dim'])
56
+ self.fake_sync_features = torch.zeros(self.data_dim['sync_seq_len'],
57
+ self.data_dim['sync_dim'])
58
+ self.video_exist = torch.tensor(0, dtype=torch.bool)
59
+ self.text_exist = torch.tensor(1, dtype=torch.bool)
60
+
61
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
62
+ latents = self.mean
63
+ return latents.mean(dim=(0, 1)), latents.std(dim=(0, 1))
64
+
65
+ def get_memory_mapped_tensor(self) -> TensorDict:
66
+ td = TensorDict({
67
+ 'mean': self.mean,
68
+ 'std': self.std,
69
+ 'text_features': self.text_features,
70
+ })
71
+ return td
72
+
73
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
74
+ data = {
75
+ 'id': str(self.df_list[idx]['id']),
76
+ 'a_mean': self.mean[idx],
77
+ 'a_std': self.std[idx],
78
+ 'clip_features': self.fake_clip_features,
79
+ 'sync_features': self.fake_sync_features,
80
+ 'text_features': self.text_features[idx],
81
+ 'caption': self.df_list[idx]['caption'],
82
+ 'video_exist': self.video_exist,
83
+ 'text_exist': self.text_exist,
84
+ }
85
+ return data
86
+
87
+ def __len__(self):
88
+ return len(self.ids)
mmaudio/data/extracted_vgg.py ADDED
@@ -0,0 +1,101 @@
1
+ import logging
2
+ from pathlib import Path
3
+ from typing import Union
4
+
5
+ import pandas as pd
6
+ import torch
7
+ from tensordict import TensorDict
8
+ from torch.utils.data.dataset import Dataset
9
+
10
+ from mmaudio.utils.dist_utils import local_rank
11
+
12
+ log = logging.getLogger()
13
+
14
+
15
+ class ExtractedVGG(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ tsv_path: Union[str, Path],
20
+ *,
21
+ premade_mmap_dir: Union[str, Path],
22
+ data_dim: dict[str, int],
23
+ ):
24
+ super().__init__()
25
+
26
+ self.data_dim = data_dim
27
+ self.df_list = pd.read_csv(tsv_path, sep='\t').to_dict('records')
28
+ self.ids = [d['id'] for d in self.df_list]
29
+
30
+ log.info(f'Loading precomputed mmap from {premade_mmap_dir}')
31
+ # load precomputed memory mapped tensors
32
+ premade_mmap_dir = Path(premade_mmap_dir)
33
+ td = TensorDict.load_memmap(premade_mmap_dir)
34
+ log.info(f'Loaded precomputed mmap from {premade_mmap_dir}')
35
+ self.mean = td['mean']
36
+ self.std = td['std']
37
+ self.clip_features = td['clip_features']
38
+ self.sync_features = td['sync_features']
39
+ self.text_features = td['text_features']
40
+
41
+ if local_rank == 0:
42
+ log.info(f'Loaded {len(self)} samples.')
43
+ log.info(f'Loaded mean: {self.mean.shape}.')
44
+ log.info(f'Loaded std: {self.std.shape}.')
45
+ log.info(f'Loaded clip_features: {self.clip_features.shape}.')
46
+ log.info(f'Loaded sync_features: {self.sync_features.shape}.')
47
+ log.info(f'Loaded text_features: {self.text_features.shape}.')
48
+
49
+ assert self.mean.shape[1] == self.data_dim['latent_seq_len'], \
50
+ f'{self.mean.shape[1]} != {self.data_dim["latent_seq_len"]}'
51
+ assert self.std.shape[1] == self.data_dim['latent_seq_len'], \
52
+ f'{self.std.shape[1]} != {self.data_dim["latent_seq_len"]}'
53
+
54
+ assert self.clip_features.shape[1] == self.data_dim['clip_seq_len'], \
55
+ f'{self.clip_features.shape[1]} != {self.data_dim["clip_seq_len"]}'
56
+ assert self.sync_features.shape[1] == self.data_dim['sync_seq_len'], \
57
+ f'{self.sync_features.shape[1]} != {self.data_dim["sync_seq_len"]}'
58
+ assert self.text_features.shape[1] == self.data_dim['text_seq_len'], \
59
+ f'{self.text_features.shape[1]} != {self.data_dim["text_seq_len"]}'
60
+
61
+ assert self.clip_features.shape[-1] == self.data_dim['clip_dim'], \
62
+ f'{self.clip_features.shape[-1]} != {self.data_dim["clip_dim"]}'
63
+ assert self.sync_features.shape[-1] == self.data_dim['sync_dim'], \
64
+ f'{self.sync_features.shape[-1]} != {self.data_dim["sync_dim"]}'
65
+ assert self.text_features.shape[-1] == self.data_dim['text_dim'], \
66
+ f'{self.text_features.shape[-1]} != {self.data_dim["text_dim"]}'
67
+
68
+ self.video_exist = torch.tensor(1, dtype=torch.bool)
69
+ self.text_exist = torch.tensor(1, dtype=torch.bool)
70
+
71
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
72
+ latents = self.mean
73
+ return latents.mean(dim=(0, 1)), latents.std(dim=(0, 1))
74
+
75
+ def get_memory_mapped_tensor(self) -> TensorDict:
76
+ td = TensorDict({
77
+ 'mean': self.mean,
78
+ 'std': self.std,
79
+ 'clip_features': self.clip_features,
80
+ 'sync_features': self.sync_features,
81
+ 'text_features': self.text_features,
82
+ })
83
+ return td
84
+
85
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
86
+ data = {
87
+ 'id': self.df_list[idx]['id'],
88
+ 'a_mean': self.mean[idx],
89
+ 'a_std': self.std[idx],
90
+ 'clip_features': self.clip_features[idx],
91
+ 'sync_features': self.sync_features[idx],
92
+ 'text_features': self.text_features[idx],
93
+ 'caption': self.df_list[idx]['label'],
94
+ 'video_exist': self.video_exist,
95
+ 'text_exist': self.text_exist,
96
+ }
97
+
98
+ return data
99
+
100
+ def __len__(self):
101
+ return len(self.ids)
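For reference, a sketch (not part of the commit) of the on-disk layout ExtractedVGG consumes: a TensorDict memory-mapped to a directory with the five keys asserted in the constructor, alongside a TSV with id and label columns. The shapes below are stand-ins rather than the real data_dim values, and a tensordict version providing load_memmap/memmap_ is assumed.

# Sketch: build a tiny memmap directory that ExtractedVGG-style code could
# load back with TensorDict.load_memmap(); shapes are placeholders.
import torch
from tensordict import TensorDict

n = 4  # number of clips
td = TensorDict(
    {
        'mean': torch.zeros(n, 32, 8),          # (clips, latent_seq_len, latent_dim)
        'std': torch.ones(n, 32, 8),
        'clip_features': torch.zeros(n, 16, 1024),
        'sync_features': torch.zeros(n, 50, 768),
        'text_features': torch.zeros(n, 77, 1024),
    },
    batch_size=[n])
td.memmap_('example_memmap_dir')  # directory later read by load_memmap()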
mmaudio/data/extraction/__init__.py ADDED
File without changes
mmaudio/data/extraction/vgg_sound.py ADDED
@@ -0,0 +1,193 @@
1
+ import logging
2
+ import os
3
+ from pathlib import Path
4
+ from typing import Optional, Union
5
+
6
+ import pandas as pd
7
+ import torch
8
+ import torchaudio
9
+ from torch.utils.data.dataset import Dataset
10
+ from torchvision.transforms import v2
11
+ from torio.io import StreamingMediaDecoder
12
+
13
+ from mmaudio.utils.dist_utils import local_rank
14
+
15
+ log = logging.getLogger()
16
+
17
+ _CLIP_SIZE = 384
18
+ _CLIP_FPS = 8.0
19
+
20
+ _SYNC_SIZE = 224
21
+ _SYNC_FPS = 25.0
22
+
23
+
24
+ class VGGSound(Dataset):
25
+
26
+ def __init__(
27
+ self,
28
+ root: Union[str, Path],
29
+ *,
30
+ tsv_path: Union[str, Path] = 'sets/vgg3-train.tsv',
31
+ sample_rate: int = 16_000,
32
+ duration_sec: float = 8.0,
33
+ audio_samples: Optional[int] = None,
34
+ normalize_audio: bool = False,
35
+ ):
36
+ self.root = Path(root)
37
+ self.normalize_audio = normalize_audio
38
+ if audio_samples is None:
39
+ self.audio_samples = int(sample_rate * duration_sec)
40
+ else:
41
+ self.audio_samples = audio_samples
42
+ effective_duration = audio_samples / sample_rate
43
+ # make sure the duration is close enough, within 15ms
44
+ assert abs(effective_duration - duration_sec) < 0.015, \
45
+ f'audio_samples {audio_samples} does not match duration_sec {duration_sec}'
46
+
47
+ videos = sorted(os.listdir(self.root))
48
+ videos = set([Path(v).stem for v in videos]) # remove extensions
49
+ self.labels = {}
50
+ self.videos = []
51
+ missing_videos = []
52
+
53
+ # read the tsv for subset information
54
+ df_list = pd.read_csv(tsv_path, sep='\t', dtype={'id': str}).to_dict('records')
55
+ for record in df_list:
56
+ id = record['id']
57
+ label = record['label']
58
+ if id in videos:
59
+ self.labels[id] = label
60
+ self.videos.append(id)
61
+ else:
62
+ missing_videos.append(id)
63
+
64
+ if local_rank == 0:
65
+ log.info(f'{len(videos)} videos found in {root}')
66
+ log.info(f'{len(self.videos)} videos found in {tsv_path}')
67
+ log.info(f'{len(missing_videos)} videos missing in {root}')
68
+
69
+ self.sample_rate = sample_rate
70
+ self.duration_sec = duration_sec
71
+
72
+ self.expected_audio_length = audio_samples
73
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
74
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
75
+
76
+ self.clip_transform = v2.Compose([
77
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
78
+ v2.ToImage(),
79
+ v2.ToDtype(torch.float32, scale=True),
80
+ ])
81
+
82
+ self.sync_transform = v2.Compose([
83
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
84
+ v2.CenterCrop(_SYNC_SIZE),
85
+ v2.ToImage(),
86
+ v2.ToDtype(torch.float32, scale=True),
87
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
88
+ ])
89
+
90
+ self.resampler = {}
91
+
92
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
93
+ video_id = self.videos[idx]
94
+ label = self.labels[video_id]
95
+
96
+ reader = StreamingMediaDecoder(self.root / (video_id + '.mp4'))
97
+ reader.add_basic_video_stream(
98
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
99
+ frame_rate=_CLIP_FPS,
100
+ format='rgb24',
101
+ )
102
+ reader.add_basic_video_stream(
103
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
104
+ frame_rate=_SYNC_FPS,
105
+ format='rgb24',
106
+ )
107
+ reader.add_basic_audio_stream(frames_per_chunk=2**30, )
108
+
109
+ reader.fill_buffer()
110
+ data_chunk = reader.pop_chunks()
111
+
112
+ clip_chunk = data_chunk[0]
113
+ sync_chunk = data_chunk[1]
114
+ audio_chunk = data_chunk[2]
115
+
116
+ if clip_chunk is None:
117
+ raise RuntimeError(f'CLIP video returned None {video_id}')
118
+ if clip_chunk.shape[0] < self.clip_expected_length:
119
+ raise RuntimeError(
120
+ f'CLIP video too short {video_id}, expected {self.clip_expected_length}, got {clip_chunk.shape[0]}'
121
+ )
122
+
123
+ if sync_chunk is None:
124
+ raise RuntimeError(f'Sync video returned None {video_id}')
125
+ if sync_chunk.shape[0] < self.sync_expected_length:
126
+ raise RuntimeError(
127
+ f'Sync video too short {video_id}, expected {self.sync_expected_length}, got {sync_chunk.shape[0]}'
128
+ )
129
+
130
+ # process audio
131
+ sample_rate = int(reader.get_out_stream_info(2).sample_rate)
132
+ audio_chunk = audio_chunk.transpose(0, 1)
133
+ audio_chunk = audio_chunk.mean(dim=0) # mono
134
+ if self.normalize_audio:
135
+ abs_max = audio_chunk.abs().max()
136
+ if abs_max <= 1e-6:
137
+ raise RuntimeError(f'Audio is silent {video_id}')
138
+ audio_chunk = audio_chunk / abs_max * 0.95
139
+
140
+ # resample
141
+ if sample_rate == self.sample_rate:
142
+ audio_chunk = audio_chunk
143
+ else:
144
+ if sample_rate not in self.resampler:
145
+ # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
146
+ self.resampler[sample_rate] = torchaudio.transforms.Resample(
147
+ sample_rate,
148
+ self.sample_rate,
149
+ lowpass_filter_width=64,
150
+ rolloff=0.9475937167399596,
151
+ resampling_method='sinc_interp_kaiser',
152
+ beta=14.769656459379492,
153
+ )
154
+ audio_chunk = self.resampler[sample_rate](audio_chunk)
155
+
156
+ if audio_chunk.shape[0] < self.expected_audio_length:
157
+ raise RuntimeError(f'Audio too short {video_id}')
158
+ audio_chunk = audio_chunk[:self.expected_audio_length]
159
+
160
+ # truncate the video
161
+ clip_chunk = clip_chunk[:self.clip_expected_length]
162
+ if clip_chunk.shape[0] != self.clip_expected_length:
163
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
164
+ f'expected {self.clip_expected_length}, '
165
+ f'got {clip_chunk.shape[0]}')
166
+ clip_chunk = self.clip_transform(clip_chunk)
167
+
168
+ sync_chunk = sync_chunk[:self.sync_expected_length]
169
+ if sync_chunk.shape[0] != self.sync_expected_length:
170
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
171
+ f'expected {self.sync_expected_length}, '
172
+ f'got {sync_chunk.shape[0]}')
173
+ sync_chunk = self.sync_transform(sync_chunk)
174
+
175
+ data = {
176
+ 'id': video_id,
177
+ 'caption': label,
178
+ 'audio': audio_chunk,
179
+ 'clip_video': clip_chunk,
180
+ 'sync_video': sync_chunk,
181
+ }
182
+
183
+ return data
184
+
185
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
186
+ try:
187
+ return self.sample(idx)
188
+ except Exception as e:
189
+ log.error(f'Error loading video {self.videos[idx]}: {e}')
190
+ return None
191
+
192
+ def __len__(self):
193
+ return len(self.labels)
mmaudio/data/extraction/wav_dataset.py ADDED
@@ -0,0 +1,132 @@
1
+ import logging
2
+ import os
3
+ from pathlib import Path
4
+ from typing import Union
5
+
6
+ import open_clip
7
+ import pandas as pd
8
+ import torch
9
+ import torchaudio
10
+ from torch.utils.data.dataset import Dataset
11
+
12
+ log = logging.getLogger()
13
+
14
+
15
+ class WavTextClipsDataset(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ root: Union[str, Path],
20
+ *,
21
+ captions_tsv: Union[str, Path],
22
+ clips_tsv: Union[str, Path],
23
+ sample_rate: int,
24
+ num_samples: int,
25
+ normalize_audio: bool = False,
26
+ reject_silent: bool = False,
27
+ tokenizer_id: str = 'ViT-H-14-378-quickgelu',
28
+ ):
29
+ self.root = Path(root)
30
+ self.sample_rate = sample_rate
31
+ self.num_samples = num_samples
32
+ self.normalize_audio = normalize_audio
33
+ self.reject_silent = reject_silent
34
+ self.tokenizer = open_clip.get_tokenizer(tokenizer_id)
35
+
36
+ audios = sorted(os.listdir(self.root))
37
+ audios = set([
38
+ Path(audio).stem for audio in audios
39
+ if audio.endswith('.wav') or audio.endswith('.flac')
40
+ ])
41
+ self.captions = {}
42
+
43
+ # read the caption tsv
44
+ df_list = pd.read_csv(captions_tsv, sep='\t', dtype={'id': str}).to_dict('records')
45
+ for record in df_list:
46
+ id = record['id']
47
+ caption = record['caption']
48
+ self.captions[id] = caption
49
+
50
+ # read the clip tsv
51
+ df_list = pd.read_csv(clips_tsv, sep='\t', dtype={
52
+ 'id': str,
53
+ 'name': str
54
+ }).to_dict('records')
55
+ self.clips = []
56
+ for record in df_list:
57
+ record['id'] = record['id']
58
+ record['name'] = record['name']
59
+ id = record['id']
60
+ name = record['name']
61
+ if name not in self.captions:
62
+ log.warning(f'Audio {name} not found in {captions_tsv}')
63
+ continue
64
+ record['caption'] = self.captions[name]
65
+ self.clips.append(record)
66
+
67
+ log.info(f'Found {len(self.clips)} audio files in {self.root}')
68
+
69
+ self.resampler = {}
70
+
71
+ def __getitem__(self, idx: int) -> torch.Tensor:
72
+ try:
73
+ clip = self.clips[idx]
74
+ audio_name = clip['name']
75
+ audio_id = clip['id']
76
+ caption = clip['caption']
77
+ start_sample = clip['start_sample']
78
+ end_sample = clip['end_sample']
79
+
80
+ audio_path = self.root / f'{audio_name}.flac'
81
+ if not audio_path.exists():
82
+ audio_path = self.root / f'{audio_name}.wav'
83
+ assert audio_path.exists()
84
+
85
+ audio_chunk, sample_rate = torchaudio.load(audio_path)
86
+ audio_chunk = audio_chunk.mean(dim=0) # mono
87
+ abs_max = audio_chunk.abs().max()
88
+ if self.normalize_audio:
89
+ audio_chunk = audio_chunk / abs_max * 0.95
90
+
91
+ if self.reject_silent and abs_max < 1e-6:
92
+ log.warning(f'Rejecting silent audio')
93
+ return None
94
+
95
+ audio_chunk = audio_chunk[start_sample:end_sample]
96
+
97
+ # resample
98
+ if sample_rate == self.sample_rate:
99
+ audio_chunk = audio_chunk
100
+ else:
101
+ if sample_rate not in self.resampler:
102
+ # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
103
+ self.resampler[sample_rate] = torchaudio.transforms.Resample(
104
+ sample_rate,
105
+ self.sample_rate,
106
+ lowpass_filter_width=64,
107
+ rolloff=0.9475937167399596,
108
+ resampling_method='sinc_interp_kaiser',
109
+ beta=14.769656459379492,
110
+ )
111
+ audio_chunk = self.resampler[sample_rate](audio_chunk)
112
+
113
+ if audio_chunk.shape[0] < self.num_samples:
114
+ raise ValueError('Audio is too short')
115
+ audio_chunk = audio_chunk[:self.num_samples]
116
+
117
+ tokens = self.tokenizer([caption])[0]
118
+
119
+ output = {
120
+ 'waveform': audio_chunk,
121
+ 'id': audio_id,
122
+ 'caption': caption,
123
+ 'tokens': tokens,
124
+ }
125
+
126
+ return output
127
+ except Exception as e:
128
+ log.error(f'Error reading {audio_path}: {e}')
129
+ return None
130
+
131
+ def __len__(self):
132
+ return len(self.clips)
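For reference, a sketch (not part of the commit) of the two TSVs WavTextClipsDataset reads: a captions table keyed by id/caption and a clips table with id, name, start_sample and end_sample, where name must match both a caption id and a .wav/.flac file under the dataset root. Paths and numbers are illustrative (128 000 samples is 8 s at 16 kHz).

# Sketch: minimal captions.tsv / clips.tsv pair matching the columns read above.
import pandas as pd

pd.DataFrame([
    {'id': 'dog_bark_001', 'caption': 'a dog barking twice'},
]).to_csv('captions.tsv', sep='\t', index=False)

pd.DataFrame([
    {'id': 'dog_bark_001_clip0', 'name': 'dog_bark_001',
     'start_sample': 0, 'end_sample': 128_000},
]).to_csv('clips.tsv', sep='\t', index=False)
# The dataset then looks for dog_bark_001.flac (or .wav) under its root.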
mmaudio/data/mm_dataset.py ADDED
@@ -0,0 +1,45 @@
1
+ import bisect
2
+
3
+ import torch
4
+ from torch.utils.data.dataset import Dataset
5
+
6
+
7
+ # modified from https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#ConcatDataset
8
+ class MultiModalDataset(Dataset):
9
+ datasets: list[Dataset]
10
+ cumulative_sizes: list[int]
11
+
12
+ @staticmethod
13
+ def cumsum(sequence):
14
+ r, s = [], 0
15
+ for e in sequence:
16
+ l = len(e)
17
+ r.append(l + s)
18
+ s += l
19
+ return r
20
+
21
+ def __init__(self, video_datasets: list[Dataset], audio_datasets: list[Dataset]):
22
+ super().__init__()
23
+ self.video_datasets = list(video_datasets)
24
+ self.audio_datasets = list(audio_datasets)
25
+ self.datasets = self.video_datasets + self.audio_datasets
26
+
27
+ self.cumulative_sizes = self.cumsum(self.datasets)
28
+
29
+ def __len__(self):
30
+ return self.cumulative_sizes[-1]
31
+
32
+ def __getitem__(self, idx):
33
+ if idx < 0:
34
+ if -idx > len(self):
35
+ raise ValueError("absolute value of index should not exceed dataset length")
36
+ idx = len(self) + idx
37
+ dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
38
+ if dataset_idx == 0:
39
+ sample_idx = idx
40
+ else:
41
+ sample_idx = idx - self.cumulative_sizes[dataset_idx - 1]
42
+ return self.datasets[dataset_idx][sample_idx]
43
+
44
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
45
+ return self.video_datasets[0].compute_latent_stats()
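The indexing above follows torch's ConcatDataset: cumulative sizes plus bisect_right map a global index to a (dataset, local index) pair. A small self-contained sketch of that mapping (not part of the commit):

# Sketch: the cumsum + bisect_right index mapping used by MultiModalDataset.
import bisect

sizes = [5, 3, 4]          # e.g. one video dataset and two audio datasets
cumulative, running = [], 0
for n in sizes:
    running += n
    cumulative.append(running)          # -> [5, 8, 12]

for idx in (0, 4, 5, 11):
    dataset_idx = bisect.bisect_right(cumulative, idx)
    local_idx = idx if dataset_idx == 0 else idx - cumulative[dataset_idx - 1]
    print(f'global {idx} -> dataset {dataset_idx}, local {local_idx}')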
mmaudio/data/utils.py ADDED
@@ -0,0 +1,148 @@
1
+ import logging
2
+ import os
3
+ import random
4
+ import tempfile
5
+ from pathlib import Path
6
+ from typing import Any, Optional, Union
7
+
8
+ import torch
9
+ import torch.distributed as dist
10
+ from tensordict import MemoryMappedTensor
11
+ from torch.utils.data import DataLoader
12
+ from torch.utils.data.dataset import Dataset
13
+ from tqdm import tqdm
14
+
15
+ from mmaudio.utils.dist_utils import local_rank, world_size
16
+
17
+ scratch_path = Path(os.environ['SLURM_SCRATCH'] if 'SLURM_SCRATCH' in os.environ else '/dev/shm')
18
+ shm_path = Path('/dev/shm')
19
+
20
+ log = logging.getLogger()
21
+
22
+
23
+ def reseed(seed):
24
+ random.seed(seed)
25
+ torch.manual_seed(seed)
26
+
27
+
28
+ def local_scatter_torch(obj: Optional[Any]):
29
+ if world_size == 1:
30
+ # Just one worker. Do nothing.
31
+ return obj
32
+
33
+ array = [obj] * world_size
34
+ target_array = [None]
35
+ if local_rank == 0:
36
+ dist.scatter_object_list(target_array, scatter_object_input_list=array, src=0)
37
+ else:
38
+ dist.scatter_object_list(target_array, scatter_object_input_list=None, src=0)
39
+ return target_array[0]
40
+
41
+
42
+ class ShardDataset(Dataset):
43
+
44
+ def __init__(self, root):
45
+ self.root = root
46
+ self.shards = sorted(os.listdir(root))
47
+
48
+ def __len__(self):
49
+ return len(self.shards)
50
+
51
+ def __getitem__(self, idx):
52
+ return torch.load(os.path.join(self.root, self.shards[idx]), weights_only=True)
53
+
54
+
55
+ def get_tmp_dir(in_memory: bool) -> Path:
56
+ return shm_path if in_memory else scratch_path
57
+
58
+
59
+ def load_shards_and_share(data_path: Union[str, Path], ids: list[int],
60
+ in_memory: bool) -> MemoryMappedTensor:
61
+ if local_rank == 0:
62
+ with tempfile.NamedTemporaryFile(prefix='shared-tensor-', dir=get_tmp_dir(in_memory)) as f:
63
+ log.info(f'Loading shards from {data_path} into {f.name}...')
64
+ data = load_shards(data_path, ids=ids, tmp_file_path=f.name)
65
+ data = share_tensor_to_all(data)
66
+ torch.distributed.barrier()
67
+ f.close() # why does the context manager not close the file for me?
68
+ else:
69
+ log.info('Waiting for the data to be shared with me...')
70
+ data = share_tensor_to_all(None)
71
+ torch.distributed.barrier()
72
+
73
+ return data
74
+
75
+
76
+ def load_shards(
77
+ data_path: Union[str, Path],
78
+ ids: list[int],
79
+ *,
80
+ tmp_file_path: str,
81
+ ) -> Union[torch.Tensor, dict[str, torch.Tensor]]:
82
+
83
+ id_set = set(ids)
84
+ shards = sorted(os.listdir(data_path))
85
+ log.info(f'Found {len(shards)} shards in {data_path}.')
86
+ first_shard = torch.load(os.path.join(data_path, shards[0]), weights_only=True)
87
+
88
+ log.info(f'Rank {local_rank} created file {tmp_file_path}')
89
+ first_item = next(iter(first_shard.values()))
90
+ log.info(f'First item shape: {first_item.shape}')
91
+ mm_tensor = MemoryMappedTensor.empty(shape=(len(ids), *first_item.shape),
92
+ dtype=torch.float32,
93
+ filename=tmp_file_path,
94
+ existsok=True)
95
+ total_count = 0
96
+ used_index = set()
97
+ id_indexing = {i: idx for idx, i in enumerate(ids)}
98
+ # faster with no workers; otherwise we need to set_sharing_strategy('file_system')
99
+ loader = DataLoader(ShardDataset(data_path), batch_size=1, num_workers=0)
100
+ for data in tqdm(loader, desc='Loading shards'):
101
+ for i, v in data.items():
102
+ if i not in id_set:
103
+ continue
104
+
105
+ # tensor_index = ids.index(i)
106
+ tensor_index = id_indexing[i]
107
+ if tensor_index in used_index:
108
+ raise ValueError(f'Duplicate id {i} found in {data_path}.')
109
+ used_index.add(tensor_index)
110
+ mm_tensor[tensor_index] = v
111
+ total_count += 1
112
+
113
+ assert total_count == len(ids), f'Expected {len(ids)} tensors, got {total_count}.'
114
+ log.info(f'Loaded {total_count} tensors from {data_path}.')
115
+
116
+ return mm_tensor
117
+
118
+
119
+ def share_tensor_to_all(x: Optional[MemoryMappedTensor]) -> MemoryMappedTensor:
120
+ """
121
+ x: the tensor to be shared; None if local_rank != 0
122
+ return: the shared tensor
123
+ """
124
+
125
+ # there is no need to share your stuff with anyone if you are alone; must be in memory
126
+ if world_size == 1:
127
+ return x
128
+
129
+ if local_rank == 0:
130
+ assert x is not None, 'x must not be None if local_rank == 0'
131
+ else:
132
+ assert x is None, 'x must be None if local_rank != 0'
133
+
134
+ if local_rank == 0:
135
+ filename = x.filename
136
+ meta_information = (filename, x.shape, x.dtype)
137
+ else:
138
+ meta_information = None
139
+
140
+ filename, data_shape, data_type = local_scatter_torch(meta_information)
141
+ if local_rank == 0:
142
+ data = x
143
+ else:
144
+ data = MemoryMappedTensor.from_filename(filename=filename,
145
+ dtype=data_type,
146
+ shape=data_shape)
147
+
148
+ return data
mmaudio/eval_utils.py ADDED
@@ -0,0 +1,255 @@
1
+ import dataclasses
2
+ import logging
3
+ from pathlib import Path
4
+ from typing import Optional, Tuple, List, Dict
5
+
6
+ import numpy as np
7
+ import torch
8
+ from colorlog import ColoredFormatter
9
+ from PIL import Image
10
+ from torchvision.transforms import v2
11
+
12
+ from mmaudio.data.av_utils import ImageInfo, VideoInfo, read_frames, reencode_with_audio
13
+ from mmaudio.model.flow_matching import FlowMatching
14
+ from mmaudio.model.networks import MMAudio
15
+ from mmaudio.model.sequence_config import CONFIG_16K, CONFIG_44K, SequenceConfig
16
+ from mmaudio.model.utils.features_utils import FeaturesUtils
17
+ from mmaudio.utils.download_utils import download_model_if_needed
18
+
19
+ log = logging.getLogger()
20
+
21
+
22
+ @dataclasses.dataclass
23
+ class ModelConfig:
24
+ model_name: str
25
+ model_path: Path
26
+ vae_path: Path
27
+ bigvgan_16k_path: Optional[Path]
28
+ mode: str
29
+ synchformer_ckpt: Path = Path('./pretrained/v2a/mmaudio/ext_weights/synchformer_state_dict.pth')
30
+
31
+ @property
32
+ def seq_cfg(self) -> SequenceConfig:
33
+ if self.mode == '16k':
34
+ return CONFIG_16K
35
+ elif self.mode == '44k':
36
+ return CONFIG_44K
37
+
38
+ def download_if_needed(self):
39
+ download_model_if_needed(self.model_path)
40
+ download_model_if_needed(self.vae_path)
41
+ if self.bigvgan_16k_path is not None:
42
+ download_model_if_needed(self.bigvgan_16k_path)
43
+ download_model_if_needed(self.synchformer_ckpt)
44
+
45
+
46
+ small_16k = ModelConfig(model_name='small_16k',
47
+ model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_small_16k.pth'),
48
+ vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-16.pth'),
49
+ bigvgan_16k_path=Path('./pretrained/v2a/mmaudio/ext_weights/best_netG.pt'),
50
+ mode='16k')
51
+ small_44k = ModelConfig(model_name='small_44k',
52
+ model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_small_44k.pth'),
53
+ vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-44.pth'),
54
+ bigvgan_16k_path=None,
55
+ mode='44k')
56
+ medium_44k = ModelConfig(model_name='medium_44k',
57
+ model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_medium_44k.pth'),
58
+ vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-44.pth'),
59
+ bigvgan_16k_path=None,
60
+ mode='44k')
61
+ large_44k = ModelConfig(model_name='large_44k',
62
+ model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_large_44k.pth'),
63
+ vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-44.pth'),
64
+ bigvgan_16k_path=None,
65
+ mode='44k')
66
+ large_44k_v2 = ModelConfig(model_name='large_44k_v2',
67
+ model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_large_44k_v2.pth'),
68
+ vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-44.pth'),
69
+ bigvgan_16k_path=None,
70
+ mode='44k')
71
+ all_model_cfg: Dict[str, ModelConfig] = {
72
+ 'small_16k': small_16k,
73
+ 'small_44k': small_44k,
74
+ 'medium_44k': medium_44k,
75
+ 'large_44k': large_44k,
76
+ 'large_44k_v2': large_44k_v2,
77
+ }
78
+
79
+
80
+ def generate(
81
+ clip_video: Optional[torch.Tensor],
82
+ sync_video: Optional[torch.Tensor],
83
+ text: Optional[List[str]],
84
+ *,
85
+ negative_text: Optional[List[str]] = None,
86
+ feature_utils: FeaturesUtils,
87
+ net: MMAudio,
88
+ fm: FlowMatching,
89
+ rng: torch.Generator,
90
+ cfg_strength: float,
91
+ clip_batch_size_multiplier: int = 40,
92
+ sync_batch_size_multiplier: int = 40,
93
+ image_input: bool = False,
94
+ ) -> torch.Tensor:
95
+ device = feature_utils.device
96
+ dtype = feature_utils.dtype
97
+
98
+ bs = len(text)
99
+ if clip_video is not None:
100
+ clip_video = clip_video.to(device, dtype, non_blocking=True)
101
+ clip_features = feature_utils.encode_video_with_clip(clip_video,
102
+ batch_size=bs *
103
+ clip_batch_size_multiplier)
104
+ if image_input:
105
+ clip_features = clip_features.expand(-1, net.clip_seq_len, -1)
106
+ else:
107
+ clip_features = net.get_empty_clip_sequence(bs)
108
+
109
+ if sync_video is not None and not image_input:
110
+ sync_video = sync_video.to(device, dtype, non_blocking=True)
111
+ sync_features = feature_utils.encode_video_with_sync(sync_video,
112
+ batch_size=bs *
113
+ sync_batch_size_multiplier)
114
+ else:
115
+ sync_features = net.get_empty_sync_sequence(bs)
116
+
117
+ if text is not None:
118
+ text_features = feature_utils.encode_text(text)
119
+ else:
120
+ text_features = net.get_empty_string_sequence(bs)
121
+
122
+ if negative_text is not None:
123
+ assert len(negative_text) == bs
124
+ negative_text_features = feature_utils.encode_text(negative_text)
125
+ else:
126
+ negative_text_features = net.get_empty_string_sequence(bs)
127
+
128
+ x0 = torch.randn(bs,
129
+ net.latent_seq_len,
130
+ net.latent_dim,
131
+ device=device,
132
+ dtype=dtype,
133
+ generator=rng)
134
+ preprocessed_conditions = net.preprocess_conditions(clip_features, sync_features, text_features)
135
+ empty_conditions = net.get_empty_conditions(
136
+ bs, negative_text_features=negative_text_features if negative_text is not None else None)
137
+
138
+ cfg_ode_wrapper = lambda t, x: net.ode_wrapper(t, x, preprocessed_conditions, empty_conditions,
139
+ cfg_strength)
140
+ x1 = fm.to_data(cfg_ode_wrapper, x0)
141
+ x1 = net.unnormalize(x1)
142
+ spec = feature_utils.decode(x1)
143
+ audio = feature_utils.vocode(spec)
144
+ return audio
145
+
146
+
147
+ LOGFORMAT = "[%(log_color)s%(levelname)-8s%(reset)s]: %(log_color)s%(message)s%(reset)s"
148
+
149
+
150
+ def setup_eval_logging(log_level: int = logging.INFO):
151
+ logging.root.setLevel(log_level)
152
+ formatter = ColoredFormatter(LOGFORMAT)
153
+ stream = logging.StreamHandler()
154
+ stream.setLevel(log_level)
155
+ stream.setFormatter(formatter)
156
+ log = logging.getLogger()
157
+ log.setLevel(log_level)
158
+ log.addHandler(stream)
159
+
160
+
161
+ _CLIP_SIZE = 384
162
+ _CLIP_FPS = 8.0
163
+
164
+ _SYNC_SIZE = 224
165
+ _SYNC_FPS = 25.0
166
+
167
+
168
+ def load_video(video_path: Path, duration_sec: float, load_all_frames: bool = True) -> VideoInfo:
169
+
170
+ clip_transform = v2.Compose([
171
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
172
+ v2.ToImage(),
173
+ v2.ToDtype(torch.float32, scale=True),
174
+ ])
175
+
176
+ sync_transform = v2.Compose([
177
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
178
+ v2.CenterCrop(_SYNC_SIZE),
179
+ v2.ToImage(),
180
+ v2.ToDtype(torch.float32, scale=True),
181
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
182
+ ])
183
+
184
+ output_frames, all_frames, orig_fps = read_frames(video_path,
185
+ list_of_fps=[_CLIP_FPS, _SYNC_FPS],
186
+ start_sec=0,
187
+ end_sec=duration_sec,
188
+ need_all_frames=load_all_frames)
189
+
190
+ clip_chunk, sync_chunk = output_frames
191
+ clip_chunk = torch.from_numpy(clip_chunk).permute(0, 3, 1, 2)
192
+ sync_chunk = torch.from_numpy(sync_chunk).permute(0, 3, 1, 2)
193
+
194
+ clip_frames = clip_transform(clip_chunk)
195
+ sync_frames = sync_transform(sync_chunk)
196
+
197
+ clip_length_sec = clip_frames.shape[0] / _CLIP_FPS
198
+ sync_length_sec = sync_frames.shape[0] / _SYNC_FPS
199
+
200
+ if clip_length_sec < duration_sec:
201
+ log.warning(f'Clip video is too short: {clip_length_sec:.2f} < {duration_sec:.2f}')
202
+ log.warning(f'Truncating to {clip_length_sec:.2f} sec')
203
+ duration_sec = clip_length_sec
204
+
205
+ if sync_length_sec < duration_sec:
206
+ log.warning(f'Sync video is too short: {sync_length_sec:.2f} < {duration_sec:.2f}')
207
+ log.warning(f'Truncating to {sync_length_sec:.2f} sec')
208
+ duration_sec = sync_length_sec
209
+
210
+ clip_frames = clip_frames[:int(_CLIP_FPS * duration_sec)]
211
+ sync_frames = sync_frames[:int(_SYNC_FPS * duration_sec)]
212
+
213
+ video_info = VideoInfo(
214
+ duration_sec=duration_sec,
215
+ fps=orig_fps,
216
+ clip_frames=clip_frames,
217
+ sync_frames=sync_frames,
218
+ all_frames=all_frames if load_all_frames else None,
219
+ )
220
+ return video_info
221
+
222
+
223
+ def load_image(image_path: Path) -> ImageInfo:
224
+ clip_transform = v2.Compose([
225
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
226
+ v2.ToImage(),
227
+ v2.ToDtype(torch.float32, scale=True),
228
+ ])
229
+
230
+ sync_transform = v2.Compose([
231
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
232
+ v2.CenterCrop(_SYNC_SIZE),
233
+ v2.ToImage(),
234
+ v2.ToDtype(torch.float32, scale=True),
235
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
236
+ ])
237
+
238
+ frame = np.array(Image.open(image_path))
239
+
240
+ clip_chunk = torch.from_numpy(frame).unsqueeze(0).permute(0, 3, 1, 2)
241
+ sync_chunk = torch.from_numpy(frame).unsqueeze(0).permute(0, 3, 1, 2)
242
+
243
+ clip_frames = clip_transform(clip_chunk)
244
+ sync_frames = sync_transform(sync_chunk)
245
+
246
+ video_info = ImageInfo(
247
+ clip_frames=clip_frames,
248
+ sync_frames=sync_frames,
249
+ original_frame=frame,
250
+ )
251
+ return video_info
252
+
253
+
254
+ def make_video(video_info: VideoInfo, output_path: Path, audio: torch.Tensor, sampling_rate: int):
255
+ reencode_with_audio(video_info, output_path, audio, sampling_rate)
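Closing the loop on the image path (not part of the commit): load_image above returns an ImageInfo, and VideoInfo.from_image_info from av_utils.py earlier in this commit turns it into a constant-frame video so the same muxing code can be reused. A minimal sketch, assuming an RGB image at a placeholder path:

# Sketch: image -> ImageInfo -> VideoInfo, using only helpers defined in this
# commit; 'photo.jpg' is a placeholder for any RGB image on disk.
from fractions import Fraction
from pathlib import Path

from mmaudio.data.av_utils import VideoInfo
from mmaudio.eval_utils import load_image, setup_eval_logging

setup_eval_logging()
image_info = load_image(Path('photo.jpg'))
video_info = VideoInfo.from_image_info(image_info, duration_sec=8.0, fps=Fraction(25))
print(video_info.width, video_info.height, len(video_info.all_frames))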