lym0302 committed
Commit 1fd4e9c · 1 Parent(s): 9d9a9d8
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +90 -86
  2. app.py +138 -251
  3. demo.py +0 -135
  4. docs/images/icon.png +0 -0
  5. docs/index.html +0 -147
  6. docs/style.css +0 -78
  7. docs/style_videos.css +0 -52
  8. docs/video_gen.html +0 -254
  9. docs/video_main.html +0 -98
  10. docs/video_vgg.html +0 -452
  11. {mmaudio → pipeline}/__init__.py +0 -0
  12. pipeline/__pycache__/__init__.cpython-310.pyc +0 -0
  13. pipeline/__pycache__/__init__.cpython-38.pyc +0 -0
  14. pipeline/__pycache__/pipeline.cpython-310.pyc +0 -0
  15. pipeline/__pycache__/pipeline.cpython-38.pyc +0 -0
  16. pipeline/__pycache__/step0.cpython-310.pyc +0 -0
  17. pipeline/__pycache__/step0.cpython-38.pyc +0 -0
  18. pipeline/__pycache__/step1.cpython-310.pyc +0 -0
  19. pipeline/__pycache__/step1.cpython-38.pyc +0 -0
  20. pipeline/__pycache__/step2.cpython-310.pyc +0 -0
  21. pipeline/__pycache__/step2.cpython-38.pyc +0 -0
  22. pipeline/__pycache__/step3.cpython-310.pyc +0 -0
  23. pipeline/__pycache__/step3.cpython-38.pyc +0 -0
  24. pipeline/__pycache__/step4.cpython-310.pyc +0 -0
  25. pipeline/__pycache__/step4.cpython-38.pyc +0 -0
  26. pipeline/pipeline.py +175 -0
  27. pipeline/step0.py +39 -0
  28. pipeline/step1.py +36 -0
  29. pipeline/step2.py +52 -0
  30. pipeline/step3.py +129 -0
  31. pipeline/step4.py +31 -0
  32. pyproject.toml +0 -52
  33. requirements.txt.bak +0 -27
  34. third_party/MMAudio/.gitignore +146 -0
  35. third_party/MMAudio/LICENSE +21 -0
  36. {mmaudio/data → third_party/MMAudio/mmaudio}/__init__.py +0 -0
  37. {mmaudio/ext/bigvgan_v2 → third_party/MMAudio/mmaudio/data}/__init__.py +0 -0
  38. {mmaudio → third_party/MMAudio/mmaudio}/data/av_utils.py +30 -4
  39. third_party/MMAudio/mmaudio/data/data_setup.py +174 -0
  40. {mmaudio/ext/bigvgan_v2/alias_free_activation/cuda → third_party/MMAudio/mmaudio/data/eval}/__init__.py +0 -0
  41. third_party/MMAudio/mmaudio/data/eval/audiocaps.py +39 -0
  42. third_party/MMAudio/mmaudio/data/eval/moviegen.py +131 -0
  43. third_party/MMAudio/mmaudio/data/eval/video_dataset.py +197 -0
  44. third_party/MMAudio/mmaudio/data/extracted_audio.py +88 -0
  45. third_party/MMAudio/mmaudio/data/extracted_vgg.py +101 -0
  46. {mmaudio/model → third_party/MMAudio/mmaudio/data/extraction}/__init__.py +0 -0
  47. third_party/MMAudio/mmaudio/data/extraction/vgg_sound.py +193 -0
  48. third_party/MMAudio/mmaudio/data/extraction/wav_dataset.py +132 -0
  49. third_party/MMAudio/mmaudio/data/mm_dataset.py +45 -0
  50. third_party/MMAudio/mmaudio/data/utils.py +148 -0
README.md CHANGED
@@ -1,6 +1,5 @@
1
  ---
2
  title: DeepSound-V1
3
- emoji: 🔊
4
  colorFrom: blue
5
  colorTo: indigo
6
  sdk: gradio
@@ -9,155 +8,160 @@ pinned: false
9
  ---
10
 
11
 
12
- # [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)
 
13
 
14
- [Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)
15
 
16
- University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation
 
 
 
 
 
 
17
 
 
18
 
19
- [[Paper (being prepared)]](https://hkchengrex.github.io/MMAudio) [[Project Page]](https://hkchengrex.github.io/MMAudio)
20
 
 
21
 
22
- **Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added. Code is subject to non-backward-compatible changes.**
23
 
24
  ## Highlight
25
 
26
- MMAudio generates synchronized audio given video and/or text inputs.
27
- Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.
28
- Moreover, a synchronization module aligns the generated audio with the video frames.
29
 
30
-
31
- ## Results
32
 
33
  (All audio from our algorithm MMAudio)
34
 
35
- Videos from Sora:
36
 
37
  https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330
38
 
 
 
 
39
 
40
- Videos from MovieGen/Hunyuan Video/VGGSound:
41
 
42
  https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca
43
 
44
- For more results, visit https://hkchengrex.com/MMAudio/video_main.html.
 
45
 
46
  ## Installation
47
 
48
- We have only tested this on Ubuntu.
49
 
50
  ### Prerequisites
51
 
52
  We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.
53
 
54
- - Python 3.8+
55
- - PyTorch **2.5.1+** and corresponding torchvision/torchaudio (pick your CUDA version https://pytorch.org/)
56
- - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`)
57
 
58
- **Clone our repository:**
59
 
60
  ```bash
61
- git clone https://github.com/hkchengrex/MMAudio.git
62
  ```
63
 
64
- **Install with pip:**
65
 
66
- ```bash
67
- cd MMAudio
68
- pip install -e .
69
  ```
 
70
 
71
- (If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
72
-
73
- **Pretrained models:**
74
-
75
- The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`
76
-
77
- | Model | Download link | File size |
78
- | -------- | ------- | ------- |
79
- | Flow prediction network, small 16kHz | <a href="https://databank.illinois.edu/datafiles/k6jve/download" download="mmaudio_small_16k.pth">mmaudio_small_16k.pth</a> | 601M |
80
- | Flow prediction network, small 44.1kHz | <a href="https://databank.illinois.edu/datafiles/864ya/download" download="mmaudio_small_44k.pth">mmaudio_small_44k.pth</a> | 601M |
81
- | Flow prediction network, medium 44.1kHz | <a href="https://databank.illinois.edu/datafiles/pa94t/download" download="mmaudio_medium_44k.pth">mmaudio_medium_44k.pth</a> | 2.4G |
82
- | Flow prediction network, large 44.1kHz **(recommended)** | <a href="https://databank.illinois.edu/datafiles/4jx76/download" download="mmaudio_large_44k.pth">mmaudio_large_44k.pth</a> | 3.9G |
83
- | 16kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-16.pth">v1-16.pth</a> | 655M |
84
- | 16kHz BigVGAN vocoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/best_netG.pt">best_netG.pt</a> | 429M |
85
- | 44.1kHz VAE |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth">v1-44.pth</a> | 1.2G |
86
- | Synchformer visual encoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth">synchformer_state_dict.pth</a> | 907M |
87
-
88
- The 44.1kHz vocoder will be downloaded automatically.
89
-
90
- The expected directory structure (full):
91
 
92
  ```bash
93
- MMAudio
94
- ├── ext_weights
95
- │ ├── best_netG.pt
96
- │ ├── synchformer_state_dict.pth
97
- │ ├── v1-16.pth
98
- │ └── v1-44.pth
99
- ├── weights
100
- │ ├── mmaudio_small_16k.pth
101
- │ ├── mmaudio_small_44k.pth
102
- │ ├── mmaudio_medium_44k.pth
103
- │ └── mmaudio_large_44k.pth
104
- └── ...
105
  ```
106
 
107
- The expected directory structure (minimal, for the recommended model only):
108
 
109
  ```bash
110
- MMAudio
111
- ├── ext_weights
112
- │ ├── synchformer_state_dict.pth
113
- │ └── v1-44.pth
114
- ├── weights
115
- │ └── mmaudio_large_44k.pth
116
- └── ...
117
  ```
118
 
 
 
 
 
 
 
 
119
  ## Demo
120
 
121
- By default, these scripts use the `large_44k` model.
122
- In our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode) which should fit in most modern GPUs.
123
 
124
  ### Command-line interface
125
 
126
  With `demo.py`
 
127
  ```bash
128
- python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
129
  ```
130
- The output (audio in `.flac` format, and video in `.mp4` format) will be saved in `./output`.
 
 
 
131
  See the file for more options.
132
  Simply omit the `--video` option for text-to-audio synthesis.
133
- The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in a lower quality.
134
 
135
-
136
- ### Gradio interface
137
 
138
  Supports video-to-audio and text-to-audio synthesis.
 
 
139
 
140
- ```
141
  python gradio_demo.py
142
- ```
143
 
144
- ### Known limitations
145
 
146
- 1. The model sometimes generates undesired unintelligible human speech-like sounds
147
- 2. The model sometimes generates undesired background music
148
- 3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfires" but not "RPG firing".
149
 
150
- We believe all of these three limitations can be addressed with more high-quality training data.
 
 
151
 
152
- ## Training
153
- Work in progress.
154
 
155
- ## Evaluation
156
- Work in progress.
157
 
158
  ## Acknowledgement
159
- Many thanks to:
160
- - [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model
161
- - [BigVGAN](https://github.com/NVIDIA/BigVGAN)
162
- - [Synchformer](https://github.com/v-iashin/Synchformer)
163
 
 
 
 
 
 
 
1
  ---
2
  title: DeepSound-V1
 
3
  colorFrom: blue
4
  colorTo: indigo
5
  sdk: gradio
 
8
  ---
9
 
10
 
11
+ <!-- # DeepSound-V1
12
+ Official code for DeepSound-V1 -->
13
 
 
14
 
15
+ <div align="center">
16
+ <p align="center">
17
+ <h2>DeepSound-V1</h2>
18
+ <!-- <a href="https://arxiv.org/abs/2412.15322">Paper</a> | <a href="https://hkchengrex.github.io/MMAudio">Webpage</a> | <a href="https://huggingface.co/hkchengrex/MMAudio/tree/main">Models</a> | <a href="https://huggingface.co/spaces/hkchengrex/MMAudio"> Huggingface Demo</a> | <a href="https://colab.research.google.com/drive/1TAaXCY2-kPk4xE4PwKB3EqFbSnkUuzZ8?usp=sharing">Colab Demo</a> | <a href="https://replicate.com/zsxkib/mmaudio">Replicate Demo</a> -->
19
+ <a href="https://github.com/lym0302/DeepSound-V1">Paper</a> | <a href="https://github.com/lym0302/DeepSound-V1">Webpage</a> | <a href="https://github.com/lym0302/DeepSound-V1"> Huggingface Demo</a>
20
+ </p>
21
+ </div>
22
 
23
+ ## [DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos](https://github.com/lym0302/DeepSound-V1)
24
 
25
+ <!-- [Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/) -->
26
 
27
+ <!-- University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation -->
28
 
29
+ <!-- ICCV 2025 -->
30
 
31
  ## Highlight
32
 
33
+ DeepSound-V1 is a framework that enables audio generation from videos with initial step-by-step thinking, without requiring extra annotations, based on the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM).
 
 
34
 
35
+ <!-- ## Results
 
36
 
37
  (All audio from our algorithm MMAudio)
38
 
39
+ Videos from Sora:
40
 
41
  https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330
42
 
43
+ Videos from Veo 2:
44
+
45
+ https://github.com/user-attachments/assets/8a11419e-fee2-46e0-9e67-dfb03c48d00e
46
 
47
+ Videos from MovieGen/Hunyuan Video/VGGSound:
48
 
49
  https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca
50
 
51
+ For more results, visit https://hkchengrex.com/MMAudio/video_main.html. -->
52
+
53
 
54
  ## Installation
55
+ ```bash
56
+ conda create -n deepsound-v1 python=3.10.16 -y
57
+ conda activate deepsound-v1
58
+ pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
59
+ pip install flash-attn==2.5.8 --no-build-isolation
60
+ pip install -e .
61
+ pip install -r requirements.txt
62
+ ```
63
+
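A quick sanity check for the environment created above (this snippet is not part of the repository; it only imports packages installed by the commands shown):

```python
# Environment sanity check only; not part of the DeepSound-V1 code base.
import torch
import flash_attn  # noqa: F401  # verifies the flash-attn build imports cleanly

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
```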
64
 
65
+ <!-- We have only tested this on Ubuntu.
66
 
67
  ### Prerequisites
68
 
69
  We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.
70
 
71
+ - Python 3.9+
72
+ - PyTorch **2.5.1+** and corresponding torchvision/torchaudio (pick your CUDA version https://pytorch.org/, pip install recommended)
73
+ <!-- - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`) -->
74
 
75
+ <!-- **1. Install prerequisite if not yet met:**
76
 
77
  ```bash
78
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
79
  ```
80
 
81
+ (Or any other CUDA versions that your GPUs/driver support) -->
82
 
83
+ <!-- ```
84
+ conda install -c conda-forge 'ffmpeg<7'
 
85
  ```
86
+ (Optional, if you use miniforge and don't already have the appropriate ffmpeg) -->
87
 
88
+ <!-- **2. Clone our repository:**
89
 
90
  ```bash
91
+ git clone https://github.com/lym0302/DeepSound-V1.git
92
  ```
93
 
94
+ **3. Install with pip (install pytorch first before attempting this!):**
95
 
96
  ```bash
97
+ cd DeepSound-V1
98
+ pip install -e .
 
 
 
 
 
99
  ```
100
 
101
+ (If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip) -->
102
+
103
+
104
+ <!-- The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
105
+ The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
106
+ See [MODELS.md](docs/MODELS.md) for more details. -->
107
+
108
  ## Demo
109
 
110
+ ### Pretrained models
111
+ See [MODELS.md](docs/MODELS.md).
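For reference, a sketch of how the checkpoints could be fetched manually, mirroring the commented-out download block in the new `app.py` from this commit (repo IDs, URLs, and target directories are taken from there; MODELS.md remains the authoritative list):

```python
# Sketch only: mirrors the commented-out download block in app.py from this commit.
import os

import requests
from huggingface_hub import snapshot_download

# Chain-of-thought MLLM used by step 0 and step 2 (see the Pipeline setup in app.py).
os.makedirs("pretrained/mllm", exist_ok=True)
snapshot_download(repo_id="lym0302/VideoLLaMA2.1-7B-AV-CoT", cache_dir="pretrained/mllm")

# BS-RoFormer checkpoint and config used by step 3 (step3_mode='bs_roformer').
remove_vo_dir = "pretrained/remove_vo/checkpoints"
os.makedirs(remove_vo_dir, exist_ok=True)
urls = [
    "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/model_bs_roformer_ep_317_sdr_12.9755.ckpt",
    "https://raw.githubusercontent.com/ZFTurbo/Music-Source-Separation-Training/main/configs/viperx/model_bs_roformer_ep_317_sdr_12.9755.yaml",
]
for url in urls:
    file_path = os.path.join(remove_vo_dir, url.split("/")[-1])
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(file_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

# MMAudio weights for step 1 are expected under pretrained/v2a/mmaudio.
os.makedirs("pretrained/v2a/mmaudio", exist_ok=True)
```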
112
 
113
  ### Command-line interface
114
 
115
  With `demo.py`
116
+
117
  ```bash
118
+ python demo.py -i <video_path>
119
  ```
120
+
121
+ All training parameters are [here]().
122
+
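Beyond the CLI, the pipeline can be driven from Python. The sketch below mirrors how the new `app.py` in this commit builds and calls it; the constructor arguments, mode strings, and yielded dictionary keys are taken from `app.py`, while the input path is a placeholder and whether `demo.py` shares this interface is not shown in this diff:

```python
# Sketch mirroring app.py from this commit; checkpoints must already be in place.
import sys

sys.path.append('third_party/MMAudio')  # app.py adds this path before importing the pipeline

from pipeline.pipeline import Pipeline

pipeline = Pipeline(
    step0_model_dir='pretrained/mllm/models--lym0302--VideoLLaMA2.1-7B-AV-CoT',
    step1_mode='mmaudio_medium_44k',
    step2_model_dir='pretrained/mllm/models--lym0302--VideoLLaMA2.1-7B-AV-CoT',
    step2_mode='cot',
    step3_mode='bs_roformer',
)

# run_for_gradio is a generator: each step yields a dict with a 'log' message, and the
# final dict also carries 'temp_final_audio_path' and 'temp_final_video_path'.
for step_results in pipeline.run_for_gradio(
        video_input='input.mp4',   # placeholder path
        output_dir='output',
        mode='s4',                 # 's3' or 's4', as exposed in the Gradio UI
        postp_mode='neg',          # 'rm', 'rep', or 'neg'
        prompt='',
        negative_prompt='',
        duration=10,
        seed=42):
    print(step_results['log'])
    if step_results['log'] == 'Finish step-by-step v2a.':
        break
```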
123
+ <!-- The output (audio in `.wav` format, and video in `.mp4` format) will be saved in `./output`.
124
  See the file for more options.
125
  Simply omit the `--video` option for text-to-audio synthesis.
126
+ The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in a lower quality. -->
127
 
128
+ <!-- ### Gradio interface
 
129
 
130
  Supports video-to-audio and text-to-audio synthesis.
131
+ You can also try experimental image-to-audio synthesis which duplicates the input image to a video for processing. This might be interesting to some but it is not something MMAudio has been trained for.
132
+ Use [port forwarding](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) (e.g., `ssh -L 7860:localhost:7860 server`) if necessary. The default port is `7860` which you can specify with `--port`.
133
 
134
+ ```bash
135
  python gradio_demo.py
136
+ ``` -->
137
 
 
138
 
 
 
 
139
 
140
+ ## Evaluation
141
+ Refer to [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.
142
+ See [EVAL.md](docs/EVAL.md).
143
 
 
 
144
 
145
+ ## Citation
146
+
147
+ <!-- ```bibtex
148
+ @inproceedings{cheng2025taming,
149
+ title={Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
150
+ author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
151
+ booktitle={CVPR},
152
+ year={2025}
153
+ }
154
+ ``` -->
155
+
156
+ ## Relevant Repositories
157
+
158
+ - [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.
159
+
160
 
161
  ## Acknowledgement
 
 
 
 
162
 
163
+ Many thanks to:
164
+ - [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
165
+ - [MMAudio](https://github.com/hkchengrex/MMAudio)
166
+ - [FoleyCrafter](https://github.com/open-mmlab/FoleyCrafter)
167
+ - [BS-RoFormer](https://github.com/ZFTurbo/Music-Source-Separation-Training)
app.py CHANGED
@@ -1,275 +1,162 @@
1
- import spaces
2
- import logging
3
- from datetime import datetime
4
- from pathlib import Path
5
-
6
- import gradio as gr
7
- import torch
8
- import torchaudio
9
  import os
10
 
11
- try:
12
- import mmaudio
13
- except ImportError:
14
- os.system("pip install -e .")
15
- import mmaudio
16
-
17
- from mmaudio.eval_utils import (ModelConfig, all_model_cfg, generate, load_video, make_video,
18
- setup_eval_logging)
19
- from mmaudio.model.flow_matching import FlowMatching
20
- from mmaudio.model.networks import MMAudio, get_my_mmaudio
21
- from mmaudio.model.sequence_config import SequenceConfig
22
- from mmaudio.model.utils.features_utils import FeaturesUtils
23
- import tempfile
24
-
25
- torch.backends.cuda.matmul.allow_tf32 = True
26
- torch.backends.cudnn.allow_tf32 = True
27
-
28
- log = logging.getLogger()
29
-
30
- device = 'cpu'
31
- dtype = torch.bfloat16
32
-
33
- model: ModelConfig = all_model_cfg['large_44k_v2']
34
- model.download_if_needed()
35
- output_dir = Path('./output/gradio')
36
 
37
  setup_eval_logging()
 
 
 
 
 
 
 
38
 
39
-
40
- def get_model() -> tuple[MMAudio, FeaturesUtils, SequenceConfig]:
41
- seq_cfg = model.seq_cfg
42
-
43
- net: MMAudio = get_my_mmaudio(model.model_name).to(device, dtype).eval()
44
- net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
45
- log.info(f'Loaded weights from {model.model_path}')
46
-
47
- feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
48
- synchformer_ckpt=model.synchformer_ckpt,
49
- enable_conditions=True,
50
- mode=model.mode,
51
- bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
52
- need_vae_encoder=False)
53
- feature_utils = feature_utils.to(device, dtype).eval()
54
-
55
- return net, feature_utils, seq_cfg
56
-
57
-
58
- net, feature_utils, seq_cfg = get_model()
59
-
60
-
61
- @spaces.GPU(duration=120)
62
- @torch.inference_mode()
63
- def video_to_audio(video: gr.Video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
64
- cfg_strength: float, duration: float):
65
-
66
- rng = torch.Generator(device=device)
67
- if seed >= 0:
68
- rng.manual_seed(seed)
69
- else:
70
- rng.seed()
71
- fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
72
-
73
- video_info = load_video(video, duration)
74
- clip_frames = video_info.clip_frames
75
- sync_frames = video_info.sync_frames
76
- duration = video_info.duration_sec
77
- clip_frames = clip_frames.unsqueeze(0)
78
- sync_frames = sync_frames.unsqueeze(0)
79
- seq_cfg.duration = duration
80
- net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
81
-
82
- audios = generate(clip_frames,
83
- sync_frames, [prompt],
84
- negative_text=[negative_prompt],
85
- feature_utils=feature_utils,
86
- net=net,
87
- fm=fm,
88
- rng=rng,
89
- cfg_strength=cfg_strength)
90
- audio = audios.float().cpu()[0]
91
-
92
- # current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
93
- video_save_path = tempfile.NamedTemporaryFile(delete=False, suffix='.mp4').name
94
- # output_dir.mkdir(exist_ok=True, parents=True)
95
- # video_save_path = output_dir / f'{current_time_string}.mp4'
96
- make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
97
- log.info(f'Saved video to {video_save_path}')
98
- return video_save_path
99
-
100
-
101
- @spaces.GPU(duration=120)
102
- @torch.inference_mode()
103
- def text_to_audio(prompt: str, negative_prompt: str, seed: int, num_steps: int, cfg_strength: float,
104
- duration: float):
105
-
106
- rng = torch.Generator(device=device)
107
- if seed >= 0:
108
- rng.manual_seed(seed)
109
- else:
110
- rng.seed()
111
- fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
112
-
113
- clip_frames = sync_frames = None
114
- seq_cfg.duration = duration
115
- net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
116
-
117
- audios = generate(clip_frames,
118
- sync_frames, [prompt],
119
- negative_text=[negative_prompt],
120
- feature_utils=feature_utils,
121
- net=net,
122
- fm=fm,
123
- rng=rng,
124
- cfg_strength=cfg_strength)
125
- audio = audios.float().cpu()[0]
126
-
127
- audio_save_path = tempfile.NamedTemporaryFile(delete=False, suffix='.flac').name
128
- torchaudio.save(audio_save_path, audio, seq_cfg.sampling_rate)
129
- log.info(f'Saved audio to {audio_save_path}')
130
- return audio_save_path
131
 
132
 
133
  video_to_audio_tab = gr.Interface(
134
  fn=video_to_audio,
 
135
  description="""
136
- Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
137
- Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
138
 
139
  NOTE: It takes longer to process high-resolution videos (>384 px on the shorter side).
140
  Doing so does not improve results.
141
 
142
- The model has been trained on 8-second videos. Using much longer or shorter videos will degrade performance. Around 5s~12s should be fine.
 
143
  """,
144
  inputs=[
145
  gr.Video(),
146
  gr.Text(label='Prompt'),
147
- gr.Text(label='Negative prompt', value='music'),
148
- gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
149
- gr.Number(label='Num steps', value=25, precision=0, minimum=1),
150
- gr.Number(label='Guidance Strength', value=4.5, minimum=1),
151
- gr.Number(label='Duration (sec)', value=8, minimum=1),
152
- ],
153
- outputs='playable_video',
154
- cache_examples=False,
155
- title='MMAudio — Video-to-Audio Synthesis',
156
- examples=[
157
- [
158
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_beach.mp4',
159
- 'waves, seagulls',
160
- '',
161
- 0,
162
- 25,
163
- 4.5,
164
- 10,
165
- ],
166
- [
167
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_serpent.mp4',
168
- '',
169
- 'music',
170
- 0,
171
- 25,
172
- 4.5,
173
- 10,
174
- ],
175
- [
176
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_seahorse.mp4',
177
- 'bubbles',
178
- '',
179
- 0,
180
- 25,
181
- 4.5,
182
- 10,
183
- ],
184
- [
185
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_india.mp4',
186
- 'Indian holy music',
187
- '',
188
- 0,
189
- 25,
190
- 4.5,
191
- 10,
192
- ],
193
- [
194
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_galloping.mp4',
195
- 'galloping',
196
- '',
197
- 0,
198
- 25,
199
- 4.5,
200
- 10,
201
- ],
202
- [
203
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_kraken.mp4',
204
- 'waves, storm',
205
- '',
206
- 0,
207
- 25,
208
- 4.5,
209
- 10,
210
- ],
211
- [
212
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_nyc.mp4',
213
- '',
214
- '',
215
- 0,
216
- 25,
217
- 4.5,
218
- 10,
219
- ],
220
- [
221
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/mochi_storm.mp4',
222
- 'storm',
223
- '',
224
- 0,
225
- 25,
226
- 4.5,
227
- 10,
228
- ],
229
- [
230
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_spring.mp4',
231
- '',
232
- '',
233
- 0,
234
- 25,
235
- 4.5,
236
- 10,
237
- ],
238
- [
239
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_typing.mp4',
240
- 'typing',
241
- '',
242
- 0,
243
- 25,
244
- 4.5,
245
- 10,
246
- ],
247
- [
248
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_wake_up.mp4',
249
- '',
250
- '',
251
- 0,
252
- 25,
253
- 4.5,
254
- 10,
255
- ],
256
- ])
257
 
258
- text_to_audio_tab = gr.Interface(
259
- fn=text_to_audio,
260
- inputs=[
261
- gr.Text(label='Prompt'),
262
- gr.Text(label='Negative prompt'),
263
- gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
264
- gr.Number(label='Num steps', value=25, precision=0, minimum=1),
265
- gr.Number(label='Guidance Strength', value=4.5, minimum=1),
266
- gr.Number(label='Duration (sec)', value=8, minimum=1),
267
  ],
268
- outputs='audio',
269
  cache_examples=False,
270
- title='MMAudioText-to-Audio Synthesis',
271
  )
272
 
 
273
  if __name__ == "__main__":
274
- gr.TabbedInterface([video_to_audio_tab, text_to_audio_tab],
275
- ['Video-to-Audio', 'Text-to-Audio']).launch(allowed_paths=[output_dir])
1
  import os
2
+ import sys
3
+ import time
4
+ import gradio as gr
5
+ import subprocess
6
+ from pathlib import Path
7
+ import requests
8
+ from moviepy.editor import AudioFileClip, VideoFileClip
9
+
10
+ project_root = os.path.dirname(os.path.abspath(__file__))
11
+ mmaudio_path = os.path.join(project_root, 'third_party', 'MMAudio')
12
+ sys.path.append(mmaudio_path)
13
+
14
+ from pipeline.pipeline import Pipeline
15
+ from third_party.MMAudio.mmaudio.eval_utils import setup_eval_logging
16
+
17
+ # # download model
18
+ # os.makedirs("pretrained/mllm", exist_ok=True)
19
+ # from huggingface_hub import snapshot_download
20
+ # repo_local_path = snapshot_download(repo_id="lym0302/VideoLLaMA2.1-7B-AV-CoT", cache_dir='pretrained/mllm')
21
+
22
+ # remove_vo_model_dir = "pretrained/remove_vo/checkpoints"
23
+ # os.makedirs(remove_vo_model_dir, exist_ok=True)
24
+ # urls = ["https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/model_bs_roformer_ep_317_sdr_12.9755.ckpt",
25
+ # "https://raw.githubusercontent.com/ZFTurbo/Music-Source-Separation-Training/main/configs/viperx/model_bs_roformer_ep_317_sdr_12.9755.yaml"]
26
+ # for url in urls:
27
+ # file_name = url.split("/")[-1] # Extract file name from URL
28
+ # file_path = os.path.join(remove_vo_model_dir, file_name)
29
+ # response = requests.get(url, stream=True)
30
+ # if response.status_code == 200:
31
+ # with open(file_path, "wb") as f:
32
+ # for chunk in response.iter_content(chunk_size=8192): # Use a chunk size of 8 KB
33
+ # f.write(chunk)
34
+ # print(f"File downloaded successfully and saved to {file_path}")
35
+ # else:
36
+ # print(f"Failed to download the file. Status code: {response.status_code}")
37
+
38
+ # os.makedirs("pretrained/v2a/mmaudio", exist_ok=True)
39
 
40
 
41
  setup_eval_logging()
42
+ pipeline = Pipeline(
43
+ step0_model_dir='pretrained/mllm/models--lym0302--VideoLLaMA2.1-7B-AV-CoT',
44
+ step1_mode='mmaudio_medium_44k',
45
+ step2_model_dir='pretrained/mllm/models--lym0302--VideoLLaMA2.1-7B-AV-CoT',
46
+ step2_mode='cot',
47
+ step3_mode='bs_roformer',
48
+ )
49
 
50
+ output_dir = "output_gradio"
51
+ os.makedirs(output_dir, exist_ok=True)
52
+ skip_final_video = False
53
+ def video_to_audio(
54
+ video_input: gr.Video,
55
+ prompt: str='',
56
+ negative_prompt: str='',
57
+ mode: str='s4',
58
+ postp_mode: str='neg',
59
+ duration: float=10,
60
+ seed: int=42,):
61
+
62
+ log_messages = [] # store the log messages
63
+ def log_info(msg):
64
+ log_messages.append(msg)
65
+ return "\n".join(log_messages) # return the full log history each time
66
+
67
+ if not video_input:
68
+ yield None, log_info("Error: No video input provided.")
69
+ return
70
+
71
+ yield None, log_info("Generate high-quality audio from video step-by-step...") # initial log message
72
+
73
+ st_infer = time.time()
74
+ video_input = str(video_input)
75
+
76
+ for step_results in pipeline.run_for_gradio(
77
+ video_input=video_input,
78
+ output_dir=output_dir,
79
+ mode=mode,
80
+ postp_mode=postp_mode,
81
+ prompt=prompt,
82
+ negative_prompt=negative_prompt,
83
+ duration=duration,
84
+ seed=seed
85
+ ):
86
+ if step_results['log'] == 'Finish step-by-step v2a.':
87
+ break
88
+ else:
89
+ yield None, log_info(step_results['log'])
90
+
91
+
92
+ temp_final_audio_path = step_results["temp_final_audio_path"]
93
+ temp_final_video_path = step_results["temp_final_video_path"]
94
+
95
+ video_name_stem = Path(video_input).stem
96
+ final_audio_path = str(Path(output_dir) / f'{video_name_stem}.wav')
97
+ final_video_path = str(Path(output_dir) / f'{video_name_stem}.mp4')
98
+
99
+ if temp_final_audio_path is not None:
100
+ subprocess.run(['cp', str(temp_final_audio_path), final_audio_path], check=True)
101
+ step_results["final_audio_path"] = final_audio_path
102
+
103
+ if skip_final_video:
104
+ step_results["final_video_path"] = None
105
+ else:
106
+ if temp_final_video_path is not None:
107
+ subprocess.run(['cp', str(temp_final_video_path), final_video_path], check=True)
108
+ else:
109
+ audio = AudioFileClip(final_audio_path)
110
+ video = VideoFileClip(video_input)
111
+ duration = min(audio.duration, video.duration)
112
+ audio = audio.subclip(0, duration)
113
+ video.audio = audio
114
+ video = video.subclip(0, duration)
115
+ video.write_videofile(final_video_path)
116
+ step_results["final_video_path"] = final_video_path
117
+
118
+ et_infer = time.time()
119
+ print(f"Inference time: {et_infer - st_infer:.2f} s.")
120
+ print("step_results: ", step_results)
121
+
122
+ yield (final_video_path if os.path.exists(final_video_path) else None), log_info(step_results['log'])
123
 
124
 
125
  video_to_audio_tab = gr.Interface(
126
  fn=video_to_audio,
127
+ # Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
128
  description="""
129
+ Code: <a href="https://github.com/lym0302/DeepSound-V1">https://github.com/lym0302/DeepSound-V1</a><br>
 
130
 
131
  NOTE: It takes longer to process high-resolution videos (>384 px on the shorter side).
132
  Doing so does not improve results.
133
 
134
+ This is a step-by-step v2a process and may take a long time.
135
+ If Post Processing is set to 'rm', the generated video may be None.
136
  """,
137
  inputs=[
138
  gr.Video(),
139
  gr.Text(label='Prompt'),
140
+ gr.Text(label='Negative prompt', value=''),
141
+ gr.Radio(["s3", "s4"], label="Mode", value="s4"),
142
+ gr.Radio(["rm", "rep", "neg"], label="Post Processing", value="neg"),
143
+ gr.Number(label='Duration (sec)', value=10, minimum=1),
144
+ gr.Number(label='Seed (42: random)', value=42, precision=0, minimum=-1),
145
 
146
  ],
147
+ outputs=[gr.Video(label="Generated Video"), gr.Text(label="Logs"),],
148
  cache_examples=False,
149
+ title='DeepSound-V1 Video-to-Audio Synthesis',
150
  )
151
 
152
+
153
  if __name__ == "__main__":
154
+ gr.TabbedInterface([video_to_audio_tab],
155
+ ['Video-to-Audio']).launch(allowed_paths=[output_dir])
156
+
157
+
158
+ # if __name__ == "__main__":
159
+ # port = 8000
160
+ # gr.TabbedInterface([video_to_audio_tab, ],
161
+ # ['Video-to-Audio', ]).launch(
162
+ # server_port=port, allowed_paths=[output_dir])
demo.py DELETED
@@ -1,135 +0,0 @@
1
- import logging
2
- from argparse import ArgumentParser
3
- from pathlib import Path
4
-
5
- import torch
6
- import torchaudio
7
-
8
- from mmaudio.eval_utils import (ModelConfig, all_model_cfg, generate, load_video, make_video,
9
- setup_eval_logging)
10
- from mmaudio.model.flow_matching import FlowMatching
11
- from mmaudio.model.networks import MMAudio, get_my_mmaudio
12
- from mmaudio.model.utils.features_utils import FeaturesUtils
13
-
14
- torch.backends.cuda.matmul.allow_tf32 = True
15
- torch.backends.cudnn.allow_tf32 = True
16
-
17
- log = logging.getLogger()
18
-
19
-
20
- @torch.inference_mode()
21
- def main():
22
- setup_eval_logging()
23
-
24
- parser = ArgumentParser()
25
- parser.add_argument('--variant',
26
- type=str,
27
- default='large_44k_v2',
28
- help='small_16k, small_44k, medium_44k, large_44k, large_44k_v2')
29
- parser.add_argument('--video', type=Path, help='Path to the video file')
30
- parser.add_argument('--prompt', type=str, help='Input prompt', default='')
31
- parser.add_argument('--negative_prompt', type=str, help='Negative prompt', default='')
32
- parser.add_argument('--duration', type=float, default=8.0)
33
- parser.add_argument('--cfg_strength', type=float, default=4.5)
34
- parser.add_argument('--num_steps', type=int, default=25)
35
-
36
- parser.add_argument('--mask_away_clip', action='store_true')
37
-
38
- parser.add_argument('--output', type=Path, help='Output directory', default='./output')
39
- parser.add_argument('--seed', type=int, help='Random seed', default=42)
40
- parser.add_argument('--skip_video_composite', action='store_true')
41
- parser.add_argument('--full_precision', action='store_true')
42
-
43
- args = parser.parse_args()
44
-
45
- if args.variant not in all_model_cfg:
46
- raise ValueError(f'Unknown model variant: {args.variant}')
47
- model: ModelConfig = all_model_cfg[args.variant]
48
- model.download_if_needed()
49
- seq_cfg = model.seq_cfg
50
-
51
- if args.video:
52
- video_path: Path = Path(args.video).expanduser()
53
- else:
54
- video_path = None
55
- prompt: str = args.prompt
56
- negative_prompt: str = args.negative_prompt
57
- output_dir: str = args.output.expanduser()
58
- seed: int = args.seed
59
- num_steps: int = args.num_steps
60
- duration: float = args.duration
61
- cfg_strength: float = args.cfg_strength
62
- skip_video_composite: bool = args.skip_video_composite
63
- mask_away_clip: bool = args.mask_away_clip
64
-
65
- device = 'cuda'
66
- dtype = torch.float32 if args.full_precision else torch.bfloat16
67
-
68
- output_dir.mkdir(parents=True, exist_ok=True)
69
-
70
- # load a pretrained model
71
- net: MMAudio = get_my_mmaudio(model.model_name).to(device, dtype).eval()
72
- net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
73
- log.info(f'Loaded weights from {model.model_path}')
74
-
75
- # misc setup
76
- rng = torch.Generator(device=device)
77
- rng.manual_seed(seed)
78
- fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
79
-
80
- feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
81
- synchformer_ckpt=model.synchformer_ckpt,
82
- enable_conditions=True,
83
- mode=model.mode,
84
- bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
85
- need_vae_encoder=False)
86
- feature_utils = feature_utils.to(device, dtype).eval()
87
-
88
- if video_path is not None:
89
- log.info(f'Using video {video_path}')
90
- video_info = load_video(video_path, duration)
91
- clip_frames = video_info.clip_frames
92
- sync_frames = video_info.sync_frames
93
- duration = video_info.duration_sec
94
- if mask_away_clip:
95
- clip_frames = None
96
- else:
97
- clip_frames = clip_frames.unsqueeze(0)
98
- sync_frames = sync_frames.unsqueeze(0)
99
- else:
100
- log.info('No video provided -- text-to-audio mode')
101
- clip_frames = sync_frames = None
102
-
103
- seq_cfg.duration = duration
104
- net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
105
-
106
- log.info(f'Prompt: {prompt}')
107
- log.info(f'Negative prompt: {negative_prompt}')
108
-
109
- audios = generate(clip_frames,
110
- sync_frames, [prompt],
111
- negative_text=[negative_prompt],
112
- feature_utils=feature_utils,
113
- net=net,
114
- fm=fm,
115
- rng=rng,
116
- cfg_strength=cfg_strength)
117
- audio = audios.float().cpu()[0]
118
- if video_path is not None:
119
- save_path = output_dir / f'{video_path.stem}.flac'
120
- else:
121
- safe_filename = prompt.replace(' ', '_').replace('/', '_').replace('.', '')
122
- save_path = output_dir / f'{safe_filename}.flac'
123
- torchaudio.save(save_path, audio, seq_cfg.sampling_rate)
124
-
125
- log.info(f'Audio saved to {save_path}')
126
- if video_path is not None and not skip_video_composite:
127
- video_save_path = output_dir / f'{video_path.stem}.mp4'
128
- make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
129
- log.info(f'Video saved to {output_dir / video_save_path}')
130
-
131
- log.info('Memory usage: %.2f GB', torch.cuda.max_memory_allocated() / (2**30))
132
-
133
-
134
- if __name__ == '__main__':
135
- main()
docs/images/icon.png DELETED
Binary file (163 Bytes)
 
docs/index.html DELETED
@@ -1,147 +0,0 @@
1
- <!DOCTYPE html>
2
- <html lang="en">
3
- <head>
4
- <!-- Google tag (gtag.js) -->
5
- <script async src="https://www.googletagmanager.com/gtag/js?id=G-0JKBJ3WRJZ"></script>
6
- <script>
7
- window.dataLayer = window.dataLayer || [];
8
- function gtag(){dataLayer.push(arguments);}
9
- gtag('js', new Date());
10
- gtag('config', 'G-0JKBJ3WRJZ');
11
- </script>
12
-
13
- <link rel="preconnect" href="https://fonts.googleapis.com">
14
- <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
15
- <link href="https://fonts.googleapis.com/css2?family=Source+Sans+3&display=swap" rel="stylesheet">
16
- <meta charset="UTF-8">
17
- <title>MMAudio</title>
18
-
19
- <link rel="icon" type="image/png" href="images/icon.png">
20
-
21
- <meta name="viewport" content="width=device-width, initial-scale=1">
22
- <!-- CSS only -->
23
- <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
24
- integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
25
- <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
26
-
27
- <link rel="stylesheet" href="style.css">
28
- </head>
29
- <body>
30
-
31
- <body>
32
- <br><br><br><br>
33
- <div class="container">
34
- <div class="row text-center" style="font-size:38px">
35
- <div class="col strong">
36
- Taming Multimodal Joint Training for High-Quality <br>Video-to-Audio Synthesis
37
- </div>
38
- </div>
39
-
40
- <br>
41
- <div class="row text-center" style="font-size:28px">
42
- <div class="col">
43
- arXiv 2024
44
- </div>
45
- </div>
46
- <br>
47
-
48
- <div class="h-100 row text-center heavy justify-content-md-center" style="font-size:22px;">
49
- <div class="col-sm-auto px-lg-2">
50
- <a href="https://hkchengrex.github.io/">Ho Kei Cheng<sup>1</sup></a>
51
- </div>
52
- <div class="col-sm-auto px-lg-2">
53
- <nobr><a href="https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ">Masato Ishii<sup>2</sup></a></nobr>
54
- </div>
55
- <div class="col-sm-auto px-lg-2">
56
- <nobr><a href="https://scholar.google.com/citations?user=sXAjHFIAAAAJ">Akio Hayakawa<sup>2</sup></a></nobr>
57
- </div>
58
- <div class="col-sm-auto px-lg-2">
59
- <nobr><a href="https://scholar.google.com/citations?user=XCRO260AAAAJ">Takashi Shibuya<sup>2</sup></a></nobr>
60
- </div>
61
- <div class="col-sm-auto px-lg-2">
62
- <nobr><a href="https://www.alexander-schwing.de/">Alexander Schwing<sup>1</sup></a></nobr>
63
- </div>
64
- <div class="col-sm-auto px-lg-2" >
65
- <nobr><a href="https://www.yukimitsufuji.com/">Yuki Mitsufuji<sup>2,3</sup></a></nobr>
66
- </div>
67
- </div>
68
-
69
- <div class="h-100 row text-center heavy justify-content-md-center" style="font-size:22px;">
70
- <div class="col-sm-auto px-lg-2">
71
- <sup>1</sup>University of Illinois Urbana-Champaign
72
- </div>
73
- <div class="col-sm-auto px-lg-2">
74
- <sup>2</sup>Sony AI
75
- </div>
76
- <div class="col-sm-auto px-lg-2">
77
- <sup>3</sup>Sony Group Corporation
78
- </div>
79
- </div>
80
-
81
- <br>
82
-
83
- <br>
84
-
85
- <div class="h-100 row text-center justify-content-md-center" style="font-size:20px;">
86
- <!-- <div class="col-sm-2">
87
- <a href="https://arxiv.org/abs/2310.12982">[arXiv]</a>
88
- </div> -->
89
- <div class="col-sm-3">
90
- <a href="">[Paper (being prepared)]</a>
91
- </div>
92
- <div class="col-sm-3">
93
- <a href="https://github.com/hkchengrex/MMAudio">[Code]</a>
94
- </div>
95
- <!-- <div class="col-sm-2">
96
- <a
97
- href="https://colab.research.google.com/drive/1yo43XTbjxuWA7XgCUO9qxAi7wBI6HzvP?usp=sharing">[Colab]</a>
98
- </div> -->
99
- </div>
100
-
101
- <br>
102
-
103
- <hr>
104
-
105
- <div class="row" style="font-size:32px">
106
- <div class="col strong">
107
- TL;DR
108
- </div>
109
- </div>
110
- <br>
111
- <div class="row">
112
- <div class="col">
113
- <p class="light" style="text-align: left;">
114
- MMAudio generates synchronized audio given video and/or text inputs.
115
- </p>
116
- </div>
117
- </div>
118
-
119
- <br>
120
- <hr>
121
- <br>
122
-
123
- <div class="row" style="font-size:32px">
124
- <div class="col strong">
125
- Demo
126
- </div>
127
- </div>
128
- <br>
129
- <div class="row" style="font-size:48px">
130
- <div class="col strong text-center">
131
- <a href="video_main.html" style="text-decoration: underline;">&lt;More results&gt;</a>
132
- </div>
133
- </div>
134
- <br>
135
- <div class="video-container" style="text-align: center;">
136
- <iframe src="https://youtube.com/embed/YElewUT2M4M"></iframe>
137
- </div>
138
-
139
- <br>
140
-
141
- <br><br>
142
- <br><br>
143
-
144
- </div>
145
-
146
- </body>
147
- </html>
docs/style.css DELETED
@@ -1,78 +0,0 @@
1
- body {
2
- font-family: 'Source Sans 3', sans-serif;
3
- font-size: 18px;
4
- margin-left: auto;
5
- margin-right: auto;
6
- font-weight: 400;
7
- height: 100%;
8
- max-width: 1000px;
9
- }
10
-
11
- table {
12
- width: 100%;
13
- border-collapse: collapse;
14
- }
15
- th, td {
16
- border: 1px solid #ddd;
17
- padding: 8px;
18
- text-align: center;
19
- }
20
- th {
21
- background-color: #f2f2f2;
22
- }
23
- video {
24
- width: 100%;
25
- height: auto;
26
- }
27
- p {
28
- font-size: 28px;
29
- }
30
- h2 {
31
- font-size: 36px;
32
- }
33
-
34
- .strong {
35
- font-weight: 700;
36
- }
37
-
38
- .light {
39
- font-weight: 100;
40
- }
41
-
42
- .heavy {
43
- font-weight: 900;
44
- }
45
-
46
- .column {
47
- float: left;
48
- }
49
-
50
- a:link,
51
- a:visited {
52
- color: #05538f;
53
- text-decoration: none;
54
- }
55
-
56
- a:hover {
57
- color: #63cbdd;
58
- }
59
-
60
- hr {
61
- border: 0;
62
- height: 1px;
63
- background-image: linear-gradient(to right, rgba(0, 0, 0, 0), rgba(0, 0, 0, 0.75), rgba(0, 0, 0, 0));
64
- }
65
-
66
- .video-container {
67
- position: relative;
68
- padding-bottom: 56.25%; /* 16:9 */
69
- height: 0;
70
- }
71
-
72
- .video-container iframe {
73
- position: absolute;
74
- top: 0;
75
- left: 0;
76
- width: 100%;
77
- height: 100%;
78
- }
docs/style_videos.css DELETED
@@ -1,52 +0,0 @@
1
- body {
2
- font-family: 'Source Sans 3', sans-serif;
3
- font-size: 1.5vh;
4
- font-weight: 400;
5
- }
6
-
7
- table {
8
- width: 100%;
9
- border-collapse: collapse;
10
- }
11
- th, td {
12
- border: 1px solid #ddd;
13
- padding: 8px;
14
- text-align: center;
15
- }
16
- th {
17
- background-color: #f2f2f2;
18
- }
19
- video {
20
- width: 100%;
21
- height: auto;
22
- }
23
- p {
24
- font-size: 1.5vh;
25
- font-weight: bold;
26
- }
27
- h2 {
28
- font-size: 2vh;
29
- font-weight: bold;
30
- }
31
-
32
- .video-container {
33
- position: relative;
34
- padding-bottom: 56.25%; /* 16:9 */
35
- height: 0;
36
- }
37
-
38
- .video-container iframe {
39
- position: absolute;
40
- top: 0;
41
- left: 0;
42
- width: 100%;
43
- height: 100%;
44
- }
45
-
46
- .video-header {
47
- background-color: #f2f2f2;
48
- text-align: center;
49
- font-size: 1.5vh;
50
- font-weight: bold;
51
- padding: 8px;
52
- }
docs/video_gen.html DELETED
@@ -1,254 +0,0 @@
1
- <!DOCTYPE html>
2
- <html lang="en">
3
- <head>
4
- <!-- Google tag (gtag.js) -->
5
- <script async src="https://www.googletagmanager.com/gtag/js?id=G-0JKBJ3WRJZ"></script>
6
- <script>
7
- window.dataLayer = window.dataLayer || [];
8
- function gtag(){dataLayer.push(arguments);}
9
- gtag('js', new Date());
10
- gtag('config', 'G-0JKBJ3WRJZ');
11
- </script>
12
-
13
- <link href='https://fonts.googleapis.com/css?family=Source+Sans+Pro' rel='stylesheet' type='text/css'>
14
- <meta charset="UTF-8">
15
- <title>MMAudio</title>
16
-
17
- <link rel="icon" type="image/png" href="images/icon.png">
18
-
19
- <meta name="viewport" content="width=device-width, initial-scale=1">
20
- <!-- CSS only -->
21
- <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
22
- integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
23
- <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.7.1/jquery.min.js"></script>
24
-
25
- <link rel="stylesheet" href="style_videos.css">
26
- </head>
27
- <body>
28
-
29
- <div id="moviegen_all">
30
- <h2 id="moviegen" style="text-align: center;">Comparisons with Movie Gen Audio on Videos Generated by MovieGen</h2>
31
- <p id="moviegen1" style="overflow: hidden;">
32
- Example 1: Ice cracking with sharp snapping sound, and metal tool scraping against the ice surface.
33
- <span style="float: right;"><a href="#index">Back to index</a></span>
34
- </p>
35
-
36
- <div class="row g-1">
37
- <div class="col-sm-6">
38
- <div class="video-header">Movie Gen Audio</div>
39
- <div class="video-container">
40
- <iframe src="https://youtube.com/embed/d7Lb0ihtGcE"></iframe>
41
- </div>
42
- </div>
43
- <div class="col-sm-6">
44
- <div class="video-header">Ours</div>
45
- <div class="video-container">
46
- <iframe src="https://youtube.com/embed/F4JoJ2r2m8U"></iframe>
47
- </div>
48
- </div>
49
- </div>
50
- <br>
51
-
52
- <!-- <p id="moviegen2">Example 2: Rhythmic splashing and lapping of water. <span style="float:right;"><a href="#index">Back to index</a></span> </p>
53
-
54
- <table>
55
- <thead>
56
- <tr>
57
- <th>Movie Gen Audio</th>
58
- <th>Ours</th>
59
- </tr>
60
- </thead>
61
- <tbody>
62
- <tr>
63
- <td width="50%">
64
- <div class="video-container">
65
- <iframe src="https://youtube.com/embed/5gQNPK99CIk"></iframe>
66
- </div>
67
- </td>
68
- <td width="50%">
69
- <div class="video-container">
70
- <iframe src="https://youtube.com/embed/AbwnTzG-BpA"></iframe>
71
- </div>
72
- </td>
73
- </tr>
74
- </tbody>
75
- </table> -->
76
-
77
- <p id="moviegen2" style="overflow: hidden;">
78
- Example 2: Rhythmic splashing and lapping of water.
79
- <span style="float:right;"><a href="#index">Back to index</a></span>
80
- </p>
81
- <div class="row g-1">
82
- <div class="col-sm-6">
83
- <div class="video-header">Movie Gen Audio</div>
84
- <div class="video-container">
85
- <iframe src="https://youtube.com/embed/5gQNPK99CIk"></iframe>
86
- </div>
87
- </div>
88
- <div class="col-sm-6">
89
- <div class="video-header">Ours</div>
90
- <div class="video-container">
91
- <iframe src="https://youtube.com/embed/AbwnTzG-BpA"></iframe>
92
- </div>
93
- </div>
94
- </div>
95
- <br>
96
-
97
- <p id="moviegen3" style="overflow: hidden;">
98
- Example 3: Shovel scrapes against dry earth.
99
- <span style="float:right;"><a href="#index">Back to index</a></span>
100
- </p>
101
- <div class="row g-1">
102
- <div class="col-sm-6">
103
- <div class="video-header">Movie Gen Audio</div>
104
- <div class="video-container">
105
- <iframe src="https://youtube.com/embed/PUKGyEve7XQ"></iframe>
106
- </div>
107
- </div>
108
- <div class="col-sm-6">
109
- <div class="video-header">Ours</div>
110
- <div class="video-container">
111
- <iframe src="https://youtube.com/embed/CNn7i8VNkdc"></iframe>
112
- </div>
113
- </div>
114
- </div>
115
- <br>
116
-
117
-
118
- <p id="moviegen4" style="overflow: hidden;">
119
- (Failure case) Example 4: Creamy sound of mashed potatoes being scooped.
120
- <span style="float:right;"><a href="#index">Back to index</a></span>
121
- </p>
122
- <div class="row g-1">
123
- <div class="col-sm-6">
124
- <div class="video-header">Movie Gen Audio</div>
125
- <div class="video-container">
126
- <iframe src="https://youtube.com/embed/PJv1zxR9JjQ"></iframe>
127
- </div>
128
- </div>
129
- <div class="col-sm-6">
130
- <div class="video-header">Ours</div>
131
- <div class="video-container">
132
- <iframe src="https://youtube.com/embed/c3-LJ1lNsPQ"></iframe>
133
- </div>
134
- </div>
135
- </div>
136
- <br>
137
-
138
- </div>
139
-
140
- <div id="hunyuan_sora_all">
141
-
142
- <h2 id="hunyuan" style="text-align: center;">Results on Videos Generated by Hunyuan</h2>
143
- <p style="overflow: hidden;">
144
- <span style="float:right;"><a href="#index">Back to index</a></span>
145
- </p>
146
- <div class="row g-1">
147
- <div class="col-sm-6">
148
- <div class="video-header">Typing</div>
149
- <div class="video-container">
150
- <iframe src="https://youtube.com/embed/8ln_9hhH_nk"></iframe>
151
- </div>
152
- </div>
153
- <div class="col-sm-6">
154
- <div class="video-header">Water is rushing down a stream and pouring</div>
155
- <div class="video-container">
156
- <iframe src="https://youtube.com/embed/5df1FZFQj30"></iframe>
157
- </div>
158
- </div>
159
- </div>
160
- <div class="row g-1">
161
- <div class="col-sm-6">
162
- <div class="video-header">Waves on beach</div>
163
- <div class="video-container">
164
- <iframe src="https://youtube.com/embed/7wQ9D5WgpFc"></iframe>
165
- </div>
166
- </div>
167
- <div class="col-sm-6">
168
- <div class="video-header">Water droplet</div>
169
- <div class="video-container">
170
- <iframe src="https://youtube.com/embed/q7M2nsalGjM"></iframe>
171
- </div>
172
- </div>
173
- </div>
174
- <br>
175
-
176
- <h2 id="sora" style="text-align: center;">Results on Videos Generated by Sora</h2>
177
- <p style="overflow: hidden;">
178
- <span style="float:right;"><a href="#index">Back to index</a></span>
179
- </p>
180
- <div class="row g-1">
181
- <div class="col-sm-6">
182
- <div class="video-header">Ships riding waves</div>
183
- <div class="video-container">
184
- <iframe src="https://youtube.com/embed/JbgQzHHytk8"></iframe>
185
- </div>
186
- </div>
187
- <div class="col-sm-6">
188
- <div class="video-header">Train (no text prompt given)</div>
189
- <div class="video-container">
190
- <iframe src="https://youtube.com/embed/xOW7zrjpWC8"></iframe>
191
- </div>
192
- </div>
193
- </div>
194
- <div class="row g-1">
195
- <div class="col-sm-6">
196
- <div class="video-header">Seashore (no text prompt given)</div>
197
- <div class="video-container">
198
- <iframe src="https://youtube.com/embed/fIuw5Y8ZZ9E"></iframe>
199
- </div>
200
- </div>
201
- <div class="col-sm-6">
202
- <div class="video-header">Surfing (failure: unprompted music)</div>
203
- <div class="video-container">
204
- <iframe src="https://youtube.com/embed/UcSTk-v0M_s"></iframe>
205
- </div>
206
- </div>
207
- </div>
208
- <br>
209
-
210
- <div id="mochi_ltx_all">
211
- <h2 id="mochi" style="text-align: center;">Results on Videos Generated by Mochi 1</h2>
212
- <p style="overflow: hidden;">
213
- <span style="float:right;"><a href="#index">Back to index</a></span>
214
- </p>
215
- <div class="row g-1">
216
- <div class="col-sm-6">
217
- <div class="video-header">Magical fire and lightning (no text prompt given)</div>
218
- <div class="video-container">
219
- <iframe src="https://youtube.com/embed/tTlRZaSMNwY"></iframe>
220
- </div>
221
- </div>
222
- <div class="col-sm-6">
223
- <div class="video-header">Storm (no text prompt given)</div>
224
- <div class="video-container">
225
- <iframe src="https://youtube.com/embed/4hrZTMJUy3w"></iframe>
226
- </div>
227
- </div>
228
- </div>
229
- <br>
230
-
231
- <h2 id="ltx" style="text-align: center;">Results on Videos Generated by LTX-Video</h2>
232
- <p style="overflow: hidden;">
233
- <span style="float:right;"><a href="#index">Back to index</a></span>
234
- </p>
235
- <div class="row g-1">
236
- <div class="col-sm-6">
237
- <div class="video-header">Firewood burning and cracking</div>
238
- <div class="video-container">
239
- <iframe src="https://youtube.com/embed/P7_DDpgev0g"></iframe>
240
- </div>
241
- </div>
242
- <div class="col-sm-6">
243
- <div class="video-header">Waterfall, water splashing</div>
244
- <div class="video-container">
245
- <iframe src="https://youtube.com/embed/4MvjceYnIO0"></iframe>
246
- </div>
247
- </div>
248
- </div>
249
- <br>
250
-
251
- </div>
252
-
253
- </body>
254
- </html>
docs/video_main.html DELETED
@@ -1,98 +0,0 @@
1
- <!DOCTYPE html>
2
- <html lang="en">
3
- <head>
4
- <!-- Google tag (gtag.js) -->
5
- <script async src="https://www.googletagmanager.com/gtag/js?id=G-0JKBJ3WRJZ"></script>
6
- <script>
7
- window.dataLayer = window.dataLayer || [];
8
- function gtag(){dataLayer.push(arguments);}
9
- gtag('js', new Date());
10
- gtag('config', 'G-0JKBJ3WRJZ');
11
- </script>
12
-
13
- <link href='https://fonts.googleapis.com/css?family=Source+Sans+Pro' rel='stylesheet' type='text/css'>
14
- <meta charset="UTF-8">
15
- <title>MMAudio</title>
16
-
17
- <link rel="icon" type="image/png" href="images/icon.png">
18
-
19
- <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
20
- <!-- CSS only -->
21
- <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
22
- integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
23
- <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.7.1/jquery.min.js"></script>
24
-
25
- <link rel="stylesheet" href="style_videos.css">
26
-
27
- <script type="text/javascript">
28
- $(document).ready(function(){
29
- $("#content").load("video_gen.html #moviegen_all");
30
- $("#load_moveigen").click(function(){
31
- $("#content").load("video_gen.html #moviegen_all");
32
- });
33
- $("#load_hunyuan_sora").click(function(){
34
- $("#content").load("video_gen.html #hunyuan_sora_all");
35
- });
36
- $("#load_mochi_ltx").click(function(){
37
- $("#content").load("video_gen.html #mochi_ltx_all");
38
- });
39
- $("#load_vgg1").click(function(){
40
- $("#content").load("video_vgg.html #vgg1");
41
- });
42
- $("#load_vgg2").click(function(){
43
- $("#content").load("video_vgg.html #vgg2");
44
- });
45
- $("#load_vgg3").click(function(){
46
- $("#content").load("video_vgg.html #vgg3");
47
- });
48
- $("#load_vgg4").click(function(){
49
- $("#content").load("video_vgg.html #vgg4");
50
- });
51
- $("#load_vgg5").click(function(){
52
- $("#content").load("video_vgg.html #vgg5");
53
- });
54
- $("#load_vgg6").click(function(){
55
- $("#content").load("video_vgg.html #vgg6");
56
- });
57
- $("#load_vgg_extra").click(function(){
58
- $("#content").load("video_vgg.html #vgg_extra");
59
- });
60
- });
61
- </script>
62
- </head>
63
- <body>
64
- <h1 id="index" style="text-align: center;">Index</h1>
65
- <p><b>(Click on the links to load the corresponding videos)</b> <span style="float:right;"><a href="index.html">Back to project page</a></span></p>
66
-
67
- <ol>
68
- <li>
69
- <a href="#" id="load_moveigen">Comparisons with Movie Gen Audio on Videos Generated by MovieGen</a>
70
- </li>
71
- <li>
72
- <a href="#" id="load_hunyuan_sora">Results on Videos Generated by Hunyuan and Sora</a>
73
- </li>
74
- <li>
75
- <a href="#" id="load_mochi_ltx">Results on Videos Generated by Mochi 1 and LTX-Video</a>
76
- </li>
77
- <li>
78
- On VGGSound
79
- <ol>
80
- <li><a id='load_vgg1' href="#">Example 1: Wolf howling</a></li>
81
- <li><a id='load_vgg2' href="#">Example 2: Striking a golf ball</a></li>
82
- <li><a id='load_vgg3' href="#">Example 3: Hitting a drum</a></li>
83
- <li><a id='load_vgg4' href="#">Example 4: Dog barking</a></li>
84
- <li><a id='load_vgg5' href="#">Example 5: Playing a string instrument</a></li>
85
- <li><a id='load_vgg6' href="#">Example 6: A group of people playing tambourines</a></li>
86
- <li><a id='load_vgg_extra' href="#">Extra results & failure cases</a></li>
87
- </ol>
88
- </li>
89
- </ol>
90
-
91
- <div id="content" class="container-fluid">
92
-
93
- </div>
94
- <br>
95
- <br>
96
-
97
- </body>
98
- </html>
 
 
 
 
docs/video_vgg.html DELETED
@@ -1,452 +0,0 @@
1
- <!DOCTYPE html>
2
- <html lang="en">
3
- <head>
4
- <!-- Google tag (gtag.js) -->
5
- <script async src="https://www.googletagmanager.com/gtag/js?id=G-0JKBJ3WRJZ"></script>
6
- <script>
7
- window.dataLayer = window.dataLayer || [];
8
- function gtag(){dataLayer.push(arguments);}
9
- gtag('js', new Date());
10
- gtag('config', 'G-0JKBJ3WRJZ');
11
- </script>
12
-
13
- <link href='https://fonts.googleapis.com/css?family=Source+Sans+Pro' rel='stylesheet' type='text/css'>
14
- <meta charset="UTF-8">
15
- <title>MMAudio</title>
16
-
17
- <meta name="viewport" content="width=device-width, initial-scale=1">
18
- <!-- CSS only -->
19
- <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet"
20
- integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
21
- <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
22
-
23
- <link rel="stylesheet" href="style_videos.css">
24
- </head>
25
- <body>
26
-
27
- <div id="vgg1">
28
- <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
29
- <p style="overflow: hidden;">
30
- Example 1: Wolf howling.
31
- <span style="float:right;"><a href="#index">Back to index</a></span>
32
- </p>
33
- <div class="row g-1">
34
- <div class="col-sm-3">
35
- <div class="video-header">Ground-truth</div>
36
- <div class="video-container">
37
- <iframe src="https://youtube.com/embed/9J_V74gqMUA"></iframe>
38
- </div>
39
- </div>
40
- <div class="col-sm-3">
41
- <div class="video-header">Ours</div>
42
- <div class="video-container">
43
- <iframe src="https://youtube.com/embed/P6O8IpjErPc"></iframe>
44
- </div>
45
- </div>
46
- <div class="col-sm-3">
47
- <div class="video-header">V2A-Mapper</div>
48
- <div class="video-container">
49
- <iframe src="https://youtube.com/embed/w-5eyqepvTk"></iframe>
50
- </div>
51
- </div>
52
- <div class="col-sm-3">
53
- <div class="video-header">FoleyCrafter</div>
54
- <div class="video-container">
55
- <iframe src="https://youtube.com/embed/VOLfoZlRkzo"></iframe>
56
- </div>
57
- </div>
58
- </div>
59
- <div class="row g-1">
60
- <div class="col-sm-3">
61
- <div class="video-header">Frieren</div>
62
- <div class="video-container">
63
- <iframe src="https://youtube.com/embed/49owKyA5Pa8"></iframe>
64
- </div>
65
- </div>
66
- <div class="col-sm-3">
67
- <div class="video-header">VATT</div>
68
- <div class="video-container">
69
- <iframe src="https://youtube.com/embed/QVtrFgbeGDM"></iframe>
70
- </div>
71
- </div>
72
- <div class="col-sm-3">
73
- <div class="video-header">V-AURA</div>
74
- <div class="video-container">
75
- <iframe src="https://youtube.com/embed/8r0uEfSNjvI"></iframe>
76
- </div>
77
- </div>
78
- <div class="col-sm-3">
79
- <div class="video-header">Seeing and Hearing</div>
80
- <div class="video-container">
81
- <iframe src="https://youtube.com/embed/bn-sLg2qulk"></iframe>
82
- </div>
83
- </div>
84
- </div>
85
- </div>
86
-
87
- <div id="vgg2">
88
- <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
89
- <p style="overflow: hidden;">
90
- Example 2: Striking a golf ball.
91
- <span style="float:right;"><a href="#index">Back to index</a></span>
92
- </p>
93
-
94
- <div class="row g-1">
95
- <div class="col-sm-3">
96
- <div class="video-header">Ground-truth</div>
97
- <div class="video-container">
98
- <iframe src="https://youtube.com/embed/1hwSu42kkho"></iframe>
99
- </div>
100
- </div>
101
- <div class="col-sm-3">
102
- <div class="video-header">Ours</div>
103
- <div class="video-container">
104
- <iframe src="https://youtube.com/embed/kZibDoDCNxI"></iframe>
105
- </div>
106
- </div>
107
- <div class="col-sm-3">
108
- <div class="video-header">V2A-Mapper</div>
109
- <div class="video-container">
110
- <iframe src="https://youtube.com/embed/jgKfLBLhh7Y"></iframe>
111
- </div>
112
- </div>
113
- <div class="col-sm-3">
114
- <div class="video-header">FoleyCrafter</div>
115
- <div class="video-container">
116
- <iframe src="https://youtube.com/embed/Lfsx8mOPcJo"></iframe>
117
- </div>
118
- </div>
119
- </div>
120
- <div class="row g-1">
121
- <div class="col-sm-3">
122
- <div class="video-header">Frieren</div>
123
- <div class="video-container">
124
- <iframe src="https://youtube.com/embed/tz-LpbB0MBc"></iframe>
125
- </div>
126
- </div>
127
- <div class="col-sm-3">
128
- <div class="video-header">VATT</div>
129
- <div class="video-container">
130
- <iframe src="https://youtube.com/embed/RTDUHMi08n4"></iframe>
131
- </div>
132
- </div>
133
- <div class="col-sm-3">
134
- <div class="video-header">V-AURA</div>
135
- <div class="video-container">
136
- <iframe src="https://youtube.com/embed/N-3TDOsPnZQ"></iframe>
137
- </div>
138
- </div>
139
- <div class="col-sm-3">
140
- <div class="video-header">Seeing and Hearing</div>
141
- <div class="video-container">
142
- <iframe src="https://youtube.com/embed/QnsHnLn4gB0"></iframe>
143
- </div>
144
- </div>
145
- </div>
146
- </div>
147
-
148
- <div id="vgg3">
149
- <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
150
- <p style="overflow: hidden;">
151
- Example 3: Hitting a drum.
152
- <span style="float:right;"><a href="#index">Back to index</a></span>
153
- </p>
154
-
155
- <div class="row g-1">
156
- <div class="col-sm-3">
157
- <div class="video-header">Ground-truth</div>
158
- <div class="video-container">
159
- <iframe src="https://youtube.com/embed/0oeIwq77w0Q"></iframe>
160
- </div>
161
- </div>
162
- <div class="col-sm-3">
163
- <div class="video-header">Ours</div>
164
- <div class="video-container">
165
- <iframe src="https://youtube.com/embed/-UtPV9ohuIM"></iframe>
166
- </div>
167
- </div>
168
- <div class="col-sm-3">
169
- <div class="video-header">V2A-Mapper</div>
170
- <div class="video-container">
171
- <iframe src="https://youtube.com/embed/9yivkgN-zwc"></iframe>
172
- </div>
173
- </div>
174
- <div class="col-sm-3">
175
- <div class="video-header">FoleyCrafter</div>
176
- <div class="video-container">
177
- <iframe src="https://youtube.com/embed/kkCsXPOlBvY"></iframe>
178
- </div>
179
- </div>
180
- </div>
181
- <div class="row g-1">
182
- <div class="col-sm-3">
183
- <div class="video-header">Frieren</div>
184
- <div class="video-container">
185
- <iframe src="https://youtube.com/embed/MbNKsVsuvig"></iframe>
186
- </div>
187
- </div>
188
- <div class="col-sm-3">
189
- <div class="video-header">VATT</div>
190
- <div class="video-container">
191
- <iframe src="https://youtube.com/embed/2yYviBjrpBw"></iframe>
192
- </div>
193
- </div>
194
- <div class="col-sm-3">
195
- <div class="video-header">V-AURA</div>
196
- <div class="video-container">
197
- <iframe src="https://youtube.com/embed/9yivkgN-zwc"></iframe>
198
- </div>
199
- </div>
200
- <div class="col-sm-3">
201
- <div class="video-header">Seeing and Hearing</div>
202
- <div class="video-container">
203
- <iframe src="https://youtube.com/embed/6dnyQt4Fuhs"></iframe>
204
- </div>
205
- </div>
206
- </div>
207
- </div>
208
- </div>
209
-
210
- <div id="vgg4">
211
- <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
212
- <p style="overflow: hidden;">
213
- Example 4: Dog barking.
214
- <span style="float:right;"><a href="#index">Back to index</a></span>
215
- </p>
216
-
217
- <div class="row g-1">
218
- <div class="col-sm-3">
219
- <div class="video-header">Ground-truth</div>
220
- <div class="video-container">
221
- <iframe src="https://youtube.com/embed/ckaqvTyMYAw"></iframe>
222
- </div>
223
- </div>
224
- <div class="col-sm-3">
225
- <div class="video-header">Ours</div>
226
- <div class="video-container">
227
- <iframe src="https://youtube.com/embed/_aRndFZzZ-I"></iframe>
228
- </div>
229
- </div>
230
- <div class="col-sm-3">
231
- <div class="video-header">V2A-Mapper</div>
232
- <div class="video-container">
233
- <iframe src="https://youtube.com/embed/mNCISP3LBl0"></iframe>
234
- </div>
235
- </div>
236
- <div class="col-sm-3">
237
- <div class="video-header">FoleyCrafter</div>
238
- <div class="video-container">
239
- <iframe src="https://youtube.com/embed/phZBQ3L7foE"></iframe>
240
- </div>
241
- </div>
242
- </div>
243
- <div class="row g-1">
244
- <div class="col-sm-3">
245
- <div class="video-header">Frieren</div>
246
- <div class="video-container">
247
- <iframe src="https://youtube.com/embed/Sb5Mg1-ORao"></iframe>
248
- </div>
249
- </div>
250
- <div class="col-sm-3">
251
- <div class="video-header">VATT</div>
252
- <div class="video-container">
253
- <iframe src="https://youtube.com/embed/eHmAGOmtDDg"></iframe>
254
- </div>
255
- </div>
256
- <div class="col-sm-3">
257
- <div class="video-header">V-AURA</div>
258
- <div class="video-container">
259
- <iframe src="https://youtube.com/embed/NEGa3krBrm0"></iframe>
260
- </div>
261
- </div>
262
- <div class="col-sm-3">
263
- <div class="video-header">Seeing and Hearing</div>
264
- <div class="video-container">
265
- <iframe src="https://youtube.com/embed/aO0EAXlwE7A"></iframe>
266
- </div>
267
- </div>
268
- </div>
269
- </div>
270
-
271
- <div id="vgg5">
272
- <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
273
- <p style="overflow: hidden;">
274
- Example 5: Playing a string instrument.
275
- <span style="float:right;"><a href="#index">Back to index</a></span>
276
- </p>
277
-
278
- <div class="row g-1">
279
- <div class="col-sm-3">
280
- <div class="video-header">Ground-truth</div>
281
- <div class="video-container">
282
- <iframe src="https://youtube.com/embed/KP1QhWauIOc"></iframe>
283
- </div>
284
- </div>
285
- <div class="col-sm-3">
286
- <div class="video-header">Ours</div>
287
- <div class="video-container">
288
- <iframe src="https://youtube.com/embed/ovaJhWSquYE"></iframe>
289
- </div>
290
- </div>
291
- <div class="col-sm-3">
292
- <div class="video-header">V2A-Mapper</div>
293
- <div class="video-container">
294
- <iframe src="https://youtube.com/embed/N723FS9lcy8"></iframe>
295
- </div>
296
- </div>
297
- <div class="col-sm-3">
298
- <div class="video-header">FoleyCrafter</div>
299
- <div class="video-container">
300
- <iframe src="https://youtube.com/embed/t0N4ZAAXo58"></iframe>
301
- </div>
302
- </div>
303
- </div>
304
- <div class="row g-1">
305
- <div class="col-sm-3">
306
- <div class="video-header">Frieren</div>
307
- <div class="video-container">
308
- <iframe src="https://youtube.com/embed/8YSRs03QNNA"></iframe>
309
- </div>
310
- </div>
311
- <div class="col-sm-3">
312
- <div class="video-header">VATT</div>
313
- <div class="video-container">
314
- <iframe src="https://youtube.com/embed/vOpMz55J1kY"></iframe>
315
- </div>
316
- </div>
317
- <div class="col-sm-3">
318
- <div class="video-header">V-AURA</div>
319
- <div class="video-container">
320
- <iframe src="https://youtube.com/embed/9JHC75vr9h0"></iframe>
321
- </div>
322
- </div>
323
- <div class="col-sm-3">
324
- <div class="video-header">Seeing and Hearing</div>
325
- <div class="video-container">
326
- <iframe src="https://youtube.com/embed/9w0JckNzXmY"></iframe>
327
- </div>
328
- </div>
329
- </div>
330
- </div>
331
-
332
- <div id="vgg6">
333
- <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
334
- <p style="overflow: hidden;">
335
- Example 6: A group of people playing tambourines.
336
- <span style="float:right;"><a href="#index">Back to index</a></span>
337
- </p>
338
-
339
- <div class="row g-1">
340
- <div class="col-sm-3">
341
- <div class="video-header">Ground-truth</div>
342
- <div class="video-container">
343
- <iframe src="https://youtube.com/embed/mx6JLxzUkRc"></iframe>
344
- </div>
345
- </div>
346
- <div class="col-sm-3">
347
- <div class="video-header">Ours</div>
348
- <div class="video-container">
349
- <iframe src="https://youtube.com/embed/oLirHhP9Su8"></iframe>
350
- </div>
351
- </div>
352
- <div class="col-sm-3">
353
- <div class="video-header">V2A-Mapper</div>
354
- <div class="video-container">
355
- <iframe src="https://youtube.com/embed/HkLkHMqptv0"></iframe>
356
- </div>
357
- </div>
358
- <div class="col-sm-3">
359
- <div class="video-header">FoleyCrafter</div>
360
- <div class="video-container">
361
- <iframe src="https://youtube.com/embed/rpHiiODjmNU"></iframe>
362
- </div>
363
- </div>
364
- </div>
365
- <div class="row g-1">
366
- <div class="col-sm-3">
367
- <div class="video-header">Frieren</div>
368
- <div class="video-container">
369
- <iframe src="https://youtube.com/embed/1mVD3fJ0LpM"></iframe>
370
- </div>
371
- </div>
372
- <div class="col-sm-3">
373
- <div class="video-header">VATT</div>
374
- <div class="video-container">
375
- <iframe src="https://youtube.com/embed/yjVFnJiEJlw"></iframe>
376
- </div>
377
- </div>
378
- <div class="col-sm-3">
379
- <div class="video-header">V-AURA</div>
380
- <div class="video-container">
381
- <iframe src="https://youtube.com/embed/neVeMSWtRkU"></iframe>
382
- </div>
383
- </div>
384
- <div class="col-sm-3">
385
- <div class="video-header">Seeing and Hearing</div>
386
- <div class="video-container">
387
- <iframe src="https://youtube.com/embed/EUE7YwyVWz8"></iframe>
388
- </div>
389
- </div>
390
- </div>
391
- </div>
392
-
393
- <div id="vgg_extra">
394
- <h2 style="text-align: center;">Comparisons with state-of-the-art methods in VGGSound</h2>
395
- <p style="overflow: hidden;">
396
- <span style="float:right;"><a href="#index">Back to index</a></span>
397
- </p>
398
-
399
- <div class="row g-1">
400
- <div class="col-sm-3">
401
- <div class="video-header">Moving train</div>
402
- <div class="video-container">
403
- <iframe src="https://youtube.com/embed/Ta6H45rBzJc"></iframe>
404
- </div>
405
- </div>
406
- <div class="col-sm-3">
407
- <div class="video-header">Water splashing</div>
408
- <div class="video-container">
409
- <iframe src="https://youtube.com/embed/hl6AtgHXpb4"></iframe>
410
- </div>
411
- </div>
412
- <div class="col-sm-3">
413
- <div class="video-header">Skateboarding</div>
414
- <div class="video-container">
415
- <iframe src="https://youtube.com/embed/n4sCNi_9buI"></iframe>
416
- </div>
417
- </div>
418
- <div class="col-sm-3">
419
- <div class="video-header">Synchronized clapping</div>
420
- <div class="video-container">
421
- <iframe src="https://youtube.com/embed/oxexfpLn7FE"></iframe>
422
- </div>
423
- </div>
424
- </div>
425
-
426
- <br><br>
427
-
428
- <div id="extra-failure">
429
- <h2 style="text-align: center;">Failure cases</h2>
430
- <p style="overflow: hidden;">
431
- <span style="float:right;"><a href="#index">Back to index</a></span>
432
- </p>
433
-
434
- <div class="row g-1">
435
- <div class="col-sm-6">
436
- <div class="video-header">Human speech</div>
437
- <div class="video-container">
438
- <iframe src="https://youtube.com/embed/nx0CyrDu70Y"></iframe>
439
- </div>
440
- </div>
441
- <div class="col-sm-6">
442
- <div class="video-header">Unfamiliar vision input</div>
443
- <div class="video-container">
444
- <iframe src="https://youtube.com/embed/hfnAqmK3X7w"></iframe>
445
- </div>
446
- </div>
447
- </div>
448
- </div>
449
- </div>
450
-
451
- </body>
452
- </html>
 
 
 
 
{mmaudio → pipeline}/__init__.py RENAMED
File without changes
pipeline/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (178 Bytes).
pipeline/__pycache__/__init__.cpython-38.pyc ADDED
Binary file (166 Bytes).
pipeline/__pycache__/pipeline.cpython-310.pyc ADDED
Binary file (4.62 kB).
pipeline/__pycache__/pipeline.cpython-38.pyc ADDED
Binary file (2.66 kB).
pipeline/__pycache__/step0.cpython-310.pyc ADDED
Binary file (1.48 kB).
pipeline/__pycache__/step0.cpython-38.pyc ADDED
Binary file (1.35 kB).
pipeline/__pycache__/step1.cpython-310.pyc ADDED
Binary file (1.39 kB).
pipeline/__pycache__/step1.cpython-38.pyc ADDED
Binary file (1.3 kB).
pipeline/__pycache__/step2.cpython-310.pyc ADDED
Binary file (1.71 kB).
pipeline/__pycache__/step2.cpython-38.pyc ADDED
Binary file (1.61 kB).
pipeline/__pycache__/step3.cpython-310.pyc ADDED
Binary file (3.62 kB).
pipeline/__pycache__/step3.cpython-38.pyc ADDED
Binary file (3.42 kB).
pipeline/__pycache__/step4.cpython-310.pyc ADDED
Binary file (1.16 kB).
pipeline/__pycache__/step4.cpython-38.pyc ADDED
Binary file (1.08 kB).
pipeline/pipeline.py ADDED
@@ -0,0 +1,175 @@
 
 
 
 
1
+ # coding=utf-8
2
+
3
+ from .step0 import Step0
4
+ from .step1 import Step1
5
+ from .step2 import Step2
6
+ from .step3 import Step3
7
+ from .step4 import Step4
8
+ import logging
9
+ import re
10
+ import os
11
+
12
+ class Pipeline:
13
+ def __init__(self, step0_model_dir, step1_mode, step2_model_dir, step2_mode, step3_mode):
14
+ self.step0 = Step0(step0_model_dir)
15
+ self.step1 = Step1(step1_mode)
16
+ self.step2 = Step2(step2_model_dir, step2_mode)
17
+ self.step3 = Step3(model_type=step3_mode)
18
+ self.step4 = Step4()
19
+ self.step_processors = [self.step1, self.step2, self.step3, self.step4]
20
+ self.log = logging.getLogger(self.__class__.__name__)
21
+ self.log.setLevel(logging.INFO)
22
+
23
+
24
+ def run(self, video_input, output_dir, mode='s4', postp_mode='rep', prompt='', negative_prompt='', duration=10, seed=42):
25
+ step0_resp = self.step0.run(video_input)
26
+ step0_resp_list = re.findall(r'(Step\d:.*?)(?=Step\d:|$)', step0_resp, re.DOTALL)
27
+ step_infos = [step_info.strip().split("\n")[0] for step_info in step0_resp_list]
28
+ step3_temp_dir = os.path.join(output_dir, "remove_vo")
29
+
30
+ step_results = {"temp_final_audio_path": None, "temp_final_video_path": None}
31
+ for step_info in step_infos:
32
+ self.log.info(f"Start to {step_info}")
33
+ if step_info == 'Step1: Generate audio from video.':
34
+ step1_audio_path, step1_video_path = self.step1.run(video_input, output_dir, prompt, negative_prompt, duration=duration, seed=seed)
35
+ step_results["step1_audio_path"] = step1_audio_path
36
+ step_results["step1_video_path"] = step1_video_path
37
+
38
+ elif step_info == 'Step2: Given a video and its generated audio, determine whether the audio contains voice-over.':
39
+ is_vo = self.step2.run(str(step_results["step1_video_path"]))
40
+ step_results["is_vo"] = is_vo
41
+ if not step_results["is_vo"]: # not voice-over
42
+ step_results["temp_final_audio_path"] = step_results["step1_audio_path"]
43
+ step_results["temp_final_video_path"] = step_results["step1_video_path"]
44
+ return step_results
45
+
46
+ elif step_info == 'Step3: Remove voice-over from audio.':
47
+ step3_audio_path = self.step3.run(input_audio_path=step_results["step1_audio_path"],
48
+ temp_store_dir=step3_temp_dir,
49
+ output_dir=output_dir)
50
+ step_results["step3_audio_path"] = step3_audio_path
51
+ if mode == 's3':
52
+ step_results["temp_final_audio_path"] = step_results["step3_audio_path"]
53
+ return step_results
54
+
55
+ elif step_info == 'Step4: Determine whether the audio is silent.':
56
+ is_silent = self.step4.run(step_results["step3_audio_path"])
57
+ step_results["is_silent"] = is_silent
58
+
59
+ else:
60
+ self.log.error(f"Step-by-Step Error !!!!!!!!!")
61
+ return step_results
62
+
63
+ if not step_results["is_silent"]: # not silent
64
+ step_results["temp_final_audio_path"] = step_results["step3_audio_path"]
65
+ else:
66
+ self.log.info(f"Start to post process, use mode: {postp_mode}")
67
+ if postp_mode == "rm":
68
+ step_results["temp_final_audio_path"] = None
69
+ elif postp_mode == "rep":
70
+ step_results["temp_final_audio_path"] = step_results["step1_audio_path"]
71
+ step_results["temp_final_video_path"] = step_results["step1_video_path"]
72
+ elif postp_mode == "neg":
73
+ neg_audio_path, neg_video_path = self.step1.run(video_input, output_dir, prompt, negative_prompt='human voice', duration=duration, seed=seed, is_postp=True)
74
+ step_results["temp_final_audio_path"] = neg_audio_path
75
+ step_results["temp_final_video_path"] = neg_video_path
76
+ else:
77
+ self.log.error(f"Error postp_mode: {postp_mode}")
78
+
79
+ self.log.info(f"After post-processing, audio is {step_results['temp_final_audio_path']} and video is {step_results['temp_final_video_path']}")
80
+ self.log.info(f"Finish Post-Process successfully.\n")
81
+
82
+ return step_results
83
+
84
+
85
+
86
+ def run_for_gradio(self, video_input, output_dir, mode='s4', postp_mode='rep', prompt='', negative_prompt='', duration=10, seed=42):
87
+ step_results = {"temp_final_audio_path": None,
88
+ "temp_final_video_path": None,
89
+ 'log': ''}
90
+
91
+ step0_resp = self.step0.run(video_input)
92
+ step0_resp_list = re.findall(r'(Step\d:.*?)(?=Step\d:|$)', step0_resp, re.DOTALL)
93
+ step_infos = [step_info.strip().split("\n")[0] for step_info in step0_resp_list]
94
+ step3_temp_dir = os.path.join(output_dir, "remove_vo")
95
+
96
+
97
+ for step_info in step_infos:
98
+ self.log.info(f"Start to {step_info}")
99
+ step_results['log'] = f"Start to {step_info}"
100
+ yield step_results
101
+
102
+ if step_info == 'Step1: Generate audio from video.':
103
+ step1_audio_path, step1_video_path = self.step1.run(video_input, output_dir, prompt, negative_prompt, duration=duration, seed=seed)
104
+ step_results["step1_audio_path"] = step1_audio_path
105
+ step_results["step1_video_path"] = step1_video_path
106
+ step_results['log'] = "Step1 completed."
107
+ yield step_results
108
+
109
+ elif step_info == 'Step2: Given a video and its generated audio, determine whether the audio contains voice-over.':
110
+ is_vo = self.step2.run(str(step_results["step1_video_path"]))
111
+ step_results["is_vo"] = is_vo
112
+ step_results['log'] = f"Step2 completed. Contain voice-over? {'Yes' if is_vo else 'No'}"
113
+ yield step_results
114
+ if not step_results["is_vo"]: # not voice-over
115
+ step_results["temp_final_audio_path"] = step_results["step1_audio_path"]
116
+ step_results["temp_final_video_path"] = step_results["step1_video_path"]
117
+ step_results['log'] = "Finish step-by-step v2a."
118
+ yield step_results
119
+
120
+ elif step_info == 'Step3: Remove voice-over from audio.':
121
+ step3_audio_path = self.step3.run(input_audio_path=step_results["step1_audio_path"],
122
+ temp_store_dir=step3_temp_dir,
123
+ output_dir=output_dir)
124
+ step_results["step3_audio_path"] = step3_audio_path
125
+ step_results['log'] = f"Step3 completed."
126
+ yield step_results
127
+ if mode == 's3':
128
+ step_results["temp_final_audio_path"] = step_results["step3_audio_path"]
129
+ step_results['log'] = "Finish step-by-step v2a."
130
+ yield step_results
131
+
132
+ elif step_info == 'Step4: Determine whether the audio is silent.':
133
+ is_silent = self.step4.run(step_results["step3_audio_path"])
134
+ step_results["is_silent"] = is_silent
135
+ step_results['log'] = f"Step4 completed. Silent? {'Yes' if is_silent else 'No'}"
136
+ yield step_results
137
+
138
+ else:
139
+ self.log.error(f"Step-by-Step Error !!!!!!!!!")
140
+ step_results['log'] = f"Step-by-Step Error !!!!!!!!!"
141
+ yield step_results
142
+ step_results['log'] = "Finish step-by-step v2a."
143
+ yield step_results
144
+
145
+ if not step_results["is_silent"]: # not silent
146
+ step_results["temp_final_audio_path"] = step_results["step3_audio_path"]
147
+ step_results['log'] = "Finish step-by-step v2a."
148
+ yield step_results
149
+
150
+ else:
151
+ step_results['log'] = f"Post-processing with mode: {postp_mode}"
152
+ yield step_results
153
+ self.log.info(f"Start to post process, use mode: {postp_mode}")
154
+
155
+ if postp_mode == "rm":
156
+ step_results["temp_final_audio_path"] = None
157
+ elif postp_mode == "rep":
158
+ step_results["temp_final_audio_path"] = step_results["step1_audio_path"]
159
+ step_results["temp_final_video_path"] = step_results["step1_video_path"]
160
+ elif postp_mode == "neg":
161
+ neg_audio_path, neg_video_path = self.step1.run(video_input, output_dir, prompt, negative_prompt='human voice', duration=duration, seed=seed, is_postp=True)
162
+ step_results["temp_final_audio_path"] = neg_audio_path
163
+ step_results["temp_final_video_path"] = neg_video_path
164
+ else:
165
+ self.log.error(f"Error postp_mode: {postp_mode}")
166
+
167
+ self.log.info(f"After post-processing, audio is {step_results['temp_final_audio_path']} and video is {step_results['temp_final_video_path']}")
168
+ self.log.info(f"Finish Post-Process successfully.\n")
169
+ step_results['log'] = f"Post-processing completed."
170
+ yield step_results
171
+
172
+
173
+ step_results['log'] = "Finish step-by-step v2a."
174
+ yield step_results
175
+
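The new `Pipeline` class above is the core of this commit: Step0 plans the chain-of-thought, Step1 runs video-to-audio generation, Step2 checks the result for voice-over, Step3 separates the voice-over out, and Step4 verifies that what remains is not silent. The sketch below shows how it might be driven end to end; the checkpoint directories and the `mmaudio_large_44k` variant string are placeholders assumed for illustration, not paths shipped with this commit.

```python
# Minimal usage sketch of the new Pipeline (hypothetical checkpoint paths).
from pipeline.pipeline import Pipeline

pipe = Pipeline(
    step0_model_dir="pretrained/videollama2-cot",  # assumed path to the CoT planner
    step1_mode="mmaudio_large_44k",                # any "mmaudio_*" variant or "foleycrafter"
    step2_model_dir="pretrained/videollama2-av",   # assumed path to the voice-over judge
    step2_mode="cot",
    step3_mode="bs_roformer",
)

results = pipe.run(
    video_input="examples/demo.mp4",
    output_dir="outputs/demo",
    mode="s4",          # run through Step4; "s3" stops after voice-over removal
    postp_mode="rep",   # "rm" drops the audio, "rep" keeps Step1's output, "neg" reruns Step1 with a negative prompt
    prompt="",
    negative_prompt="",
    duration=10,
    seed=42,
)
print(results["temp_final_audio_path"], results["temp_final_video_path"])
```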
pipeline/step0.py ADDED
@@ -0,0 +1,39 @@
 
 
 
 
1
+ # coding=utf-8
2
+ # CoT generate step-by-step
3
+
4
+ from third_party.VideoLLaMA2.videollama2 import model_init, mm_infer
5
+ import logging
6
+
7
+ class Step0:
8
+ def __init__(self, model_path, modal_type='v'):
9
+ self.log = logging.getLogger(self.__class__.__name__)
10
+ self.log.setLevel(logging.INFO)
11
+
12
+ self.model, self.processor, self.tokenizer = model_init(model_path)
13
+ self.modal_type=modal_type
14
+ if modal_type == "a":
15
+ self.model.model.vision_tower = None
16
+ elif modal_type == "v":
17
+ self.model.model.audio_tower = None
18
+ elif modal_type == "av":
19
+ pass
20
+ else:
21
+ raise NotImplementedError
22
+ self.modal = 'audio' if modal_type == "a" else "video"
23
+ self.question = f"Generate high-quality audio from video step-by-step."
24
+ self.preprocess = self.processor[self.modal]
25
+
26
+ def run(self, video_path):
27
+ self.log.info("######################################################################################################")
28
+ self.log.info("Generate high-quality audio from video step-by-step...")
29
+ audio_video_tensor = self.preprocess(video_path, va=False)
30
+ output = mm_infer(
31
+ audio_video_tensor,
32
+ self.question,
33
+ model=self.model,
34
+ tokenizer=self.tokenizer,
35
+ modal=self.modal,
36
+ do_sample=False,
37
+ )
38
+
39
+ return output
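`Step0` returns a free-form chain-of-thought string, and `Pipeline` slices it into per-step headers with the regex shown above. A small self-contained illustration of that parsing follows; the sample response text is invented for the example, since the pipeline only keys on the `StepN:` headers.

```python
import re

# Hypothetical Step0 output; the real model's wording may differ.
resp = (
    "Step1: Generate audio from video.\nThe scene shows a dog barking.\n"
    "Step2: Given a video and its generated audio, determine whether the audio contains voice-over.\n"
    "Step3: Remove voice-over from audio.\n"
    "Step4: Determine whether the audio is silent.\n"
)

# Same regex as Pipeline.run(): split the response into StepN blocks.
blocks = re.findall(r'(Step\d:.*?)(?=Step\d:|$)', resp, re.DOTALL)
# Keep only the header line of each block.
step_infos = [b.strip().split("\n")[0] for b in blocks]
print(step_infos)
# ['Step1: Generate audio from video.', 'Step2: ...', 'Step3: ...', 'Step4: ...']
```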
pipeline/step1.py ADDED
@@ -0,0 +1,36 @@
 
 
 
 
1
+ # coding=utf-8
2
+ # V2A
3
+ import logging
4
+
5
+
6
+ class Step1:
7
+ def __init__(self, step1_mode):
8
+ self.log = logging.getLogger(self.__class__.__name__)
9
+ self.log.setLevel(logging.INFO)
10
+
11
+ if step1_mode.startswith('mmaudio'):
12
+ from v2a_models.v2a_mmaudio import V2A_MMAudio
13
+ variant = step1_mode.replace("mmaudio_", "")
14
+ self.v2a_model = V2A_MMAudio(variant)
15
+ elif step1_mode == "foleycrafter":
16
+ from v2a_models.v2a_foleycrafter import V2A_FoleyCrafter
17
+ self.v2a_model = V2A_FoleyCrafter()
18
+ else:
19
+ self.log.error(f"Error step1_mode: {step1_mode}")
20
+
21
+
22
+
23
+ def run(self, video_path, output_dir, prompt='', negative_prompt='', duration=10, seed=42, is_postp=False,):
24
+ # self.log.info("Step1: Generate audio from video.")
25
+ step1_audio_path, step1_video_path = self.v2a_model.generate_audio(
26
+ video_path=video_path,
27
+ output_dir=output_dir,
28
+ prompt=prompt,
29
+ negative_prompt=negative_prompt,
30
+ duration=duration,
31
+ seed=seed,
32
+ is_postp=is_postp)
33
+
34
+ self.log.info(f"The audio generated by Step1 is in {step1_audio_path}, and the video is in {step1_video_path}")
35
+ self.log.info("Finish Step1 successfully.\n")
36
+ return step1_audio_path, step1_video_path
pipeline/step2.py ADDED
@@ -0,0 +1,52 @@
 
 
 
 
1
+ # coding=utf-8
2
+ # judge voice-over
3
+
4
+ from third_party.VideoLLaMA2.videollama2 import model_init, mm_infer
5
+ import logging
6
+
7
+ class Step2:
8
+ def __init__(self, model_path, step2_mode, modal_type="av"):
9
+ self.log = logging.getLogger(self.__class__.__name__)
10
+ self.log.setLevel(logging.INFO)
11
+
12
+ self.model, self.processor, self.tokenizer = model_init(model_path)
13
+ self.modal_type=modal_type
14
+ if modal_type == "a":
15
+ self.model.model.vision_tower = None
16
+ elif modal_type == "v":
17
+ self.model.model.audio_tower = None
18
+ elif modal_type == "av":
19
+ pass
20
+ else:
21
+ raise NotImplementedError
22
+ self.modal = 'audio' if modal_type == "a" else "video"
23
+
24
+ self.question = f"Given a video and its corresponding audio, determine whether the audio contains voice-over? Options: A. Yes, B. No. Choose A or B."
25
+ self.preprocess = self.processor[self.modal]
26
+
27
+ self.step2_mode = step2_mode
28
+
29
+ def run(self, video_audio_path):
30
+ # self.log.info("Step2: Given a video and its generated audio, determine whether the audio contains voice-over.")
31
+ audio_video_tensor = self.preprocess(video_audio_path, va=True)
32
+ output = mm_infer(
33
+ audio_video_tensor,
34
+ self.question,
35
+ model=self.model,
36
+ tokenizer=self.tokenizer,
37
+ modal=self.modal,
38
+ do_sample=False,
39
+ )
40
+ # print("oooooooooooooooooooooo: ", output)
41
+
42
+ if self.step2_mode == "cot":
43
+ output = output.split("<CONCLUSION>")[-1][1]
44
+ print("1111111111111111111111111: ", output)
45
+ output = (output == "A")
46
+
47
+ if output:
48
+ self.log.info(f"The video generated by Step1 ({video_audio_path}) contains voice-over.")
49
+ else:
50
+ self.log.info(f"The video generated by Step1 ({video_audio_path}) does not contain voice-over.")
51
+ self.log.info("Finish Step2 successfully.\n")
52
+ return output
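When `step2_mode == 'cot'`, the judge model is expected to end its answer with a `<CONCLUSION>` tag, and the single character at index 1 after the tag is taken as the chosen option. A small illustration of that slicing is below; the response string is made up, and the real model's formatting may differ.

```python
# Illustrative CoT response; Step2 keeps only the letter right after "<CONCLUSION>".
resp = "The audio contains narration unrelated to on-screen events. <CONCLUSION> A </CONCLUSION>"

tail = resp.split("<CONCLUSION>")[-1]   # " A </CONCLUSION>"
answer = tail[1]                        # "A"  (index 0 is the leading space)
contains_voice_over = (answer == "A")
print(contains_voice_over)              # True
```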
pipeline/step3.py ADDED
@@ -0,0 +1,129 @@
 
 
 
 
1
+ # coding=utf-8
2
+ # Remove voice-over
3
+ import logging
4
+ import argparse
5
+ import subprocess
6
+ import librosa
7
+ import os
8
+ import torch
9
+ import soundfile as sf
10
+ import numpy as np
11
+
12
+
13
+ # Using the embedded version of Python can also correctly import the utils module.
14
+ # current_dir = os.path.dirname(os.path.abspath(__file__))
15
+ # sys.path.append(current_dir)
16
+
17
+ from third_party.MusicSourceSeparationTraining.utils import demix, load_config, normalize_audio, denormalize_audio, draw_spectrogram
18
+ from third_party.MusicSourceSeparationTraining.utils import prefer_target_instrument, apply_tta, load_start_checkpoint
19
+ from third_party.MusicSourceSeparationTraining.models.bs_roformer import BSRoformer
20
+ import warnings
21
+
22
+ warnings.filterwarnings("ignore")
23
+
24
+ model_base_dir = "pretrained/remove_vo/checkpoints"
25
+ MODEL_PATHS = {"bs_roformer": [f"{model_base_dir}/model_bs_roformer_ep_317_sdr_12.9755.ckpt", f"{model_base_dir}/model_bs_roformer_ep_317_sdr_12.9755.yaml"]}
26
+
27
+
28
+ class Step3:
29
+ def __init__(self, model_type="bs_roformer"):
30
+ model_path, config_path = MODEL_PATHS[model_type]
31
+
32
+ self.log = logging.getLogger(self.__class__.__name__)
33
+ self.log.setLevel(logging.INFO)
34
+ self.device = 'cpu'
35
+ if torch.cuda.is_available():
36
+ self.device = 'cuda'
37
+ elif torch.backends.mps.is_available():
38
+ self.device = 'mps'
39
+ else:
40
+ self.log.warning('CUDA/MPS are not available, running on CPU')
41
+
42
+ self.model_type = model_type
43
+
44
+ # self.model, self.config = get_model_from_config(model_type, config_path)
45
+ self.config = load_config(model_type, config_path)
46
+ self.model = BSRoformer(**dict(self.config.model))
47
+ args = argparse.Namespace()
48
+ args.start_check_point = model_path
49
+ args.model_type = model_type
50
+ args.lora_checkpoint = ''
51
+ load_start_checkpoint(args, self.model, type_='inference')
52
+ self.model = self.model.to(self.device)
53
+ self.sample_rate = getattr(self.config.audio, 'sample_rate', 44100)
54
+
55
+
56
+ def run(self,
57
+ input_audio_path,
58
+ temp_store_dir, # for remove result dir
59
+ output_dir, # for final dir
60
+ disable_detailed_pbar: bool=False,
61
+ use_tta: bool= False,
62
+ extract_instrumental: bool=True,
63
+ codec="wav",
64
+ subtype="FLOAT",
65
+ draw_spectro=0,
66
+ ):
67
+
68
+ # self.log.info("Step3: Remove voice-over from audio.")
69
+
70
+ os.makedirs(output_dir, exist_ok=True)
71
+
72
+ if disable_detailed_pbar:
73
+ detailed_pbar = False
74
+ else:
75
+ detailed_pbar = True
76
+
77
+ instruments = prefer_target_instrument(self.config)[:]
78
+
79
+ mix, sr = librosa.load(input_audio_path, sr=self.sample_rate, mono=False)
80
+ # If mono audio we must adjust it depending on model
81
+ if len(mix.shape) == 1:
82
+ mix = np.expand_dims(mix, axis=0)
83
+ if 'num_channels' in self.config.audio:
84
+ if self.config.audio['num_channels'] == 2:
85
+ print(f'Convert mono track to stereo...')
86
+ mix = np.concatenate([mix, mix], axis=0)
87
+
88
+ mix_orig = mix.copy()
89
+ if 'normalize' in self.config.inference:
90
+ if self.config.inference['normalize'] is True:
91
+ mix, norm_params = normalize_audio(mix)
92
+
93
+ waveforms_orig = demix(self.config, self.model, mix, self.device, model_type=self.model_type, pbar=detailed_pbar)
94
+ if use_tta:
95
+ waveforms_orig = apply_tta(self.config, self.model, mix, waveforms_orig, self.device, self.model_type)
96
+
97
+ if extract_instrumental:
98
+ instr = 'vocals' if 'vocals' in instruments else instruments[0]
99
+ waveforms_orig['instrumental'] = mix_orig - waveforms_orig[instr]
100
+ if 'instrumental' not in instruments:
101
+ instruments.append('instrumental')
102
+
103
+ file_name = os.path.splitext(os.path.basename(input_audio_path))[0].replace(".step1", "")
104
+ temp_output_dir = os.path.join(temp_store_dir, file_name)
105
+ os.makedirs(temp_output_dir, exist_ok=True)
106
+
107
+ for instr in instruments:
108
+ estimates = waveforms_orig[instr]
109
+ if 'normalize' in self.config.inference:
110
+ if self.config.inference['normalize'] is True:
111
+ estimates = denormalize_audio(estimates, norm_params)
112
+
113
+ output_path = os.path.join(temp_output_dir, f"{instr}.{codec}")
114
+ sf.write(output_path, estimates.T, sr, subtype=subtype)
115
+ if draw_spectro > 0:
116
+ output_img_path = os.path.join(temp_output_dir, f"{instr}.jpg")
117
+ draw_spectrogram(estimates.T, sr, draw_spectro, output_img_path)
118
+
119
+
120
+ instrumental_file = os.path.join(temp_output_dir, 'instrumental.wav')
121
+ step3_audio_path = f"{output_dir}/{file_name}.step3.wav"
122
+ subprocess.run(['cp', instrumental_file, step3_audio_path])
123
+
124
+ self.log.info(f"The voice-over has been removed, and the audio is saved in {step3_audio_path}")
125
+ self.log.info("Finish Step3 successfully.\n")
126
+ return step3_audio_path
127
+
128
+
129
+
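Standalone, `Step3` wraps the BS-RoFormer separator from `MusicSourceSeparationTraining`: it separates the vocal stem, reconstructs the instrumental as `mix - vocals`, and copies `instrumental.wav` next to the other outputs as `<name>.step3.wav`. A usage sketch under the assumption that the checkpoint listed in `MODEL_PATHS` has been downloaded; the file paths are placeholders.

```python
# Sketch only: requires the bs_roformer checkpoint under pretrained/remove_vo/checkpoints/.
from pipeline.step3 import Step3

remover = Step3(model_type="bs_roformer")
clean_path = remover.run(
    input_audio_path="outputs/demo/demo.step1.wav",  # hypothetical Step1 output
    temp_store_dir="outputs/demo/remove_vo",         # per-stem files land here
    output_dir="outputs/demo",                       # final *.step3.wav is copied here
)
print(clean_path)  # e.g. outputs/demo/demo.step3.wav
```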
pipeline/step4.py ADDED
@@ -0,0 +1,31 @@
 
 
 
 
1
+ # coding=utf-8
2
+ # Silence detection
3
+ import logging
4
+ import librosa
5
+ import numpy as np
6
+
7
+
8
+ class Step4:
9
+ def __init__(self):
10
+ self.log = logging.getLogger(self.__class__.__name__)
11
+ self.log.setLevel(logging.INFO)
12
+
13
+
14
+ def run(self,
15
+ audio_path,
16
+ silence_thresh=-50,
17
+ duration_thresh=0.9):
18
+ # self.log.info("Step4: Determine whether the audio is silent.")
19
+ y, sr = librosa.load(audio_path, sr=None)
20
+ energy = librosa.feature.rms(y=y)[0]
21
+ energy_db = librosa.amplitude_to_db(energy)
22
+ silent_ratio = np.sum(energy_db < silence_thresh) / len(energy_db)
23
+ is_silent = silent_ratio > duration_thresh
24
+
25
+ if is_silent:
26
+ self.log.info(f"The audio after removing the voiceover ({audio_path}) is silent.")
27
+ else:
28
+ self.log.info(f"The audio after removing the voiceover ({audio_path}) is not silent.")
29
+ self.log.info("Finish Step4 successfully.\n")
30
+
31
+ return is_silent
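The silence test in `Step4` is frame-based rather than a single global level: it computes per-frame RMS energy, converts it to dB, and flags the clip as silent when more than `duration_thresh` (90%) of the frames fall below `silence_thresh` (-50 dB). A small self-contained check of the same rule on synthetic audio:

```python
import numpy as np
import librosa

sr = 16000
# Two seconds of near-silence followed by a short 440 Hz burst:
# the vast majority of frames stay below -50 dB.
y = np.concatenate([
    1e-4 * np.random.randn(2 * sr),
    0.5 * np.sin(2 * np.pi * 440 * np.arange(int(0.05 * sr)) / sr),
]).astype(np.float32)

energy_db = librosa.amplitude_to_db(librosa.feature.rms(y=y)[0])
silent_ratio = np.mean(energy_db < -50)
print(silent_ratio, silent_ratio > 0.9)  # mostly-silent clip -> flagged as silent
```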
pyproject.toml DELETED
@@ -1,52 +0,0 @@
1
- [build-system]
2
- requires = ["hatchling"]
3
- build-backend = "hatchling.build"
4
-
5
- [tool.hatch.metadata]
6
- allow-direct-references = true
7
-
8
- [tool.yapf]
9
- based_on_style = "pep8"
10
- indent_width = 4
11
- column_limit = 100
12
-
13
- [project]
14
- name = "mmaudio"
15
- version = "1.0.0"
16
- authors = [{ name = "Rex Cheng", email = "[email protected]" }]
17
- description = ""
18
- readme = "README.md"
19
- requires-python = ">=3.9"
20
- classifiers = [
21
- "Programming Language :: Python :: 3",
22
- "Operating System :: OS Independent",
23
- ]
24
- dependencies = [
25
- 'torch >= 2.5.1',
26
- 'python-dotenv',
27
- 'cython',
28
- 'gitpython >= 3.1',
29
- 'tensorboard >= 2.11',
30
- 'numpy >= 1.21, <2.1',
31
- 'Pillow >= 9.5',
32
- 'opencv-python >= 4.8',
33
- 'scipy >= 1.7',
34
- 'tqdm >= 4.66.1',
35
- 'gradio >= 3.34',
36
- 'einops >= 0.6',
37
- 'hydra-core >= 1.3.2',
38
- 'requests',
39
- 'torchdiffeq',
40
- 'librosa >= 0.8.1',
41
- 'nitrous-ema',
42
- 'safetensors',
43
- 'auraloss',
44
- 'hydra_colorlog',
45
- 'tensordict',
46
- 'colorlog',
47
- 'open_clip_torch',
48
- 'soundfile',
49
- ]
50
-
51
- [tool.hatch.build.targets.wheel]
52
- packages = ["mmaudio"]
 
 
 
 
requirements.txt.bak DELETED
@@ -1,27 +0,0 @@
1
- torch == 2.4.0
2
- torchvision
3
- torchaudio
4
- python-dotenv
5
- cython
6
- gitpython >= 3.1
7
- tensorboard >= 2.11
8
- numpy >= 1.21, <2.1
9
- Pillow >= 9.5
10
- opencv-python >= 4.8
11
- scipy >= 1.7
12
- tqdm >= 4.66.1
13
- gradio >= 3.34
14
- einops >= 0.6
15
- hydra-core >= 1.3.2
16
- requests
17
- torchdiffeq
18
- librosa >= 0.8.1
19
- nitrous-ema
20
- safetensors
21
- auraloss
22
- hydra_colorlog
23
- tensordict
24
- colorlog
25
- open_clip_torch
26
- soundfile
27
- av
 
 
 
 
third_party/MMAudio/.gitignore ADDED
@@ -0,0 +1,146 @@
 
 
 
 
1
+ run_*.sh
2
+ log/
3
+ saves
4
+ saves/
5
+ weights/
6
+ weights
7
+ output/
8
+ output
9
+ pretrained/
10
+ workspace
11
+ workspace/
12
+ ext_weights/
13
+ ext_weights
14
+ .checkpoints/
15
+ .vscode/
16
+ training/example_output/
17
+
18
+ # Byte-compiled / optimized / DLL files
19
+ __pycache__/
20
+ *.py[cod]
21
+ *$py.class
22
+
23
+ # C extensions
24
+ *.so
25
+
26
+ # Distribution / packaging
27
+ .Python
28
+ build/
29
+ develop-eggs/
30
+ dist/
31
+ downloads/
32
+ eggs/
33
+ .eggs/
34
+ lib/
35
+ lib64/
36
+ parts/
37
+ sdist/
38
+ var/
39
+ wheels/
40
+ pip-wheel-metadata/
41
+ share/python-wheels/
42
+ *.egg-info/
43
+ .installed.cfg
44
+ *.egg
45
+ MANIFEST
46
+
47
+ # PyInstaller
48
+ # Usually these files are written by a python script from a template
49
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
50
+ *.manifest
51
+ *.spec
52
+
53
+ # Installer logs
54
+ pip-log.txt
55
+ pip-delete-this-directory.txt
56
+
57
+ # Unit test / coverage reports
58
+ htmlcov/
59
+ .tox/
60
+ .nox/
61
+ .coverage
62
+ .coverage.*
63
+ .cache
64
+ nosetests.xml
65
+ coverage.xml
66
+ *.cover
67
+ *.py,cover
68
+ .hypothesis/
69
+ .pytest_cache/
70
+
71
+ # Translations
72
+ *.mo
73
+ *.pot
74
+
75
+ # Django stuff:
76
+ *.log
77
+ local_settings.py
78
+ db.sqlite3
79
+ db.sqlite3-journal
80
+
81
+ # Flask stuff:
82
+ instance/
83
+ .webassets-cache
84
+
85
+ # Scrapy stuff:
86
+ .scrapy
87
+
88
+ # Sphinx documentation
89
+ docs/_build/
90
+
91
+ # PyBuilder
92
+ target/
93
+
94
+ # Jupyter Notebook
95
+ .ipynb_checkpoints
96
+
97
+ # IPython
98
+ profile_default/
99
+ ipython_config.py
100
+
101
+ # pyenv
102
+ .python-version
103
+
104
+ # pipenv
105
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
106
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
107
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
108
+ # install all needed dependencies.
109
+ #Pipfile.lock
110
+
111
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
112
+ __pypackages__/
113
+
114
+ # Celery stuff
115
+ celerybeat-schedule
116
+ celerybeat.pid
117
+
118
+ # SageMath parsed files
119
+ *.sage.py
120
+
121
+ # Environments
122
+ .env
123
+ .venv
124
+ env/
125
+ venv/
126
+ ENV/
127
+ env.bak/
128
+ venv.bak/
129
+
130
+ # Spyder project settings
131
+ .spyderproject
132
+ .spyproject
133
+
134
+ # Rope project settings
135
+ .ropeproject
136
+
137
+ # mkdocs documentation
138
+ /site
139
+
140
+ # mypy
141
+ .mypy_cache/
142
+ .dmypy.json
143
+ dmypy.json
144
+
145
+ # Pyre type checker
146
+ .pyre/
third_party/MMAudio/LICENSE ADDED
@@ -0,0 +1,21 @@
 
 
 
 
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Sony Research Inc.
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
{mmaudio/data → third_party/MMAudio/mmaudio}/__init__.py RENAMED
File without changes
{mmaudio/ext/bigvgan_v2 → third_party/MMAudio/mmaudio/data}/__init__.py RENAMED
File without changes
{mmaudio → third_party/MMAudio/mmaudio}/data/av_utils.py RENAMED
@@ -1,7 +1,7 @@
1
  from dataclasses import dataclass
2
  from fractions import Fraction
3
  from pathlib import Path
4
- from typing import Optional
5
 
6
  import av
7
  import numpy as np
@@ -15,7 +15,7 @@ class VideoInfo:
15
  fps: Fraction
16
  clip_frames: torch.Tensor
17
  sync_frames: torch.Tensor
18
- all_frames: Optional[list[np.ndarray]]
19
 
20
  @property
21
  def height(self):
@@ -25,9 +25,35 @@ class VideoInfo:
25
  def width(self):
26
  return self.all_frames[0].shape[1]
27
 
 
 
 
 
 
 
 
 
 
 
28
 
29
- def read_frames(video_path: Path, list_of_fps: list[float], start_sec: float, end_sec: float,
30
- need_all_frames: bool) -> tuple[list[np.ndarray], list[np.ndarray], Fraction]:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
  output_frames = [[] for _ in list_of_fps]
32
  next_frame_time_for_each_fps = [0.0 for _ in list_of_fps]
33
  time_delta_for_each_fps = [1 / fps for fps in list_of_fps]
 
1
  from dataclasses import dataclass
2
  from fractions import Fraction
3
  from pathlib import Path
4
+ from typing import Optional, List, Tuple
5
 
6
  import av
7
  import numpy as np
 
15
  fps: Fraction
16
  clip_frames: torch.Tensor
17
  sync_frames: torch.Tensor
18
+ all_frames: Optional[List[np.ndarray]]
19
 
20
  @property
21
  def height(self):
 
25
  def width(self):
26
  return self.all_frames[0].shape[1]
27
 
28
+ @classmethod
29
+ def from_image_info(cls, image_info: 'ImageInfo', duration_sec: float,
30
+ fps: Fraction) -> 'VideoInfo':
31
+ num_frames = int(duration_sec * fps)
32
+ all_frames = [image_info.original_frame] * num_frames
33
+ return cls(duration_sec=duration_sec,
34
+ fps=fps,
35
+ clip_frames=image_info.clip_frames,
36
+ sync_frames=image_info.sync_frames,
37
+ all_frames=all_frames)
38
 
39
+
40
+ @dataclass
41
+ class ImageInfo:
42
+ clip_frames: torch.Tensor
43
+ sync_frames: torch.Tensor
44
+ original_frame: Optional[np.ndarray]
45
+
46
+ @property
47
+ def height(self):
48
+ return self.original_frame.shape[0]
49
+
50
+ @property
51
+ def width(self):
52
+ return self.original_frame.shape[1]
53
+
54
+
55
+ def read_frames(video_path: Path, list_of_fps: List[float], start_sec: float, end_sec: float,
56
+ need_all_frames: bool) -> Tuple[List[np.ndarray], List[np.ndarray], Fraction]:
57
  output_frames = [[] for _ in list_of_fps]
58
  next_frame_time_for_each_fps = [0.0 for _ in list_of_fps]
59
  time_delta_for_each_fps = [1 / fps for fps in list_of_fps]
third_party/MMAudio/mmaudio/data/data_setup.py ADDED
@@ -0,0 +1,174 @@
 
 
 
1
+ import logging
2
+ import random
3
+
4
+ import numpy as np
5
+ import torch
6
+ from omegaconf import DictConfig
7
+ from torch.utils.data import DataLoader, Dataset
8
+ from torch.utils.data.dataloader import default_collate
9
+ from torch.utils.data.distributed import DistributedSampler
10
+
11
+ from mmaudio.data.eval.audiocaps import AudioCapsData
12
+ from mmaudio.data.eval.video_dataset import MovieGen, VGGSound
13
+ from mmaudio.data.extracted_audio import ExtractedAudio
14
+ from mmaudio.data.extracted_vgg import ExtractedVGG
15
+ from mmaudio.data.mm_dataset import MultiModalDataset
16
+ from mmaudio.utils.dist_utils import local_rank
17
+
18
+ log = logging.getLogger()
19
+
20
+
21
+ # Re-seed randomness every time we start a worker
22
+ def worker_init_fn(worker_id: int):
23
+ worker_seed = torch.initial_seed() % (2**31) + worker_id + local_rank * 1000
24
+ np.random.seed(worker_seed)
25
+ random.seed(worker_seed)
26
+ log.debug(f'Worker {worker_id} re-seeded with seed {worker_seed} in rank {local_rank}')
27
+
28
+
29
+ def load_vgg_data(cfg: DictConfig, data_cfg: DictConfig) -> Dataset:
30
+ dataset = ExtractedVGG(tsv_path=data_cfg.tsv,
31
+ data_dim=cfg.data_dim,
32
+ premade_mmap_dir=data_cfg.memmap_dir)
33
+
34
+ return dataset
35
+
36
+
37
+ def load_audio_data(cfg: DictConfig, data_cfg: DictConfig) -> Dataset:
38
+ dataset = ExtractedAudio(tsv_path=data_cfg.tsv,
39
+ data_dim=cfg.data_dim,
40
+ premade_mmap_dir=data_cfg.memmap_dir)
41
+
42
+ return dataset
43
+
44
+
45
+ def setup_training_datasets(cfg: DictConfig) -> tuple[Dataset, DistributedSampler, DataLoader]:
46
+ if cfg.mini_train:
47
+ vgg = load_vgg_data(cfg, cfg.data.ExtractedVGG_val)
48
+ audiocaps = load_audio_data(cfg, cfg.data.AudioCaps)
49
+ dataset = MultiModalDataset([vgg], [audiocaps])
50
+ if cfg.example_train:
51
+ video = load_vgg_data(cfg, cfg.data.Example_video)
52
+ audio = load_audio_data(cfg, cfg.data.Example_audio)
53
+ dataset = MultiModalDataset([video], [audio])
54
+ else:
55
+ # load the largest one first
56
+ freesound = load_audio_data(cfg, cfg.data.FreeSound)
57
+ vgg = load_vgg_data(cfg, cfg.data.ExtractedVGG)
58
+ audiocaps = load_audio_data(cfg, cfg.data.AudioCaps)
59
+ audioset_sl = load_audio_data(cfg, cfg.data.AudioSetSL)
60
+ bbcsound = load_audio_data(cfg, cfg.data.BBCSound)
61
+ clotho = load_audio_data(cfg, cfg.data.Clotho)
62
+ dataset = MultiModalDataset([vgg] * cfg.vgg_oversample_rate,
63
+ [audiocaps, audioset_sl, bbcsound, freesound, clotho])
64
+
65
+ batch_size = cfg.batch_size
66
+ num_workers = cfg.num_workers
67
+ pin_memory = cfg.pin_memory
68
+ sampler, loader = construct_loader(dataset,
69
+ batch_size,
70
+ num_workers,
71
+ shuffle=True,
72
+ drop_last=True,
73
+ pin_memory=pin_memory)
74
+
75
+ return dataset, sampler, loader
76
+
77
+
78
+ def setup_test_datasets(cfg):
79
+ dataset = load_vgg_data(cfg, cfg.data.ExtractedVGG_test)
80
+
81
+ batch_size = cfg.batch_size
82
+ num_workers = cfg.num_workers
83
+ pin_memory = cfg.pin_memory
84
+ sampler, loader = construct_loader(dataset,
85
+ batch_size,
86
+ num_workers,
87
+ shuffle=False,
88
+ drop_last=False,
89
+ pin_memory=pin_memory)
90
+
91
+ return dataset, sampler, loader
92
+
93
+
94
+ def setup_val_datasets(cfg: DictConfig) -> tuple[Dataset, DataLoader, DataLoader]:
95
+ if cfg.example_train:
96
+ dataset = load_vgg_data(cfg, cfg.data.Example_video)
97
+ else:
98
+ dataset = load_vgg_data(cfg, cfg.data.ExtractedVGG_val)
99
+
100
+ val_batch_size = cfg.batch_size
101
+ val_eval_batch_size = cfg.eval_batch_size
102
+ num_workers = cfg.num_workers
103
+ pin_memory = cfg.pin_memory
104
+ _, val_loader = construct_loader(dataset,
105
+ val_batch_size,
106
+ num_workers,
107
+ shuffle=False,
108
+ drop_last=False,
109
+ pin_memory=pin_memory)
110
+ _, eval_loader = construct_loader(dataset,
111
+ val_eval_batch_size,
112
+ num_workers,
113
+ shuffle=False,
114
+ drop_last=False,
115
+ pin_memory=pin_memory)
116
+
117
+ return dataset, val_loader, eval_loader
118
+
119
+
120
+ def setup_eval_dataset(dataset_name: str, cfg: DictConfig) -> tuple[Dataset, DataLoader]:
121
+ if dataset_name.startswith('audiocaps_full'):
122
+ dataset = AudioCapsData(cfg.eval_data.AudioCaps_full.audio_path,
123
+ cfg.eval_data.AudioCaps_full.csv_path)
124
+ elif dataset_name.startswith('audiocaps'):
125
+ dataset = AudioCapsData(cfg.eval_data.AudioCaps.audio_path,
126
+ cfg.eval_data.AudioCaps.csv_path)
127
+ elif dataset_name.startswith('moviegen'):
128
+ dataset = MovieGen(cfg.eval_data.MovieGen.video_path,
129
+ cfg.eval_data.MovieGen.jsonl_path,
130
+ duration_sec=cfg.duration_s)
131
+ elif dataset_name.startswith('vggsound'):
132
+ dataset = VGGSound(cfg.eval_data.VGGSound.video_path,
133
+ cfg.eval_data.VGGSound.csv_path,
134
+ duration_sec=cfg.duration_s)
135
+ else:
136
+ raise ValueError(f'Invalid dataset name: {dataset_name}')
137
+
138
+ batch_size = cfg.batch_size
139
+ num_workers = cfg.num_workers
140
+ pin_memory = cfg.pin_memory
141
+ _, loader = construct_loader(dataset,
142
+ batch_size,
143
+ num_workers,
144
+ shuffle=False,
145
+ drop_last=False,
146
+ pin_memory=pin_memory,
147
+ error_avoidance=True)
148
+ return dataset, loader
149
+
150
+
151
+ def error_avoidance_collate(batch):
152
+ batch = list(filter(lambda x: x is not None, batch))
153
+ return default_collate(batch)
154
+
155
+
156
+ def construct_loader(dataset: Dataset,
157
+ batch_size: int,
158
+ num_workers: int,
159
+ *,
160
+ shuffle: bool = True,
161
+ drop_last: bool = True,
162
+ pin_memory: bool = False,
163
+ error_avoidance: bool = False) -> tuple[DistributedSampler, DataLoader]:
164
+ train_sampler = DistributedSampler(dataset, rank=local_rank, shuffle=shuffle)
165
+ train_loader = DataLoader(dataset,
166
+ batch_size,
167
+ sampler=train_sampler,
168
+ num_workers=num_workers,
169
+ worker_init_fn=worker_init_fn,
170
+ drop_last=drop_last,
171
+ persistent_workers=num_workers > 0,
172
+ pin_memory=pin_memory,
173
+ collate_fn=error_avoidance_collate if error_avoidance else None)
174
+ return train_sampler, train_loader
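One detail worth noting in `data_setup.py`: when `error_avoidance=True`, the loader swaps in `error_avoidance_collate`, which drops any sample the dataset returned as `None` (however the dataset chooses to signal a failed item) instead of crashing the whole batch. A tiny illustration of that collate behaviour, independent of the distributed setup:

```python
import torch
from torch.utils.data.dataloader import default_collate

def error_avoidance_collate(batch):
    # Same idea as in data_setup.py: drop failed (None) samples before collating.
    batch = list(filter(lambda x: x is not None, batch))
    return default_collate(batch)

batch = [{'x': torch.zeros(2)}, None, {'x': torch.ones(2)}]
out = error_avoidance_collate(batch)
print(out['x'].shape)  # torch.Size([2, 2]) -- the None sample was skipped
```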
{mmaudio/ext/bigvgan_v2/alias_free_activation/cuda → third_party/MMAudio/mmaudio/data/eval}/__init__.py RENAMED
File without changes
third_party/MMAudio/mmaudio/data/eval/audiocaps.py ADDED
@@ -0,0 +1,39 @@
 
 
 
1
+ import logging
2
+ import os
3
+ from collections import defaultdict
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import pandas as pd
8
+ import torch
9
+ from torch.utils.data.dataset import Dataset
10
+
11
+ log = logging.getLogger()
12
+
13
+
14
+ class AudioCapsData(Dataset):
15
+
16
+ def __init__(self, audio_path: Union[str, Path], csv_path: Union[str, Path]):
17
+ df = pd.read_csv(csv_path).to_dict(orient='records')
18
+
19
+ audio_files = sorted(os.listdir(audio_path))
20
+ audio_files = set(
21
+ [Path(f).stem for f in audio_files if f.endswith('.wav') or f.endswith('.flac')])
22
+
23
+ self.data = []
24
+ for row in df:
25
+ self.data.append({
26
+ 'name': row['name'],
27
+ 'caption': row['caption'],
28
+ })
29
+
30
+ self.audio_path = Path(audio_path)
31
+ self.csv_path = Path(csv_path)
32
+
33
+ log.info(f'Found {len(self.data)} matching audio files in {self.audio_path}')
34
+
35
+ def __getitem__(self, idx: int) -> torch.Tensor:
36
+ return self.data[idx]
37
+
38
+ def __len__(self):
39
+ return len(self.data)
third_party/MMAudio/mmaudio/data/eval/moviegen.py ADDED
@@ -0,0 +1,131 @@
 
 
 
1
+ import json
2
+ import logging
3
+ import os
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import torch
8
+ from torch.utils.data.dataset import Dataset
9
+ from torchvision.transforms import v2
10
+ from torio.io import StreamingMediaDecoder
11
+
12
+ from mmaudio.utils.dist_utils import local_rank
13
+
14
+ log = logging.getLogger()
15
+
16
+ _CLIP_SIZE = 384
17
+ _CLIP_FPS = 8.0
18
+
19
+ _SYNC_SIZE = 224
20
+ _SYNC_FPS = 25.0
21
+
22
+
23
+ class MovieGenData(Dataset):
24
+
25
+ def __init__(
26
+ self,
27
+ video_root: Union[str, Path],
28
+ sync_root: Union[str, Path],
29
+ jsonl_root: Union[str, Path],
30
+ *,
31
+ duration_sec: float = 10.0,
32
+ read_clip: bool = True,
33
+ ):
34
+ self.video_root = Path(video_root)
35
+ self.sync_root = Path(sync_root)
36
+ self.jsonl_root = Path(jsonl_root)
37
+ self.read_clip = read_clip
38
+
39
+ videos = sorted(os.listdir(self.video_root))
40
+ videos = [v[:-4] for v in videos] # remove extensions
41
+ self.captions = {}
42
+
43
+ for v in videos:
44
+ with open(self.jsonl_root / (v + '.jsonl')) as f:
45
+ data = json.load(f)
46
+ self.captions[v] = data['audio_prompt']
47
+
48
+ if local_rank == 0:
49
+ log.info(f'{len(videos)} videos found in {video_root}')
50
+
51
+ self.duration_sec = duration_sec
52
+
53
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
54
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
55
+
56
+ self.clip_augment = v2.Compose([
57
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
58
+ v2.ToImage(),
59
+ v2.ToDtype(torch.float32, scale=True),
60
+ ])
61
+
62
+ self.sync_augment = v2.Compose([
63
+ v2.Resize((_SYNC_SIZE, _SYNC_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
64
+ v2.CenterCrop(_SYNC_SIZE),
65
+ v2.ToImage(),
66
+ v2.ToDtype(torch.float32, scale=True),
67
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
68
+ ])
69
+
70
+ self.videos = videos
71
+
72
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
73
+ video_id = self.videos[idx]
74
+ caption = self.captions[video_id]
75
+
76
+ reader = StreamingMediaDecoder(self.video_root / (video_id + '.mp4'))
77
+ reader.add_basic_video_stream(
78
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
79
+ frame_rate=_CLIP_FPS,
80
+ format='rgb24',
81
+ )
82
+ reader.add_basic_video_stream(
83
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
84
+ frame_rate=_SYNC_FPS,
85
+ format='rgb24',
86
+ )
87
+
88
+ reader.fill_buffer()
89
+ data_chunk = reader.pop_chunks()
90
+
91
+ clip_chunk = data_chunk[0]
92
+ sync_chunk = data_chunk[1]
93
+ if clip_chunk is None:
94
+ raise RuntimeError(f'CLIP video returned None {video_id}')
95
+ if clip_chunk.shape[0] < self.clip_expected_length:
96
+ raise RuntimeError(f'CLIP video too short {video_id}')
97
+
98
+ if sync_chunk is None:
99
+ raise RuntimeError(f'Sync video returned None {video_id}')
100
+ if sync_chunk.shape[0] < self.sync_expected_length:
101
+ raise RuntimeError(f'Sync video too short {video_id}')
102
+
103
+ # truncate the video
104
+ clip_chunk = clip_chunk[:self.clip_expected_length]
105
+ if clip_chunk.shape[0] != self.clip_expected_length:
106
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
107
+ f'expected {self.clip_expected_length}, '
108
+ f'got {clip_chunk.shape[0]}')
109
+ clip_chunk = self.clip_augment(clip_chunk)
110
+
111
+ sync_chunk = sync_chunk[:self.sync_expected_length]
112
+ if sync_chunk.shape[0] != self.sync_expected_length:
113
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
114
+ f'expected {self.sync_expected_length}, '
115
+ f'got {sync_chunk.shape[0]}')
116
+ sync_chunk = self.sync_augment(sync_chunk)
117
+
118
+ data = {
119
+ 'name': video_id,
120
+ 'caption': caption,
121
+ 'clip_video': clip_chunk,
122
+ 'sync_video': sync_chunk,
123
+ }
124
+
125
+ return data
126
+
127
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
128
+ return self.sample(idx)
129
+
130
+ def __len__(self):
131
+ return len(self.captions)
third_party/MMAudio/mmaudio/data/eval/video_dataset.py ADDED
@@ -0,0 +1,197 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import logging
3
+ import os
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import pandas as pd
8
+ import torch
9
+ from torch.utils.data.dataset import Dataset
10
+ from torchvision.transforms import v2
11
+ from torio.io import StreamingMediaDecoder
12
+
13
+ from mmaudio.utils.dist_utils import local_rank
14
+
15
+ log = logging.getLogger()
16
+
17
+ _CLIP_SIZE = 384
18
+ _CLIP_FPS = 8.0
19
+
20
+ _SYNC_SIZE = 224
21
+ _SYNC_FPS = 25.0
22
+
23
+
24
+ class VideoDataset(Dataset):
25
+
26
+ def __init__(
27
+ self,
28
+ video_root: Union[str, Path],
29
+ *,
30
+ duration_sec: float = 8.0,
31
+ ):
32
+ self.video_root = Path(video_root)
33
+
34
+ self.duration_sec = duration_sec
35
+
36
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
37
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
38
+
39
+ self.clip_transform = v2.Compose([
40
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
41
+ v2.ToImage(),
42
+ v2.ToDtype(torch.float32, scale=True),
43
+ ])
44
+
45
+ self.sync_transform = v2.Compose([
46
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
47
+ v2.CenterCrop(_SYNC_SIZE),
48
+ v2.ToImage(),
49
+ v2.ToDtype(torch.float32, scale=True),
50
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
51
+ ])
52
+
53
+ # to be implemented by subclasses
54
+ self.captions = {}
55
+ self.videos = sorted(list(self.captions.keys()))
56
+
57
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
58
+ video_id = self.videos[idx]
59
+ caption = self.captions[video_id]
60
+
61
+ reader = StreamingMediaDecoder(self.video_root / (video_id + '.mp4'))
62
+ reader.add_basic_video_stream(
63
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
64
+ frame_rate=_CLIP_FPS,
65
+ format='rgb24',
66
+ )
67
+ reader.add_basic_video_stream(
68
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
69
+ frame_rate=_SYNC_FPS,
70
+ format='rgb24',
71
+ )
72
+
73
+ reader.fill_buffer()
74
+ data_chunk = reader.pop_chunks()
75
+
76
+ clip_chunk = data_chunk[0]
77
+ sync_chunk = data_chunk[1]
78
+ if clip_chunk is None:
79
+ raise RuntimeError(f'CLIP video returned None {video_id}')
80
+ if clip_chunk.shape[0] < self.clip_expected_length:
81
+ raise RuntimeError(
82
+ f'CLIP video too short {video_id}, expected {self.clip_expected_length}, got {clip_chunk.shape[0]}'
83
+ )
84
+
85
+ if sync_chunk is None:
86
+ raise RuntimeError(f'Sync video returned None {video_id}')
87
+ if sync_chunk.shape[0] < self.sync_expected_length:
88
+ raise RuntimeError(
89
+ f'Sync video too short {video_id}, expected {self.sync_expected_length}, got {sync_chunk.shape[0]}'
90
+ )
91
+
92
+ # truncate the video
93
+ clip_chunk = clip_chunk[:self.clip_expected_length]
94
+ if clip_chunk.shape[0] != self.clip_expected_length:
95
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
96
+ f'expected {self.clip_expected_length}, '
97
+ f'got {clip_chunk.shape[0]}')
98
+ clip_chunk = self.clip_transform(clip_chunk)
99
+
100
+ sync_chunk = sync_chunk[:self.sync_expected_length]
101
+ if sync_chunk.shape[0] != self.sync_expected_length:
102
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
103
+ f'expected {self.sync_expected_length}, '
104
+ f'got {sync_chunk.shape[0]}')
105
+ sync_chunk = self.sync_transform(sync_chunk)
106
+
107
+ data = {
108
+ 'name': video_id,
109
+ 'caption': caption,
110
+ 'clip_video': clip_chunk,
111
+ 'sync_video': sync_chunk,
112
+ }
113
+
114
+ return data
115
+
116
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
117
+ try:
118
+ return self.sample(idx)
119
+ except Exception as e:
120
+ log.error(f'Error loading video {self.videos[idx]}: {e}')
121
+ return None
122
+
123
+ def __len__(self):
124
+ return len(self.captions)
125
+
126
+
127
+ class VGGSound(VideoDataset):
128
+
129
+ def __init__(
130
+ self,
131
+ video_root: Union[str, Path],
132
+ csv_path: Union[str, Path],
133
+ *,
134
+ duration_sec: float = 8.0,
135
+ ):
136
+ super().__init__(video_root, duration_sec=duration_sec)
137
+ self.video_root = Path(video_root)
138
+ self.csv_path = Path(csv_path)
139
+
140
+ videos = sorted(os.listdir(self.video_root))
141
+ if local_rank == 0:
142
+ log.info(f'{len(videos)} videos found in {video_root}')
143
+ self.captions = {}
144
+
145
+ df = pd.read_csv(csv_path, header=None, names=['id', 'sec', 'caption',
146
+ 'split']).to_dict(orient='records')
147
+
148
+ videos_no_found = []
149
+ for row in df:
150
+ if row['split'] == 'test':
151
+ start_sec = int(row['sec'])
152
+ video_id = str(row['id'])
153
+ # this is how our videos are named
154
+ video_name = f'{video_id}_{start_sec:06d}'
155
+ if video_name + '.mp4' not in videos:
156
+ videos_no_found.append(video_name)
157
+ continue
158
+
159
+ self.captions[video_name] = row['caption']
160
+
161
+ if local_rank == 0:
162
+ log.info(f'{len(videos)} videos found in {video_root}')
163
+ log.info(f'{len(self.captions)} useable videos found')
164
+ if videos_no_found:
165
+ log.info(f'{len(videos_no_found)} found in {csv_path} but not in {video_root}')
166
+ log.info(
167
+ 'A small amount is expected, as not all videos are still available on YouTube')
168
+
169
+ self.videos = sorted(list(self.captions.keys()))
170
+
171
+
172
+ class MovieGen(VideoDataset):
173
+
174
+ def __init__(
175
+ self,
176
+ video_root: Union[str, Path],
177
+ jsonl_root: Union[str, Path],
178
+ *,
179
+ duration_sec: float = 10.0,
180
+ ):
181
+ super().__init__(video_root, duration_sec=duration_sec)
182
+ self.video_root = Path(video_root)
183
+ self.jsonl_root = Path(jsonl_root)
184
+
185
+ videos = sorted(os.listdir(self.video_root))
186
+ videos = [v[:-4] for v in videos] # remove extensions
187
+ self.captions = {}
188
+
189
+ for v in videos:
190
+ with open(self.jsonl_root / (v + '.jsonl')) as f:
191
+ data = json.load(f)
192
+ self.captions[v] = data['audio_prompt']
193
+
194
+ if local_rank == 0:
195
+ log.info(f'{len(videos)} videos found in {video_root}')
196
+
197
+ self.videos = videos
third_party/MMAudio/mmaudio/data/extracted_audio.py ADDED
@@ -0,0 +1,88 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import logging
2
+ from pathlib import Path
3
+ from typing import Union
4
+
5
+ import pandas as pd
6
+ import torch
7
+ from tensordict import TensorDict
8
+ from torch.utils.data.dataset import Dataset
9
+
10
+ from mmaudio.utils.dist_utils import local_rank
11
+
12
+ log = logging.getLogger()
13
+
14
+
15
+ class ExtractedAudio(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ tsv_path: Union[str, Path],
20
+ *,
21
+ premade_mmap_dir: Union[str, Path],
22
+ data_dim: dict[str, int],
23
+ ):
24
+ super().__init__()
25
+
26
+ self.data_dim = data_dim
27
+ self.df_list = pd.read_csv(tsv_path, sep='\t').to_dict('records')
28
+ self.ids = [str(d['id']) for d in self.df_list]
29
+
30
+ log.info(f'Loading precomputed mmap from {premade_mmap_dir}')
31
+ # load precomputed memory mapped tensors
32
+ premade_mmap_dir = Path(premade_mmap_dir)
33
+ td = TensorDict.load_memmap(premade_mmap_dir)
34
+ log.info(f'Loaded precomputed mmap from {premade_mmap_dir}')
35
+ self.mean = td['mean']
36
+ self.std = td['std']
37
+ self.text_features = td['text_features']
38
+
39
+ log.info(f'Loaded {len(self)} samples from {premade_mmap_dir}.')
40
+ log.info(f'Loaded mean: {self.mean.shape}.')
41
+ log.info(f'Loaded std: {self.std.shape}.')
42
+ log.info(f'Loaded text features: {self.text_features.shape}.')
43
+
44
+ assert self.mean.shape[1] == self.data_dim['latent_seq_len'], \
45
+ f'{self.mean.shape[1]} != {self.data_dim["latent_seq_len"]}'
46
+ assert self.std.shape[1] == self.data_dim['latent_seq_len'], \
47
+ f'{self.std.shape[1]} != {self.data_dim["latent_seq_len"]}'
48
+
49
+ assert self.text_features.shape[1] == self.data_dim['text_seq_len'], \
50
+ f'{self.text_features.shape[1]} != {self.data_dim["text_seq_len"]}'
51
+ assert self.text_features.shape[-1] == self.data_dim['text_dim'], \
52
+ f'{self.text_features.shape[-1]} != {self.data_dim["text_dim"]}'
53
+
54
+ self.fake_clip_features = torch.zeros(self.data_dim['clip_seq_len'],
55
+ self.data_dim['clip_dim'])
56
+ self.fake_sync_features = torch.zeros(self.data_dim['sync_seq_len'],
57
+ self.data_dim['sync_dim'])
58
+ self.video_exist = torch.tensor(0, dtype=torch.bool)
59
+ self.text_exist = torch.tensor(1, dtype=torch.bool)
60
+
61
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
62
+ latents = self.mean
63
+ return latents.mean(dim=(0, 1)), latents.std(dim=(0, 1))
64
+
65
+ def get_memory_mapped_tensor(self) -> TensorDict:
66
+ td = TensorDict({
67
+ 'mean': self.mean,
68
+ 'std': self.std,
69
+ 'text_features': self.text_features,
70
+ })
71
+ return td
72
+
73
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
74
+ data = {
75
+ 'id': str(self.df_list[idx]['id']),
76
+ 'a_mean': self.mean[idx],
77
+ 'a_std': self.std[idx],
78
+ 'clip_features': self.fake_clip_features,
79
+ 'sync_features': self.fake_sync_features,
80
+ 'text_features': self.text_features[idx],
81
+ 'caption': self.df_list[idx]['caption'],
82
+ 'video_exist': self.video_exist,
83
+ 'text_exist': self.text_exist,
84
+ }
85
+ return data
86
+
87
+ def __len__(self):
88
+ return len(self.ids)
third_party/MMAudio/mmaudio/data/extracted_vgg.py ADDED
@@ -0,0 +1,101 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import logging
2
+ from pathlib import Path
3
+ from typing import Union
4
+
5
+ import pandas as pd
6
+ import torch
7
+ from tensordict import TensorDict
8
+ from torch.utils.data.dataset import Dataset
9
+
10
+ from mmaudio.utils.dist_utils import local_rank
11
+
12
+ log = logging.getLogger()
13
+
14
+
15
+ class ExtractedVGG(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ tsv_path: Union[str, Path],
20
+ *,
21
+ premade_mmap_dir: Union[str, Path],
22
+ data_dim: dict[str, int],
23
+ ):
24
+ super().__init__()
25
+
26
+ self.data_dim = data_dim
27
+ self.df_list = pd.read_csv(tsv_path, sep='\t').to_dict('records')
28
+ self.ids = [d['id'] for d in self.df_list]
29
+
30
+ log.info(f'Loading precomputed mmap from {premade_mmap_dir}')
31
+ # load precomputed memory mapped tensors
32
+ premade_mmap_dir = Path(premade_mmap_dir)
33
+ td = TensorDict.load_memmap(premade_mmap_dir)
34
+ log.info(f'Loaded precomputed mmap from {premade_mmap_dir}')
35
+ self.mean = td['mean']
36
+ self.std = td['std']
37
+ self.clip_features = td['clip_features']
38
+ self.sync_features = td['sync_features']
39
+ self.text_features = td['text_features']
40
+
41
+ if local_rank == 0:
42
+ log.info(f'Loaded {len(self)} samples.')
43
+ log.info(f'Loaded mean: {self.mean.shape}.')
44
+ log.info(f'Loaded std: {self.std.shape}.')
45
+ log.info(f'Loaded clip_features: {self.clip_features.shape}.')
46
+ log.info(f'Loaded sync_features: {self.sync_features.shape}.')
47
+ log.info(f'Loaded text_features: {self.text_features.shape}.')
48
+
49
+ assert self.mean.shape[1] == self.data_dim['latent_seq_len'], \
50
+ f'{self.mean.shape[1]} != {self.data_dim["latent_seq_len"]}'
51
+ assert self.std.shape[1] == self.data_dim['latent_seq_len'], \
52
+ f'{self.std.shape[1]} != {self.data_dim["latent_seq_len"]}'
53
+
54
+ assert self.clip_features.shape[1] == self.data_dim['clip_seq_len'], \
55
+ f'{self.clip_features.shape[1]} != {self.data_dim["clip_seq_len"]}'
56
+ assert self.sync_features.shape[1] == self.data_dim['sync_seq_len'], \
57
+ f'{self.sync_features.shape[1]} != {self.data_dim["sync_seq_len"]}'
58
+ assert self.text_features.shape[1] == self.data_dim['text_seq_len'], \
59
+ f'{self.text_features.shape[1]} != {self.data_dim["text_seq_len"]}'
60
+
61
+ assert self.clip_features.shape[-1] == self.data_dim['clip_dim'], \
62
+ f'{self.clip_features.shape[-1]} != {self.data_dim["clip_dim"]}'
63
+ assert self.sync_features.shape[-1] == self.data_dim['sync_dim'], \
64
+ f'{self.sync_features.shape[-1]} != {self.data_dim["sync_dim"]}'
65
+ assert self.text_features.shape[-1] == self.data_dim['text_dim'], \
66
+ f'{self.text_features.shape[-1]} != {self.data_dim["text_dim"]}'
67
+
68
+ self.video_exist = torch.tensor(1, dtype=torch.bool)
69
+ self.text_exist = torch.tensor(1, dtype=torch.bool)
70
+
71
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
72
+ latents = self.mean
73
+ return latents.mean(dim=(0, 1)), latents.std(dim=(0, 1))
74
+
75
+ def get_memory_mapped_tensor(self) -> TensorDict:
76
+ td = TensorDict({
77
+ 'mean': self.mean,
78
+ 'std': self.std,
79
+ 'clip_features': self.clip_features,
80
+ 'sync_features': self.sync_features,
81
+ 'text_features': self.text_features,
82
+ })
83
+ return td
84
+
85
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
86
+ data = {
87
+ 'id': self.df_list[idx]['id'],
88
+ 'a_mean': self.mean[idx],
89
+ 'a_std': self.std[idx],
90
+ 'clip_features': self.clip_features[idx],
91
+ 'sync_features': self.sync_features[idx],
92
+ 'text_features': self.text_features[idx],
93
+ 'caption': self.df_list[idx]['label'],
94
+ 'video_exist': self.video_exist,
95
+ 'text_exist': self.text_exist,
96
+ }
97
+
98
+ return data
99
+
100
+ def __len__(self):
101
+ return len(self.ids)
{mmaudio/model → third_party/MMAudio/mmaudio/data/extraction}/__init__.py RENAMED
File without changes
third_party/MMAudio/mmaudio/data/extraction/vgg_sound.py ADDED
@@ -0,0 +1,193 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import logging
2
+ import os
3
+ from pathlib import Path
4
+ from typing import Optional, Union
5
+
6
+ import pandas as pd
7
+ import torch
8
+ import torchaudio
9
+ from torch.utils.data.dataset import Dataset
10
+ from torchvision.transforms import v2
11
+ from torio.io import StreamingMediaDecoder
12
+
13
+ from mmaudio.utils.dist_utils import local_rank
14
+
15
+ log = logging.getLogger()
16
+
17
+ _CLIP_SIZE = 384
18
+ _CLIP_FPS = 8.0
19
+
20
+ _SYNC_SIZE = 224
21
+ _SYNC_FPS = 25.0
22
+
23
+
24
+ class VGGSound(Dataset):
25
+
26
+ def __init__(
27
+ self,
28
+ root: Union[str, Path],
29
+ *,
30
+ tsv_path: Union[str, Path] = 'sets/vgg3-train.tsv',
31
+ sample_rate: int = 16_000,
32
+ duration_sec: float = 8.0,
33
+ audio_samples: Optional[int] = None,
34
+ normalize_audio: bool = False,
35
+ ):
36
+ self.root = Path(root)
37
+ self.normalize_audio = normalize_audio
38
+ if audio_samples is None:
39
+ self.audio_samples = int(sample_rate * duration_sec)
40
+ else:
41
+ self.audio_samples = audio_samples
42
+ effective_duration = audio_samples / sample_rate
43
+ # make sure the duration is close enough, within 15ms
44
+ assert abs(effective_duration - duration_sec) < 0.015, \
45
+ f'audio_samples {audio_samples} does not match duration_sec {duration_sec}'
46
+
47
+ videos = sorted(os.listdir(self.root))
48
+ videos = set([Path(v).stem for v in videos]) # remove extensions
49
+ self.labels = {}
50
+ self.videos = []
51
+ missing_videos = []
52
+
53
+ # read the tsv for subset information
54
+ df_list = pd.read_csv(tsv_path, sep='\t', dtype={'id': str}).to_dict('records')
55
+ for record in df_list:
56
+ id = record['id']
57
+ label = record['label']
58
+ if id in videos:
59
+ self.labels[id] = label
60
+ self.videos.append(id)
61
+ else:
62
+ missing_videos.append(id)
63
+
64
+ if local_rank == 0:
65
+ log.info(f'{len(videos)} videos found in {root}')
66
+ log.info(f'{len(self.videos)} videos found in {tsv_path}')
67
+ log.info(f'{len(missing_videos)} videos missing in {root}')
68
+
69
+ self.sample_rate = sample_rate
70
+ self.duration_sec = duration_sec
71
+
72
+ self.expected_audio_length = audio_samples
73
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
74
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
75
+
76
+ self.clip_transform = v2.Compose([
77
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
78
+ v2.ToImage(),
79
+ v2.ToDtype(torch.float32, scale=True),
80
+ ])
81
+
82
+ self.sync_transform = v2.Compose([
83
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
84
+ v2.CenterCrop(_SYNC_SIZE),
85
+ v2.ToImage(),
86
+ v2.ToDtype(torch.float32, scale=True),
87
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
88
+ ])
89
+
90
+ self.resampler = {}
91
+
92
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
93
+ video_id = self.videos[idx]
94
+ label = self.labels[video_id]
95
+
96
+ reader = StreamingMediaDecoder(self.root / (video_id + '.mp4'))
97
+ reader.add_basic_video_stream(
98
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
99
+ frame_rate=_CLIP_FPS,
100
+ format='rgb24',
101
+ )
102
+ reader.add_basic_video_stream(
103
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
104
+ frame_rate=_SYNC_FPS,
105
+ format='rgb24',
106
+ )
107
+ reader.add_basic_audio_stream(frames_per_chunk=2**30, )
108
+
109
+ reader.fill_buffer()
110
+ data_chunk = reader.pop_chunks()
111
+
112
+ clip_chunk = data_chunk[0]
113
+ sync_chunk = data_chunk[1]
114
+ audio_chunk = data_chunk[2]
115
+
116
+ if clip_chunk is None:
117
+ raise RuntimeError(f'CLIP video returned None {video_id}')
118
+ if clip_chunk.shape[0] < self.clip_expected_length:
119
+ raise RuntimeError(
120
+ f'CLIP video too short {video_id}, expected {self.clip_expected_length}, got {clip_chunk.shape[0]}'
121
+ )
122
+
123
+ if sync_chunk is None:
124
+ raise RuntimeError(f'Sync video returned None {video_id}')
125
+ if sync_chunk.shape[0] < self.sync_expected_length:
126
+ raise RuntimeError(
127
+ f'Sync video too short {video_id}, expected {self.sync_expected_length}, got {sync_chunk.shape[0]}'
128
+ )
129
+
130
+ # process audio
131
+ sample_rate = int(reader.get_out_stream_info(2).sample_rate)
132
+ audio_chunk = audio_chunk.transpose(0, 1)
133
+ audio_chunk = audio_chunk.mean(dim=0) # mono
134
+ if self.normalize_audio:
135
+ abs_max = audio_chunk.abs().max()
136
+ audio_chunk = audio_chunk / abs_max * 0.95
137
+ if abs_max <= 1e-6:
138
+ raise RuntimeError(f'Audio is silent {video_id}')
139
+
140
+ # resample
141
+ if sample_rate == self.sample_rate:
142
+ audio_chunk = audio_chunk
143
+ else:
144
+ if sample_rate not in self.resampler:
145
+ # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
146
+ self.resampler[sample_rate] = torchaudio.transforms.Resample(
147
+ sample_rate,
148
+ self.sample_rate,
149
+ lowpass_filter_width=64,
150
+ rolloff=0.9475937167399596,
151
+ resampling_method='sinc_interp_kaiser',
152
+ beta=14.769656459379492,
153
+ )
154
+ audio_chunk = self.resampler[sample_rate](audio_chunk)
155
+
156
+ if audio_chunk.shape[0] < self.expected_audio_length:
157
+ raise RuntimeError(f'Audio too short {video_id}')
158
+ audio_chunk = audio_chunk[:self.expected_audio_length]
159
+
160
+ # truncate the video
161
+ clip_chunk = clip_chunk[:self.clip_expected_length]
162
+ if clip_chunk.shape[0] != self.clip_expected_length:
163
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
164
+ f'expected {self.clip_expected_length}, '
165
+ f'got {clip_chunk.shape[0]}')
166
+ clip_chunk = self.clip_transform(clip_chunk)
167
+
168
+ sync_chunk = sync_chunk[:self.sync_expected_length]
169
+ if sync_chunk.shape[0] != self.sync_expected_length:
170
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
171
+ f'expected {self.sync_expected_length}, '
172
+ f'got {sync_chunk.shape[0]}')
173
+ sync_chunk = self.sync_transform(sync_chunk)
174
+
175
+ data = {
176
+ 'id': video_id,
177
+ 'caption': label,
178
+ 'audio': audio_chunk,
179
+ 'clip_video': clip_chunk,
180
+ 'sync_video': sync_chunk,
181
+ }
182
+
183
+ return data
184
+
185
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
186
+ try:
187
+ return self.sample(idx)
188
+ except Exception as e:
189
+ log.error(f'Error loading video {self.videos[idx]}: {e}')
190
+ return None
191
+
192
+ def __len__(self):
193
+ return len(self.labels)
third_party/MMAudio/mmaudio/data/extraction/wav_dataset.py ADDED
@@ -0,0 +1,132 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import logging
2
+ import os
3
+ from pathlib import Path
4
+ from typing import Union
5
+
6
+ import open_clip
7
+ import pandas as pd
8
+ import torch
9
+ import torchaudio
10
+ from torch.utils.data.dataset import Dataset
11
+
12
+ log = logging.getLogger()
13
+
14
+
15
+ class WavTextClipsDataset(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ root: Union[str, Path],
20
+ *,
21
+ captions_tsv: Union[str, Path],
22
+ clips_tsv: Union[str, Path],
23
+ sample_rate: int,
24
+ num_samples: int,
25
+ normalize_audio: bool = False,
26
+ reject_silent: bool = False,
27
+ tokenizer_id: str = 'ViT-H-14-378-quickgelu',
28
+ ):
29
+ self.root = Path(root)
30
+ self.sample_rate = sample_rate
31
+ self.num_samples = num_samples
32
+ self.normalize_audio = normalize_audio
33
+ self.reject_silent = reject_silent
34
+ self.tokenizer = open_clip.get_tokenizer(tokenizer_id)
35
+
36
+ audios = sorted(os.listdir(self.root))
37
+ audios = set([
38
+ Path(audio).stem for audio in audios
39
+ if audio.endswith('.wav') or audio.endswith('.flac')
40
+ ])
41
+ self.captions = {}
42
+
43
+ # read the caption tsv
44
+ df_list = pd.read_csv(captions_tsv, sep='\t', dtype={'id': str}).to_dict('records')
45
+ for record in df_list:
46
+ id = record['id']
47
+ caption = record['caption']
48
+ self.captions[id] = caption
49
+
50
+ # read the clip tsv
51
+ df_list = pd.read_csv(clips_tsv, sep='\t', dtype={
52
+ 'id': str,
53
+ 'name': str
54
+ }).to_dict('records')
55
+ self.clips = []
56
+ for record in df_list:
57
+ record['id'] = record['id']
58
+ record['name'] = record['name']
59
+ id = record['id']
60
+ name = record['name']
61
+ if name not in self.captions:
62
+ log.warning(f'Audio {name} not found in {captions_tsv}')
63
+ continue
64
+ record['caption'] = self.captions[name]
65
+ self.clips.append(record)
66
+
67
+ log.info(f'Found {len(self.clips)} audio files in {self.root}')
68
+
69
+ self.resampler = {}
70
+
71
+ def __getitem__(self, idx: int) -> torch.Tensor:
72
+ try:
73
+ clip = self.clips[idx]
74
+ audio_name = clip['name']
75
+ audio_id = clip['id']
76
+ caption = clip['caption']
77
+ start_sample = clip['start_sample']
78
+ end_sample = clip['end_sample']
79
+
80
+ audio_path = self.root / f'{audio_name}.flac'
81
+ if not audio_path.exists():
82
+ audio_path = self.root / f'{audio_name}.wav'
83
+ assert audio_path.exists()
84
+
85
+ audio_chunk, sample_rate = torchaudio.load(audio_path)
86
+ audio_chunk = audio_chunk.mean(dim=0) # mono
87
+ abs_max = audio_chunk.abs().max()
88
+ if self.normalize_audio:
89
+ audio_chunk = audio_chunk / abs_max * 0.95
90
+
91
+ if self.reject_silent and abs_max < 1e-6:
92
+ log.warning(f'Rejecting silent audio')
93
+ return None
94
+
95
+ audio_chunk = audio_chunk[start_sample:end_sample]
96
+
97
+ # resample
98
+ if sample_rate == self.sample_rate:
99
+ audio_chunk = audio_chunk
100
+ else:
101
+ if sample_rate not in self.resampler:
102
+ # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
103
+ self.resampler[sample_rate] = torchaudio.transforms.Resample(
104
+ sample_rate,
105
+ self.sample_rate,
106
+ lowpass_filter_width=64,
107
+ rolloff=0.9475937167399596,
108
+ resampling_method='sinc_interp_kaiser',
109
+ beta=14.769656459379492,
110
+ )
111
+ audio_chunk = self.resampler[sample_rate](audio_chunk)
112
+
113
+ if audio_chunk.shape[0] < self.num_samples:
114
+ raise ValueError('Audio is too short')
115
+ audio_chunk = audio_chunk[:self.num_samples]
116
+
117
+ tokens = self.tokenizer([caption])[0]
118
+
119
+ output = {
120
+ 'waveform': audio_chunk,
121
+ 'id': audio_id,
122
+ 'caption': caption,
123
+ 'tokens': tokens,
124
+ }
125
+
126
+ return output
127
+ except Exception as e:
128
+ log.error(f'Error reading {audio_path}: {e}')
129
+ return None
130
+
131
+ def __len__(self):
132
+ return len(self.clips)
third_party/MMAudio/mmaudio/data/mm_dataset.py ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import bisect
2
+
3
+ import torch
4
+ from torch.utils.data.dataset import Dataset
5
+
6
+
7
+ # modified from https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#ConcatDataset
8
+ class MultiModalDataset(Dataset):
9
+ datasets: list[Dataset]
10
+ cumulative_sizes: list[int]
11
+
12
+ @staticmethod
13
+ def cumsum(sequence):
14
+ r, s = [], 0
15
+ for e in sequence:
16
+ l = len(e)
17
+ r.append(l + s)
18
+ s += l
19
+ return r
20
+
21
+ def __init__(self, video_datasets: list[Dataset], audio_datasets: list[Dataset]):
22
+ super().__init__()
23
+ self.video_datasets = list(video_datasets)
24
+ self.audio_datasets = list(audio_datasets)
25
+ self.datasets = self.video_datasets + self.audio_datasets
26
+
27
+ self.cumulative_sizes = self.cumsum(self.datasets)
28
+
29
+ def __len__(self):
30
+ return self.cumulative_sizes[-1]
31
+
32
+ def __getitem__(self, idx):
33
+ if idx < 0:
34
+ if -idx > len(self):
35
+ raise ValueError("absolute value of index should not exceed dataset length")
36
+ idx = len(self) + idx
37
+ dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
38
+ if dataset_idx == 0:
39
+ sample_idx = idx
40
+ else:
41
+ sample_idx = idx - self.cumulative_sizes[dataset_idx - 1]
42
+ return self.datasets[dataset_idx][sample_idx]
43
+
44
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
45
+ return self.video_datasets[0].compute_latent_stats()
third_party/MMAudio/mmaudio/data/utils.py ADDED
@@ -0,0 +1,148 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import logging
2
+ import os
3
+ import random
4
+ import tempfile
5
+ from pathlib import Path
6
+ from typing import Any, Optional, Union
7
+
8
+ import torch
9
+ import torch.distributed as dist
10
+ from tensordict import MemoryMappedTensor
11
+ from torch.utils.data import DataLoader
12
+ from torch.utils.data.dataset import Dataset
13
+ from tqdm import tqdm
14
+
15
+ from mmaudio.utils.dist_utils import local_rank, world_size
16
+
17
+ scratch_path = Path(os.environ['SLURM_SCRATCH'] if 'SLURM_SCRATCH' in os.environ else '/dev/shm')
18
+ shm_path = Path('/dev/shm')
19
+
20
+ log = logging.getLogger()
21
+
22
+
23
+ def reseed(seed):
24
+ random.seed(seed)
25
+ torch.manual_seed(seed)
26
+
27
+
28
+ def local_scatter_torch(obj: Optional[Any]):
29
+ if world_size == 1:
30
+ # Just one worker. Do nothing.
31
+ return obj
32
+
33
+ array = [obj] * world_size
34
+ target_array = [None]
35
+ if local_rank == 0:
36
+ dist.scatter_object_list(target_array, scatter_object_input_list=array, src=0)
37
+ else:
38
+ dist.scatter_object_list(target_array, scatter_object_input_list=None, src=0)
39
+ return target_array[0]
40
+
41
+
42
+ class ShardDataset(Dataset):
43
+
44
+ def __init__(self, root):
45
+ self.root = root
46
+ self.shards = sorted(os.listdir(root))
47
+
48
+ def __len__(self):
49
+ return len(self.shards)
50
+
51
+ def __getitem__(self, idx):
52
+ return torch.load(os.path.join(self.root, self.shards[idx]), weights_only=True)
53
+
54
+
55
+ def get_tmp_dir(in_memory: bool) -> Path:
56
+ return shm_path if in_memory else scratch_path
57
+
58
+
59
+ def load_shards_and_share(data_path: Union[str, Path], ids: list[int],
60
+ in_memory: bool) -> MemoryMappedTensor:
61
+ if local_rank == 0:
62
+ with tempfile.NamedTemporaryFile(prefix='shared-tensor-', dir=get_tmp_dir(in_memory)) as f:
63
+ log.info(f'Loading shards from {data_path} into {f.name}...')
64
+ data = load_shards(data_path, ids=ids, tmp_file_path=f.name)
65
+ data = share_tensor_to_all(data)
66
+ torch.distributed.barrier()
67
+ f.close() # why does the context manager not close the file for me?
68
+ else:
69
+ log.info('Waiting for the data to be shared with me...')
70
+ data = share_tensor_to_all(None)
71
+ torch.distributed.barrier()
72
+
73
+ return data
74
+
75
+
76
+ def load_shards(
77
+ data_path: Union[str, Path],
78
+ ids: list[int],
79
+ *,
80
+ tmp_file_path: str,
81
+ ) -> Union[torch.Tensor, dict[str, torch.Tensor]]:
82
+
83
+ id_set = set(ids)
84
+ shards = sorted(os.listdir(data_path))
85
+ log.info(f'Found {len(shards)} shards in {data_path}.')
86
+ first_shard = torch.load(os.path.join(data_path, shards[0]), weights_only=True)
87
+
88
+ log.info(f'Rank {local_rank} created file {tmp_file_path}')
89
+ first_item = next(iter(first_shard.values()))
90
+ log.info(f'First item shape: {first_item.shape}')
91
+ mm_tensor = MemoryMappedTensor.empty(shape=(len(ids), *first_item.shape),
92
+ dtype=torch.float32,
93
+ filename=tmp_file_path,
94
+ existsok=True)
95
+ total_count = 0
96
+ used_index = set()
97
+ id_indexing = {i: idx for idx, i in enumerate(ids)}
98
+ # faster with no workers; otherwise we need to set_sharing_strategy('file_system')
99
+ loader = DataLoader(ShardDataset(data_path), batch_size=1, num_workers=0)
100
+ for data in tqdm(loader, desc='Loading shards'):
101
+ for i, v in data.items():
102
+ if i not in id_set:
103
+ continue
104
+
105
+ # tensor_index = ids.index(i)
106
+ tensor_index = id_indexing[i]
107
+ if tensor_index in used_index:
108
+ raise ValueError(f'Duplicate id {i} found in {data_path}.')
109
+ used_index.add(tensor_index)
110
+ mm_tensor[tensor_index] = v
111
+ total_count += 1
112
+
113
+ assert total_count == len(ids), f'Expected {len(ids)} tensors, got {total_count}.'
114
+ log.info(f'Loaded {total_count} tensors from {data_path}.')
115
+
116
+ return mm_tensor
117
+
118
+
119
+ def share_tensor_to_all(x: Optional[MemoryMappedTensor]) -> MemoryMappedTensor:
120
+ """
121
+ x: the tensor to be shared; None if local_rank != 0
122
+ return: the shared tensor
123
+ """
124
+
125
+ # there is no need to share your stuff with anyone if you are alone; must be in memory
126
+ if world_size == 1:
127
+ return x
128
+
129
+ if local_rank == 0:
130
+ assert x is not None, 'x must not be None if local_rank == 0'
131
+ else:
132
+ assert x is None, 'x must be None if local_rank != 0'
133
+
134
+ if local_rank == 0:
135
+ filename = x.filename
136
+ meta_information = (filename, x.shape, x.dtype)
137
+ else:
138
+ meta_information = None
139
+
140
+ filename, data_shape, data_type = local_scatter_torch(meta_information)
141
+ if local_rank == 0:
142
+ data = x
143
+ else:
144
+ data = MemoryMappedTensor.from_filename(filename=filename,
145
+ dtype=data_type,
146
+ shape=data_shape)
147
+
148
+ return data