lym0302 committed
Commit 0163d98
1 Parent(s): 4cfc7ca
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +69 -101
  2. app.py +34 -102
  3. batch_eval.py +0 -110
  4. config/__init__.py +0 -0
  5. config/base_config.yaml +0 -62
  6. config/data/base.yaml +0 -70
  7. config/eval_config.yaml +0 -17
  8. config/eval_data/base.yaml +0 -22
  9. config/hydra/job_logging/custom-eval.yaml +0 -32
  10. config/hydra/job_logging/custom-no-rank.yaml +0 -32
  11. config/hydra/job_logging/custom-simplest.yaml +0 -26
  12. config/hydra/job_logging/custom.yaml +0 -33
  13. config/train_config.yaml +0 -41
  14. demo.py +1 -7
  15. docs/EVAL.md +0 -22
  16. docs/MODELS.md +0 -50
  17. docs/TRAINING.md +0 -184
  18. docs/index.html +10 -12
  19. gradio_demo.py +0 -343
  20. mmaudio/__pycache__/__init__.cpython-310.pyc +0 -0
  21. mmaudio/__pycache__/__init__.cpython-38.pyc +0 -0
  22. mmaudio/__pycache__/eval_utils.cpython-310.pyc +0 -0
  23. mmaudio/__pycache__/eval_utils.cpython-38.pyc +0 -0
  24. mmaudio/data/__pycache__/__init__.cpython-310.pyc +0 -0
  25. mmaudio/data/__pycache__/__init__.cpython-38.pyc +0 -0
  26. mmaudio/data/__pycache__/av_utils.cpython-310.pyc +0 -0
  27. mmaudio/data/__pycache__/av_utils.cpython-38.pyc +0 -0
  28. mmaudio/data/av_utils.py +4 -30
  29. mmaudio/data/data_setup.py +0 -174
  30. mmaudio/data/eval/__init__.py +0 -0
  31. mmaudio/data/eval/audiocaps.py +0 -39
  32. mmaudio/data/eval/moviegen.py +0 -131
  33. mmaudio/data/eval/video_dataset.py +0 -197
  34. mmaudio/data/extracted_audio.py +0 -88
  35. mmaudio/data/extracted_vgg.py +0 -101
  36. mmaudio/data/extraction/__init__.py +0 -0
  37. mmaudio/data/extraction/vgg_sound.py +0 -193
  38. mmaudio/data/extraction/wav_dataset.py +0 -132
  39. mmaudio/data/mm_dataset.py +0 -45
  40. mmaudio/data/utils.py +0 -148
  41. mmaudio/eval_utils.py +25 -63
  42. mmaudio/ext/__pycache__/__init__.cpython-310.pyc +0 -0
  43. mmaudio/ext/__pycache__/__init__.cpython-38.pyc +0 -0
  44. mmaudio/ext/__pycache__/mel_converter.cpython-310.pyc +0 -0
  45. mmaudio/ext/__pycache__/mel_converter.cpython-38.pyc +0 -0
  46. mmaudio/ext/__pycache__/rotary_embeddings.cpython-310.pyc +0 -0
  47. mmaudio/ext/__pycache__/rotary_embeddings.cpython-38.pyc +0 -0
  48. mmaudio/ext/autoencoder/__pycache__/__init__.cpython-310.pyc +0 -0
  49. mmaudio/ext/autoencoder/__pycache__/__init__.cpython-38.pyc +0 -0
  50. mmaudio/ext/autoencoder/__pycache__/autoencoder.cpython-310.pyc +0 -0
README.md CHANGED
@@ -1,27 +1,25 @@
1
  ---
2
  title: DeepSound-V1
3
- colorFrom: indigo
4
- colorTo: purple
 
5
  sdk: gradio
6
- sdk_version: 5.22.0
7
  app_file: app.py
8
  pinned: false
9
  ---
10
 
11
- <div align="center">
12
- <p align="center">
13
- <h2>MMAudio</h2>
14
- <a href="https://arxiv.org/abs/2412.15322">Paper</a> | <a href="https://hkchengrex.github.io/MMAudio">Webpage</a> | <a href="https://huggingface.co/hkchengrex/MMAudio/tree/main">Models</a> | <a href="https://huggingface.co/spaces/hkchengrex/MMAudio"> Huggingface Demo</a> | <a href="https://colab.research.google.com/drive/1TAaXCY2-kPk4xE4PwKB3EqFbSnkUuzZ8?usp=sharing">Colab Demo</a> | <a href="https://replicate.com/zsxkib/mmaudio">Replicate Demo</a>
15
- </p>
16
- </div>
17
 
18
- ## [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)
19
 
20
  [Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)
21
 
22
  University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation
23
 
24
- CVPR 2025
25
 
26
  ## Highlight
27
 
@@ -29,25 +27,22 @@ MMAudio generates synchronized audio given video and/or text inputs.
29
  Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.
30
  Moreover, a synchronization module aligns the generated audio with the video frames.
31
 
 
32
  ## Results
33
 
34
  (All audio from our algorithm MMAudio)
35
 
36
- Videos from Sora:
37
 
38
  https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330
39
 
40
- Videos from Veo 2:
41
-
42
- https://github.com/user-attachments/assets/8a11419e-fee2-46e0-9e67-dfb03c48d00e
43
 
44
- Videos from MovieGen/Hunyuan Video/VGGSound:
45
 
46
  https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca
47
 
48
  For more results, visit https://hkchengrex.com/MMAudio/video_main.html.
49
 
50
-
51
  ## Installation
52
 
53
  We have only tested this on Ubuntu.
@@ -56,30 +51,17 @@ We have only tested this on Ubuntu.
56
 
57
  We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.
58
 
59
- - Python 3.9+
60
- - PyTorch **2.5.1+** and corresponding torchvision/torchaudio (pick your CUDA version https://pytorch.org/, pip install recommended)
61
- <!-- - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`) -->
62
 
63
- **1. Install prerequisite if not yet met:**
64
-
65
- ```bash
66
- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
67
- ```
68
-
69
- (Or any other CUDA versions that your GPUs/driver support)
70
-
71
- <!-- ```
72
- conda install -c conda-forge 'ffmpeg<7
73
- ```
74
- (Optional, if you use miniforge and don't already have the appropriate ffmpeg) -->
75
-
76
- **2. Clone our repository:**
77
 
78
  ```bash
79
  git clone https://github.com/hkchengrex/MMAudio.git
80
  ```
81
 
82
- **3. Install with pip (install pytorch first before attempting this!):**
83
 
84
  ```bash
85
  cd MMAudio
@@ -88,108 +70,94 @@ pip install -e .
88
 
89
  (If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
90
 
91
-
92
  **Pretrained models:**
93
 
94
- The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
95
- The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
96
- See [MODELS.md](docs/MODELS.md) for more details.
97
 
98
  ## Demo
99
 
100
- By default, these scripts use the `large_44k_v2` model.
101
  In our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode) which should fit in most modern GPUs.
102
 
103
  ### Command-line interface
104
 
105
  With `demo.py`
106
-
107
  ```bash
108
  python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
109
  ```
110
-
111
  The output (audio in `.flac` format, and video in `.mp4` format) will be saved in `./output`.
112
  See the file for more options.
113
  Simply omit the `--video` option for text-to-audio synthesis.
114
  The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in a lower quality.
115
 
 
116
  ### Gradio interface
117
 
118
  Supports video-to-audio and text-to-audio synthesis.
119
- You can also try experimental image-to-audio synthesis which duplicates the input image to a video for processing. This might be interesting to some but it is not something MMAudio has been trained for.
120
- Use [port forwarding](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) (e.g., `ssh -L 7860:localhost:7860 server`) if necessary. The default port is `7860` which you can specify with `--port`.
121
 
122
- ```bash
123
  python gradio_demo.py
124
  ```
125
 
126
- ### FAQ
127
-
128
- 1. Video processing
129
- - Processing higher-resolution videos takes longer due to encoding and decoding (which can take >95% of the processing time!), but it does not improve the quality of results.
130
- - The CLIP encoder resizes input frames to 384×384 pixels.
131
- - Synchformer resizes the shorter edge to 224 pixels and applies a center crop, focusing only on the central square of each frame.
132
- 2. Frame rates
133
- - The CLIP model operates at 8 FPS, while Synchformer works at 25 FPS.
134
- - Frame rate conversion happens on-the-fly via the video reader.
135
- - For input videos with a frame rate below 25 FPS, frames will be duplicated to match the required rate.
136
- 3. Failure cases
137
- As with most models of this type, failures can occur, and the reasons are not always clear. Below are some known failure modes. If you notice a failure mode or believe there’s a bug, feel free to open an issue in the repository.
138
- 4. Performance variations
139
- We notice that there can be subtle performance variations in different hardware and software environments. Some of the reasons include using/not using `torch.compile`, video reader library/backend, inference precision, batch sizes, random seeds, etc. We (will) provide pre-computed results on standard benchmark for reference. Results obtained from this codebase should be similar but might not be exactly the same.
140
-
141
  ### Known limitations
142
 
143
- 1. The model sometimes generates unintelligible human speech-like sounds
144
- 2. The model sometimes generates background music (without explicit training, it would not be high quality)
145
  3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfires" but not "RPG firing".
146
 
147
  We believe all of these three limitations can be addressed with more high-quality training data.
148
 
149
  ## Training
150
-
151
- See [TRAINING.md](docs/TRAINING.md).
152
 
153
  ## Evaluation
154
-
155
- See [EVAL.md](docs/EVAL.md).
156
-
157
- ## Training Datasets
158
-
159
- MMAudio was trained on several datasets, including [AudioSet](https://research.google.com/audioset/), [Freesound](https://github.com/LAION-AI/audio-dataset/blob/main/laion-audio-630k/README.md), [VGGSound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [AudioCaps](https://audiocaps.github.io/), and [WavCaps](https://github.com/XinhaoMei/WavCaps). These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.
160
-
161
- ## Update Logs
162
-
163
- - 2025-03-09: Uploaded the corrected tsv files. See [TRAINING.md](docs/TRAINING.md).
164
- - 2025-02-27: Disabled the GradScaler by default to improve training stability. See #49.
165
- - 2024-12-23: Added training and batch evaluation scripts.
166
- - 2024-12-14: Removed the `ffmpeg<7` requirement for the demos by replacing `torio.io.StreamingMediaDecoder` with `pyav` for reading frames. The read frames are also cached, so we are not reading the same frames again during reconstruction. This should speed things up and make installation less of a hassle.
167
- - 2024-12-13: Improved for-loop processing in CLIP/Sync feature extraction by introducing a batch size multiplier. We can approximately use 40x batch size for CLIP/Sync without using more memory, thereby speeding up processing. Removed VAE encoder during inference -- we don't need it.
168
- - 2024-12-11: Replaced `torio.io.StreamingMediaDecoder` with `pyav` for reading framerate when reconstructing the input video. `torio.io.StreamingMediaDecoder` does not work reliably in huggingface ZeroGPU's environment, and I suspect that it might not work in some other environments as well.
169
-
170
- ## Citation
171
-
172
- ```bibtex
173
- @inproceedings{cheng2025taming,
174
- title={Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
175
- author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
176
- booktitle={CVPR},
177
- year={2025}
178
- }
179
- ```
180
-
181
- ## Relevant Repositories
182
-
183
- - [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.
184
-
185
- ## Disclaimer
186
-
187
- We have no affiliation with and have no knowledge of the party behind the domain "mmaudio.net".
188
 
189
  ## Acknowledgement
190
-
191
  Many thanks to:
192
- - [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model and the VAE architecture
193
  - [BigVGAN](https://github.com/NVIDIA/BigVGAN)
194
  - [Synchformer](https://github.com/v-iashin/Synchformer)
195
- - [EDM2](https://github.com/NVlabs/edm2) for the magnitude-preserving VAE network architecture
 
1
  ---
2
  title: DeepSound-V1
3
+ emoji: 🔊
4
+ colorFrom: blue
5
+ colorTo: indigo
6
  sdk: gradio
 
7
  app_file: app.py
8
  pinned: false
9
  ---
10
 
11
 
12
+ # [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)
13
 
14
  [Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)
15
 
16
  University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation
17
 
18
+
19
+ [[Paper (being prepared)]](https://hkchengrex.github.io/MMAudio) [[Project Page]](https://hkchengrex.github.io/MMAudio)
20
+
21
+
22
+ **Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added. Code is subject to non-backward-compatible changes.**
23
 
24
  ## Highlight
25
 
 
27
  Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.
28
  Moreover, a synchronization module aligns the generated audio with the video frames.
29
 
30
+
31
  ## Results
32
 
33
  (All audio from our algorithm MMAudio)
34
 
35
+ Videos from Sora:
36
 
37
  https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330
38
 
39
 
40
+ Videos from MovieGen/Hunyuan Video/VGGSound:
41
 
42
  https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca
43
 
44
  For more results, visit https://hkchengrex.com/MMAudio/video_main.html.
45
 
 
46
  ## Installation
47
 
48
  We have only tested this on Ubuntu.
 
51
 
52
  We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.
53
 
54
+ - Python 3.8+
55
+ - PyTorch **2.5.1+** and corresponding torchvision/torchaudio (pick your CUDA version https://pytorch.org/)
56
+ - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`)
57
 
58
+ **Clone our repository:**
59
 
60
  ```bash
61
  git clone https://github.com/hkchengrex/MMAudio.git
62
  ```
63
 
64
+ **Install with pip:**
65
 
66
  ```bash
67
  cd MMAudio
 
70
 
71
  (If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
72
 
 
73
  **Pretrained models:**
74
 
75
+ The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`
76
+
77
+ | Model | Download link | File size |
78
+ | -------- | ------- | ------- |
79
+ | Flow prediction network, small 16kHz | <a href="https://databank.illinois.edu/datafiles/k6jve/download" download="mmaudio_small_16k.pth">mmaudio_small_16k.pth</a> | 601M |
80
+ | Flow prediction network, small 44.1kHz | <a href="https://databank.illinois.edu/datafiles/864ya/download" download="mmaudio_small_44k.pth">mmaudio_small_44k.pth</a> | 601M |
81
+ | Flow prediction network, medium 44.1kHz | <a href="https://databank.illinois.edu/datafiles/pa94t/download" download="mmaudio_medium_44k.pth">mmaudio_medium_44k.pth</a> | 2.4G |
82
+ | Flow prediction network, large 44.1kHz **(recommended)** | <a href="https://databank.illinois.edu/datafiles/4jx76/download" download="mmaudio_large_44k.pth">mmaudio_large_44k.pth</a> | 3.9G |
83
+ | 16kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-16.pth">v1-16.pth</a> | 655M |
84
+ | 16kHz BigVGAN vocoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/best_netG.pt">best_netG.pt</a> | 429M |
85
+ | 44.1kHz VAE |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth">v1-44.pth</a> | 1.2G |
86
+ | Synchformer visual encoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth">synchformer_state_dict.pth</a> | 907M |
87
+
88
+ The 44.1kHz vocoder will be downloaded automatically.
89
+
90
+ The expected directory structure (full):
91
+
92
+ ```bash
93
+ MMAudio
94
+ ├── ext_weights
95
+ │ ├── best_netG.pt
96
+ │ ├── synchformer_state_dict.pth
97
+ │ ├── v1-16.pth
98
+ │ └── v1-44.pth
99
+ ├── weights
100
+ │ ├── mmaudio_small_16k.pth
101
+ │ ├── mmaudio_small_44k.pth
102
+ │ ├── mmaudio_medium_44k.pth
103
+ │ └── mmaudio_large_44k.pth
104
+ └── ...
105
+ ```
106
+
107
+ The expected directory structure (minimal, for the recommended model only):
108
+
109
+ ```bash
110
+ MMAudio
111
+ ├── ext_weights
112
+ │ ├── synchformer_state_dict.pth
113
+ │ └── v1-44.pth
114
+ ├── weights
115
+ │ └── mmaudio_large_44k.pth
116
+ └── ...
117
+ ```
118
 
119
  ## Demo
120
 
121
+ By default, these scripts use the `large_44k` model.
122
  In our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode) which should fit in most modern GPUs.
123
 
124
  ### Command-line interface
125
 
126
  With `demo.py`
 
127
  ```bash
128
  python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
129
  ```
 
130
  The output (audio in `.flac` format, and video in `.mp4` format) will be saved in `./output`.
131
  See the file for more options.
132
  Simply omit the `--video` option for text-to-audio synthesis.
133
  The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in a lower quality.
134
 
135
+
136
  ### Gradio interface
137
 
138
  Supports video-to-audio and text-to-audio synthesis.
 
 
139
 
140
+ ```
141
  python gradio_demo.py
142
  ```
143
 
144
  ### Known limitations
145
 
146
+ 1. The model sometimes generates undesired unintelligible human speech-like sounds
147
+ 2. The model sometimes generates undesired background music
148
  3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfires" but not "RPG firing".
149
 
150
  We believe all of these three limitations can be addressed with more high-quality training data.
151
 
152
  ## Training
153
+ Work in progress.
 
154
 
155
  ## Evaluation
156
+ Work in progress.
157
 
158
  ## Acknowledgement
 
159
  Many thanks to:
160
+ - [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model
161
  - [BigVGAN](https://github.com/NVIDIA/BigVGAN)
162
  - [Synchformer](https://github.com/v-iashin/Synchformer)
163
+
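Both demo paths above go through the same Python entry points that appear later in this diff (`app.py`, `demo.py`, and the deleted `batch_eval.py`). A minimal text-to-audio sketch pieced together from those calls is shown below; the checkpoint paths come from the `ModelConfig`, and passing `None` for the two video streams is an assumption based on the text-to-audio tab, not a documented API.

```python
import torch
import torchaudio

from mmaudio.eval_utils import all_model_cfg, generate
from mmaudio.model.flow_matching import FlowMatching
from mmaudio.model.networks import MMAudio, get_my_mmaudio
from mmaudio.model.utils.features_utils import FeaturesUtils

device = 'cuda'

# Pick a variant and make sure its weights are on disk (paths live in the ModelConfig).
model = all_model_cfg['large_44k_v2']
model.download_if_needed()
seq_cfg = model.seq_cfg
seq_cfg.duration = 8.0  # matches the training duration

net: MMAudio = get_my_mmaudio('large_44k_v2').to(device).eval()
net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)

feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
                              synchformer_ckpt=model.synchformer_ckpt,
                              enable_conditions=True,
                              mode=model.mode,
                              bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
                              need_vae_encoder=False).to(device).eval()

fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=25)
rng = torch.Generator(device=device).manual_seed(42)

# Assumption: None for the clip/sync streams means pure text conditioning.
audios = generate(None, None, ['rain hitting a tin roof'],
                  negative_text=[''],
                  feature_utils=feature_utils, net=net, fm=fm, rng=rng,
                  cfg_strength=4.5)
torchaudio.save('output.flac', audios.float().cpu()[0], seq_cfg.sampling_rate)
```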
app.py CHANGED
@@ -1,33 +1,33 @@
1
- import gc
2
  import logging
3
- from argparse import ArgumentParser
4
  from datetime import datetime
5
- from fractions import Fraction
6
  from pathlib import Path
7
 
8
  import gradio as gr
9
  import torch
10
  import torchaudio
 
11
 
12
- from mmaudio.eval_utils import (ModelConfig, VideoInfo, all_model_cfg, generate, load_image,
13
- load_video, make_video, setup_eval_logging)
 
14
  from mmaudio.model.flow_matching import FlowMatching
15
  from mmaudio.model.networks import MMAudio, get_my_mmaudio
16
  from mmaudio.model.sequence_config import SequenceConfig
17
  from mmaudio.model.utils.features_utils import FeaturesUtils
 
18
 
19
  torch.backends.cuda.matmul.allow_tf32 = True
20
  torch.backends.cudnn.allow_tf32 = True
21
 
22
  log = logging.getLogger()
23
 
24
- device = 'cpu'
25
- if torch.cuda.is_available():
26
- device = 'cuda'
27
- elif torch.backends.mps.is_available():
28
- device = 'mps'
29
- else:
30
- log.warning('CUDA/MPS are not available, running on CPU')
31
  dtype = torch.bfloat16
32
 
33
  model: ModelConfig = all_model_cfg['large_44k_v2']
@@ -58,6 +58,7 @@ def get_model() -> tuple[MMAudio, FeaturesUtils, SequenceConfig]:
58
  net, feature_utils, seq_cfg = get_model()
59
 
60
 
 
61
  @torch.inference_mode()
62
  def video_to_audio(video: gr.Video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
63
  cfg_strength: float, duration: float):
@@ -88,53 +89,16 @@ def video_to_audio(video: gr.Video, prompt: str, negative_prompt: str, seed: int
88
  cfg_strength=cfg_strength)
89
  audio = audios.float().cpu()[0]
90
 
91
- current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
92
- output_dir.mkdir(exist_ok=True, parents=True)
93
- video_save_path = output_dir / f'{current_time_string}.mp4'
94
- make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
95
- gc.collect()
96
- return video_save_path
97
-
98
-
99
- @torch.inference_mode()
100
- def image_to_audio(image: gr.Image, prompt: str, negative_prompt: str, seed: int, num_steps: int,
101
- cfg_strength: float, duration: float):
102
-
103
- rng = torch.Generator(device=device)
104
- if seed >= 0:
105
- rng.manual_seed(seed)
106
- else:
107
- rng.seed()
108
- fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
109
-
110
- image_info = load_image(image)
111
- clip_frames = image_info.clip_frames
112
- sync_frames = image_info.sync_frames
113
- clip_frames = clip_frames.unsqueeze(0)
114
- sync_frames = sync_frames.unsqueeze(0)
115
- seq_cfg.duration = duration
116
- net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
117
-
118
- audios = generate(clip_frames,
119
- sync_frames, [prompt],
120
- negative_text=[negative_prompt],
121
- feature_utils=feature_utils,
122
- net=net,
123
- fm=fm,
124
- rng=rng,
125
- cfg_strength=cfg_strength,
126
- image_input=True)
127
- audio = audios.float().cpu()[0]
128
-
129
- current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
130
- output_dir.mkdir(exist_ok=True, parents=True)
131
- video_save_path = output_dir / f'{current_time_string}.mp4'
132
- video_info = VideoInfo.from_image_info(image_info, duration, fps=Fraction(1))
133
  make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
134
- gc.collect()
135
  return video_save_path
136
 
137
 
 
138
  @torch.inference_mode()
139
  def text_to_audio(prompt: str, negative_prompt: str, seed: int, num_steps: int, cfg_strength: float,
140
  duration: float):
@@ -160,11 +124,9 @@ def text_to_audio(prompt: str, negative_prompt: str, seed: int, num_steps: int,
160
  cfg_strength=cfg_strength)
161
  audio = audios.float().cpu()[0]
162
 
163
- current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
164
- output_dir.mkdir(exist_ok=True, parents=True)
165
- audio_save_path = output_dir / f'{current_time_string}.flac'
166
  torchaudio.save(audio_save_path, audio, seq_cfg.sampling_rate)
167
- gc.collect()
168
  return audio_save_path
169
 
170
 
@@ -176,6 +138,8 @@ video_to_audio_tab = gr.Interface(
176
 
177
  NOTE: It takes longer to process high-resolution videos (>384 px on the shorter side).
178
  Doing so does not improve results.
 
 
179
  """,
180
  inputs=[
181
  gr.Video(),
@@ -245,8 +209,8 @@ video_to_audio_tab = gr.Interface(
245
  10,
246
  ],
247
  [
248
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/mochi_storm.mp4',
249
- 'storm',
250
  '',
251
  0,
252
  25,
@@ -254,8 +218,8 @@ video_to_audio_tab = gr.Interface(
254
  10,
255
  ],
256
  [
257
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_spring.mp4',
258
- '',
259
  '',
260
  0,
261
  25,
@@ -263,8 +227,8 @@ video_to_audio_tab = gr.Interface(
263
  10,
264
  ],
265
  [
266
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_typing.mp4',
267
- 'typing',
268
  '',
269
  0,
270
  25,
@@ -272,8 +236,8 @@ video_to_audio_tab = gr.Interface(
272
  10,
273
  ],
274
  [
275
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_wake_up.mp4',
276
- '',
277
  '',
278
  0,
279
  25,
@@ -281,7 +245,7 @@ video_to_audio_tab = gr.Interface(
281
  10,
282
  ],
283
  [
284
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_nyc.mp4',
285
  '',
286
  '',
287
  0,
@@ -293,10 +257,6 @@ video_to_audio_tab = gr.Interface(
293
 
294
  text_to_audio_tab = gr.Interface(
295
  fn=text_to_audio,
296
- description="""
297
- Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
298
- Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
299
- """,
300
  inputs=[
301
  gr.Text(label='Prompt'),
302
  gr.Text(label='Negative prompt'),
@@ -310,34 +270,6 @@ text_to_audio_tab = gr.Interface(
310
  title='MMAudio — Text-to-Audio Synthesis',
311
  )
312
 
313
- image_to_audio_tab = gr.Interface(
314
- fn=image_to_audio,
315
- description="""
316
- Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
317
- Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
318
-
319
- NOTE: It takes longer to process high-resolution images (>384 px on the shorter side).
320
- Doing so does not improve results.
321
- """,
322
- inputs=[
323
- gr.Image(type='filepath'),
324
- gr.Text(label='Prompt'),
325
- gr.Text(label='Negative prompt'),
326
- gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
327
- gr.Number(label='Num steps', value=25, precision=0, minimum=1),
328
- gr.Number(label='Guidance Strength', value=4.5, minimum=1),
329
- gr.Number(label='Duration (sec)', value=8, minimum=1),
330
- ],
331
- outputs='playable_video',
332
- cache_examples=False,
333
- title='MMAudio — Image-to-Audio Synthesis (experimental)',
334
- )
335
-
336
  if __name__ == "__main__":
337
- parser = ArgumentParser()
338
- parser.add_argument('--port', type=int, default=7860)
339
- args = parser.parse_args()
340
-
341
- gr.TabbedInterface([video_to_audio_tab, text_to_audio_tab, image_to_audio_tab],
342
- ['Video-to-Audio', 'Text-to-Audio', 'Image-to-Audio (experimental)']).launch(
343
- server_port=args.port, allowed_paths=[output_dir])
 
1
+ import spaces
2
  import logging
 
3
  from datetime import datetime
 
4
  from pathlib import Path
5
 
6
  import gradio as gr
7
  import torch
8
  import torchaudio
9
+ import os
10
 
11
+ try:
12
+ import mmaudio
13
+ except ImportError:
14
+ os.system("pip install -e .")
15
+ import mmaudio
16
+
17
+ from mmaudio.eval_utils import (ModelConfig, all_model_cfg, generate, load_video, make_video,
18
+ setup_eval_logging)
19
  from mmaudio.model.flow_matching import FlowMatching
20
  from mmaudio.model.networks import MMAudio, get_my_mmaudio
21
  from mmaudio.model.sequence_config import SequenceConfig
22
  from mmaudio.model.utils.features_utils import FeaturesUtils
23
+ import tempfile
24
 
25
  torch.backends.cuda.matmul.allow_tf32 = True
26
  torch.backends.cudnn.allow_tf32 = True
27
 
28
  log = logging.getLogger()
29
 
30
+ device = 'cuda'
 
  dtype = torch.bfloat16
32
 
33
  model: ModelConfig = all_model_cfg['large_44k_v2']
 
58
  net, feature_utils, seq_cfg = get_model()
59
 
60
 
61
+ @spaces.GPU(duration=120)
62
  @torch.inference_mode()
63
  def video_to_audio(video: gr.Video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
64
  cfg_strength: float, duration: float):
 
89
  cfg_strength=cfg_strength)
90
  audio = audios.float().cpu()[0]
91
 
92
+ # current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
93
+ video_save_path = tempfile.NamedTemporaryFile(delete=False, suffix='.mp4').name
94
+ # output_dir.mkdir(exist_ok=True, parents=True)
95
+ # video_save_path = output_dir / f'{current_time_string}.mp4'
96
  make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
97
+ log.info(f'Saved video to {video_save_path}')
98
  return video_save_path
99
 
100
 
101
+ @spaces.GPU(duration=120)
102
  @torch.inference_mode()
103
  def text_to_audio(prompt: str, negative_prompt: str, seed: int, num_steps: int, cfg_strength: float,
104
  duration: float):
 
124
  cfg_strength=cfg_strength)
125
  audio = audios.float().cpu()[0]
126
 
127
+ audio_save_path = tempfile.NamedTemporaryFile(delete=False, suffix='.flac').name
 
 
128
  torchaudio.save(audio_save_path, audio, seq_cfg.sampling_rate)
129
+ log.info(f'Saved audio to {audio_save_path}')
130
  return audio_save_path
131
 
132
 
 
138
 
139
  NOTE: It takes longer to process high-resolution videos (>384 px on the shorter side).
140
  Doing so does not improve results.
141
+
142
+ The model has been trained on 8-second videos. Using much longer or shorter videos will degrade performance. Around 5s~12s should be fine.
143
  """,
144
  inputs=[
145
  gr.Video(),
 
209
  10,
210
  ],
211
  [
212
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_nyc.mp4',
213
+ '',
214
  '',
215
  0,
216
  25,
 
218
  10,
219
  ],
220
  [
221
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/mochi_storm.mp4',
222
+ 'storm',
223
  '',
224
  0,
225
  25,
 
227
  10,
228
  ],
229
  [
230
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_spring.mp4',
231
+ '',
232
  '',
233
  0,
234
  25,
 
236
  10,
237
  ],
238
  [
239
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_typing.mp4',
240
+ 'typing',
241
  '',
242
  0,
243
  25,
 
245
  10,
246
  ],
247
  [
248
+ 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_wake_up.mp4',
249
  '',
250
  '',
251
  0,
 
257
 
258
  text_to_audio_tab = gr.Interface(
259
  fn=text_to_audio,
260
  inputs=[
261
  gr.Text(label='Prompt'),
262
  gr.Text(label='Negative prompt'),
 
270
  title='MMAudio — Text-to-Audio Synthesis',
271
  )
272
 
273
  if __name__ == "__main__":
274
+ gr.TabbedInterface([video_to_audio_tab, text_to_audio_tab],
275
+ ['Video-to-Audio', 'Text-to-Audio']).launch(allowed_paths=[output_dir])
 
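The rewritten `app.py` above stops writing time-stamped files into `./output` and instead hands Gradio a path from `tempfile`, which suits the ephemeral storage of a Hugging Face Space. The pattern in isolation, as a sketch with a dummy waveform standing in for the generated audio:

```python
import tempfile

import torch
import torchaudio

# Stand-in for the model output: 1 channel, 1 second of silence at 44.1 kHz.
audio = torch.zeros(1, 44100)

# delete=False keeps the file after the handle closes so Gradio can still serve it;
# the Space's ephemeral filesystem is left to clean it up.
audio_save_path = tempfile.NamedTemporaryFile(delete=False, suffix='.flac').name
torchaudio.save(audio_save_path, audio, 44100)
print(f'Saved audio to {audio_save_path}')
```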
batch_eval.py DELETED
@@ -1,110 +0,0 @@
1
- import logging
2
- import os
3
- from pathlib import Path
4
-
5
- import hydra
6
- import torch
7
- import torch.distributed as distributed
8
- import torchaudio
9
- from hydra.core.hydra_config import HydraConfig
10
- from omegaconf import DictConfig
11
- from tqdm import tqdm
12
-
13
- from mmaudio.data.data_setup import setup_eval_dataset
14
- from mmaudio.eval_utils import ModelConfig, all_model_cfg, generate
15
- from mmaudio.model.flow_matching import FlowMatching
16
- from mmaudio.model.networks import MMAudio, get_my_mmaudio
17
- from mmaudio.model.utils.features_utils import FeaturesUtils
18
-
19
- torch.backends.cuda.matmul.allow_tf32 = True
20
- torch.backends.cudnn.allow_tf32 = True
21
-
22
- local_rank = int(os.environ['LOCAL_RANK'])
23
- world_size = int(os.environ['WORLD_SIZE'])
24
- log = logging.getLogger()
25
-
26
-
27
- @torch.inference_mode()
28
- @hydra.main(version_base='1.3.2', config_path='config', config_name='eval_config.yaml')
29
- def main(cfg: DictConfig):
30
- device = 'cuda'
31
- torch.cuda.set_device(local_rank)
32
-
33
- if cfg.model not in all_model_cfg:
34
- raise ValueError(f'Unknown model variant: {cfg.model}')
35
- model: ModelConfig = all_model_cfg[cfg.model]
36
- model.download_if_needed()
37
- seq_cfg = model.seq_cfg
38
-
39
- run_dir = Path(HydraConfig.get().run.dir)
40
- if cfg.output_name is None:
41
- output_dir = run_dir / cfg.dataset
42
- else:
43
- output_dir = run_dir / f'{cfg.dataset}-{cfg.output_name}'
44
- output_dir.mkdir(parents=True, exist_ok=True)
45
-
46
- # load a pretrained model
47
- seq_cfg.duration = cfg.duration_s
48
- net: MMAudio = get_my_mmaudio(cfg.model).to(device).eval()
49
- net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
50
- log.info(f'Loaded weights from {model.model_path}')
51
- net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
52
- log.info(f'Latent seq len: {seq_cfg.latent_seq_len}')
53
- log.info(f'Clip seq len: {seq_cfg.clip_seq_len}')
54
- log.info(f'Sync seq len: {seq_cfg.sync_seq_len}')
55
-
56
- # misc setup
57
- rng = torch.Generator(device=device)
58
- rng.manual_seed(cfg.seed)
59
- fm = FlowMatching(cfg.sampling.min_sigma,
60
- inference_mode=cfg.sampling.method,
61
- num_steps=cfg.sampling.num_steps)
62
-
63
- feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
64
- synchformer_ckpt=model.synchformer_ckpt,
65
- enable_conditions=True,
66
- mode=model.mode,
67
- bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
68
- need_vae_encoder=False)
69
- feature_utils = feature_utils.to(device).eval()
70
-
71
- if cfg.compile:
72
- net.preprocess_conditions = torch.compile(net.preprocess_conditions)
73
- net.predict_flow = torch.compile(net.predict_flow)
74
- feature_utils.compile()
75
-
76
- dataset, loader = setup_eval_dataset(cfg.dataset, cfg)
77
-
78
- with torch.amp.autocast(enabled=cfg.amp, dtype=torch.bfloat16, device_type=device):
79
- for batch in tqdm(loader):
80
- audios = generate(batch.get('clip_video', None),
81
- batch.get('sync_video', None),
82
- batch.get('caption', None),
83
- feature_utils=feature_utils,
84
- net=net,
85
- fm=fm,
86
- rng=rng,
87
- cfg_strength=cfg.cfg_strength,
88
- clip_batch_size_multiplier=64,
89
- sync_batch_size_multiplier=64)
90
- audios = audios.float().cpu()
91
- names = batch['name']
92
- for audio, name in zip(audios, names):
93
- torchaudio.save(output_dir / f'{name}.flac', audio, seq_cfg.sampling_rate)
94
-
95
-
96
- def distributed_setup():
97
- distributed.init_process_group(backend="nccl")
98
- local_rank = distributed.get_rank()
99
- world_size = distributed.get_world_size()
100
- log.info(f'Initialized: local_rank={local_rank}, world_size={world_size}')
101
- return local_rank, world_size
102
-
103
-
104
- if __name__ == '__main__':
105
- distributed_setup()
106
-
107
- main()
108
-
109
- # clean-up
110
- distributed.destroy_process_group()
 
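The deleted `batch_eval.py` assumed a `torchrun` launch: the launcher sets `LOCAL_RANK` and `WORLD_SIZE` for every worker, and the script read them before initializing NCCL. Its rank plumbing, reduced to a standalone sketch (requires GPUs; run with `torchrun --standalone --nproc_per_node=<n> sketch.py`):

```python
import os

import torch
import torch.distributed as distributed

# torchrun exports these environment variables for each spawned worker.
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])

distributed.init_process_group(backend='nccl')
torch.cuda.set_device(local_rank)

# Each rank would process its own shard of the evaluation set here.
print(f'rank {local_rank}/{world_size} ready')

distributed.destroy_process_group()
```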
config/__init__.py DELETED
File without changes
config/base_config.yaml DELETED
@@ -1,62 +0,0 @@
1
- defaults:
2
- - data: base
3
- - eval_data: base
4
- - override hydra/job_logging: custom-simplest
5
- - _self_
6
-
7
- hydra:
8
- run:
9
- dir: ./output/${exp_id}
10
- output_subdir: ${now:%Y-%m-%d_%H-%M-%S}-hydra
11
-
12
- enable_email: False
13
-
14
- model: small_16k
15
-
16
- exp_id: default
17
- debug: False
18
- cudnn_benchmark: True
19
- compile: True
20
- amp: True
21
- weights: null
22
- checkpoint: null
23
- seed: 14159265
24
- num_workers: 10 # per-GPU
25
- pin_memory: False # set to True if your system can handle it, i.e., have enough memory
26
-
27
- # NOTE: This DOES NOT affect the model during inference in any way
28
- # they are just for the dataloader to fill in the missing data in multi-modal loading
29
- # to change the sequence length for the model, see networks.py
30
- data_dim:
31
- text_seq_len: 77
32
- clip_dim: 1024
33
- sync_dim: 768
34
- text_dim: 1024
35
-
36
- # ema configuration
37
- ema:
38
- enable: True
39
- sigma_rels: [0.05, 0.1]
40
- update_every: 1
41
- checkpoint_every: 5_000
42
- checkpoint_folder: ${hydra:run.dir}/ema_ckpts
43
- default_output_sigma: 0.05
44
-
45
-
46
- # sampling
47
- sampling:
48
- mean: 0.0
49
- scale: 1.0
50
- min_sigma: 0.0
51
- method: euler
52
- num_steps: 25
53
-
54
- # classifier-free guidance
55
- null_condition_probability: 0.1
56
- cfg_strength: 4.5
57
-
58
- # checkpoint paths to external modules
59
- vae_16k_ckpt: ./ext_weights/v1-16.pth
60
- vae_44k_ckpt: ./ext_weights/v1-44.pth
61
- bigvgan_vocoder_ckpt: ./ext_weights/best_netG.pt
62
- synchformer_ckpt: ./ext_weights/synchformer_state_dict.pth
 
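These YAML files were consumed through Hydra (see the `@hydra.main` decorator in the deleted `batch_eval.py`). A minimal sketch of how the composed config was read and overridden from the command line, assuming the `config/` directory from the parent commit still sits next to the script:

```python
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base='1.3.2', config_path='config', config_name='eval_config.yaml')
def main(cfg: DictConfig) -> None:
    # Defaults come from base_config.yaml / eval_config.yaml; any key can be overridden
    # on the CLI, e.g. `python show_config.py model=small_16k duration_s=8 compile=False`.
    print(OmegaConf.to_yaml(cfg))
    print(cfg.model, cfg.sampling.num_steps, cfg.cfg_strength)


if __name__ == '__main__':
    main()
```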
config/data/base.yaml DELETED
@@ -1,70 +0,0 @@
1
- VGGSound:
2
- root: ../data/video
3
- subset_name: sets/vgg3-train.tsv
4
- fps: 8
5
- height: 384
6
- width: 384
7
- sample_duration_sec: 8.0
8
-
9
- VGGSound_test:
10
- root: ../data/video
11
- subset_name: sets/vgg3-test.tsv
12
- fps: 8
13
- height: 384
14
- width: 384
15
- sample_duration_sec: 8.0
16
-
17
- VGGSound_val:
18
- root: ../data/video
19
- subset_name: sets/vgg3-val.tsv
20
- fps: 8
21
- height: 384
22
- width: 384
23
- sample_duration_sec: 8.0
24
-
25
- ExtractedVGG:
26
- tsv: ../data/v1-16-memmap/vgg-train.tsv
27
- memmap_dir: ../data/v1-16-memmap/vgg-train
28
-
29
- ExtractedVGG_test:
30
- tag: test
31
- gt_cache: ../data/eval-cache/vggsound-test
32
- output_subdir: null
33
- tsv: ../data/v1-16-memmap/vgg-test.tsv
34
- memmap_dir: ../data/v1-16-memmap/vgg-test
35
-
36
- ExtractedVGG_val:
37
- tag: val
38
- gt_cache: ../data/eval-cache/vggsound-val
39
- output_subdir: val
40
- tsv: ../data/v1-16-memmap/vgg-val.tsv
41
- memmap_dir: ../data/v1-16-memmap/vgg-val
42
-
43
- AudioCaps:
44
- tsv: ../data/v1-16-memmap/audiocaps.tsv
45
- memmap_dir: ../data/v1-16-memmap/audiocaps
46
-
47
- AudioSetSL:
48
- tsv: ../data/v1-16-memmap/audioset_sl.tsv
49
- memmap_dir: ../data/v1-16-memmap/audioset_sl
50
-
51
- BBCSound:
52
- tsv: ../data/v1-16-memmap/bbcsound.tsv
53
- memmap_dir: ../data/v1-16-memmap/bbcsound
54
-
55
- FreeSound:
56
- tsv: ../data/v1-16-memmap/freesound.tsv
57
- memmap_dir: ../data/v1-16-memmap/freesound
58
-
59
- Clotho:
60
- tsv: ../data/v1-16-memmap/clotho.tsv
61
- memmap_dir: ../data/v1-16-memmap/clotho
62
-
63
- Example_video:
64
- tsv: ./training/example_output/memmap/vgg-example.tsv
65
- memmap_dir: ./training/example_output/memmap/vgg-example
66
-
67
- Example_audio:
68
- tsv: ./training/example_output/memmap/audio-example.tsv
69
- memmap_dir: ./training/example_output/memmap/audio-example
70
-
 
config/eval_config.yaml DELETED
@@ -1,17 +0,0 @@
1
- defaults:
2
- - base_config
3
- - override hydra/job_logging: custom-simplest
4
- - _self_
5
-
6
- hydra:
7
- run:
8
- dir: ./output/${exp_id}
9
- output_subdir: eval-${now:%Y-%m-%d_%H-%M-%S}-hydra
10
-
11
- exp_id: ${model}
12
- dataset: audiocaps
13
- duration_s: 8.0
14
-
15
- # for inference, this is the per-GPU batch size
16
- batch_size: 16
17
- output_name: null
 
config/eval_data/base.yaml DELETED
@@ -1,22 +0,0 @@
1
- AudioCaps:
2
- audio_path: ../data/AudioCaps-test-audioldm-ver
3
- # a csv file, with a header row of 'name' and 'caption'
4
- # name should match the audio file name without extension
5
- # Can be downloaded here: https://github.com/hkchengrex/MMAudio/releases/download/v0.1/AudioCaps_audioldm_data.csv
6
- csv_path: ../data/AudioCaps-test-audioldm-ver/data.csv
7
-
8
- AudioCaps_full:
9
- audio_path: ../data/AudioCaps-test-full-ver
10
- # a csv file, with a header row of 'name' and 'caption'
11
- # name should match the audio file name without extension
12
- # Can be downloaded here: https://github.com/hkchengrex/MMAudio/releases/download/v0.1/AudioCaps_full_data.csv
13
- csv_path: ../data/AudioCaps-test-full-ver/data.csv
14
-
15
- MovieGen:
16
- video_path: ../data/MovieGen/MovieGenAudioBenchSfx/video_with_audio
17
- jsonl_path: ../data/MovieGen/MovieGenAudioBenchSfx/metadata
18
-
19
- VGGSound:
20
- video_path: ../data/test-videos
21
- # from the officially released csv file
22
- csv_path: ../data/vggsound.csv
 
config/hydra/job_logging/custom-eval.yaml DELETED
@@ -1,32 +0,0 @@
1
- # python logging configuration for tasks
2
- version: 1
3
- formatters:
4
- simple:
5
- format: '[%(asctime)s][%(levelname)s][r${oc.env:LOCAL_RANK}] - %(message)s'
6
- datefmt: '%Y-%m-%d %H:%M:%S'
7
- colorlog:
8
- '()': 'colorlog.ColoredFormatter'
9
- format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
- datefmt: '%Y-%m-%d %H:%M:%S'
11
- log_colors:
12
- DEBUG: purple
13
- INFO: green
14
- WARNING: yellow
15
- ERROR: red
16
- CRITICAL: red
17
- handlers:
18
- console:
19
- class: logging.StreamHandler
20
- formatter: colorlog
21
- stream: ext://sys.stdout
22
- file:
23
- class: logging.FileHandler
24
- formatter: simple
25
- # absolute file path
26
- filename: ${hydra.runtime.output_dir}/eval-${now:%Y-%m-%d_%H-%M-%S}-rank${oc.env:LOCAL_RANK}.log
27
- mode: w
28
- root:
29
- level: INFO
30
- handlers: [console, file]
31
-
32
- disable_existing_loggers: false
 
config/hydra/job_logging/custom-no-rank.yaml DELETED
@@ -1,32 +0,0 @@
1
- # python logging configuration for tasks
2
- version: 1
3
- formatters:
4
- simple:
5
- format: '[%(asctime)s][%(levelname)s] - %(message)s'
6
- datefmt: '%Y-%m-%d %H:%M:%S'
7
- colorlog:
8
- '()': 'colorlog.ColoredFormatter'
9
- format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
- datefmt: '%Y-%m-%d %H:%M:%S'
11
- log_colors:
12
- DEBUG: purple
13
- INFO: green
14
- WARNING: yellow
15
- ERROR: red
16
- CRITICAL: red
17
- handlers:
18
- console:
19
- class: logging.StreamHandler
20
- formatter: colorlog
21
- stream: ext://sys.stdout
22
- file:
23
- class: logging.FileHandler
24
- formatter: simple
25
- # absolute file path
26
- filename: ${hydra.runtime.output_dir}/${now:%Y-%m-%d_%H-%M-%S}-eval.log
27
- mode: w
28
- root:
29
- level: INFO
30
- handlers: [console, file]
31
-
32
- disable_existing_loggers: false
 
config/hydra/job_logging/custom-simplest.yaml DELETED
@@ -1,26 +0,0 @@
1
- # python logging configuration for tasks
2
- version: 1
3
- formatters:
4
- simple:
5
- format: '[%(asctime)s][%(levelname)s] - %(message)s'
6
- datefmt: '%Y-%m-%d %H:%M:%S'
7
- colorlog:
8
- '()': 'colorlog.ColoredFormatter'
9
- format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
- datefmt: '%Y-%m-%d %H:%M:%S'
11
- log_colors:
12
- DEBUG: purple
13
- INFO: green
14
- WARNING: yellow
15
- ERROR: red
16
- CRITICAL: red
17
- handlers:
18
- console:
19
- class: logging.StreamHandler
20
- formatter: colorlog
21
- stream: ext://sys.stdout
22
- root:
23
- level: INFO
24
- handlers: [console]
25
-
26
- disable_existing_loggers: false
 
config/hydra/job_logging/custom.yaml DELETED
@@ -1,33 +0,0 @@
1
- # @package hydra.job_logging
2
- # python logging configuration for tasks
3
- version: 1
4
- formatters:
5
- simple:
6
- format: '[%(asctime)s][%(levelname)s][r${oc.env:LOCAL_RANK}] - %(message)s'
7
- datefmt: '%Y-%m-%d %H:%M:%S'
8
- colorlog:
9
- '()': 'colorlog.ColoredFormatter'
10
- format: '[%(cyan)s%(asctime)s%(reset)s][%(blue)sr${oc.env:LOCAL_RANK}%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
11
- datefmt: '%Y-%m-%d %H:%M:%S'
12
- log_colors:
13
- DEBUG: purple
14
- INFO: green
15
- WARNING: yellow
16
- ERROR: red
17
- CRITICAL: red
18
- handlers:
19
- console:
20
- class: logging.StreamHandler
21
- formatter: colorlog
22
- stream: ext://sys.stdout
23
- file:
24
- class: logging.FileHandler
25
- formatter: simple
26
- # absolute file path
27
- filename: ${hydra.runtime.output_dir}/train-${now:%Y-%m-%d_%H-%M-%S}-rank${oc.env:LOCAL_RANK}.log
28
- mode: w
29
- root:
30
- level: INFO
31
- handlers: [console, file]
32
-
33
- disable_existing_loggers: false
 
config/train_config.yaml DELETED
@@ -1,41 +0,0 @@
1
- defaults:
2
- - base_config
3
- - override data: base
4
- - override hydra/job_logging: custom
5
- - _self_
6
-
7
- hydra:
8
- run:
9
- dir: ./output/${exp_id}
10
- output_subdir: train-${now:%Y-%m-%d_%H-%M-%S}-hydra
11
-
12
- ema:
13
- start: 0
14
-
15
- mini_train: False
16
- example_train: False
17
- enable_grad_scaler: False
18
- vgg_oversample_rate: 5
19
-
20
- log_text_interval: 200
21
- log_extra_interval: 20_000
22
- val_interval: 5_000
23
- eval_interval: 20_000
24
- save_eval_interval: 40_000
25
- save_weights_interval: 10_000
26
- save_checkpoint_interval: 10_000
27
- save_copy_iterations: []
28
-
29
- batch_size: 512
30
- eval_batch_size: 256 # per-GPU
31
-
32
- num_iterations: 300_000
33
- learning_rate: 1.0e-4
34
- linear_warmup_steps: 1_000
35
-
36
- lr_schedule: step
37
- lr_schedule_steps: [240_000, 270_000]
38
- lr_schedule_gamma: 0.1
39
-
40
- clip_grad_norm: 1.0
41
- weight_decay: 1.0e-6
 
demo.py CHANGED
@@ -62,13 +62,7 @@ def main():
62
  skip_video_composite: bool = args.skip_video_composite
63
  mask_away_clip: bool = args.mask_away_clip
64
 
65
- device = 'cpu'
66
- if torch.cuda.is_available():
67
- device = 'cuda'
68
- elif torch.backends.mps.is_available():
69
- device = 'mps'
70
- else:
71
- log.warning('CUDA/MPS are not available, running on CPU')
72
  dtype = torch.float32 if args.full_precision else torch.bfloat16
73
 
74
  output_dir.mkdir(parents=True, exist_ok=True)
 
62
  skip_video_composite: bool = args.skip_video_composite
63
  mask_away_clip: bool = args.mask_away_clip
64
 
65
+ device = 'cuda'
66
  dtype = torch.float32 if args.full_precision else torch.bfloat16
67
 
68
  output_dir.mkdir(parents=True, exist_ok=True)
docs/EVAL.md DELETED
@@ -1,22 +0,0 @@
1
- # Evaluation
2
-
3
- ## Batch Evaluation
4
-
5
- To evaluate the model on a dataset, use the `batch_eval.py` script. It is significantly more efficient in large-scale evaluation compared to `demo.py`, supporting batched inference, multi-GPU inference, torch compilation, and skipping video compositions.
6
-
7
- An example of running this script with four GPUs is as follows:
8
-
9
- ```bash
10
- OMP_NUM_THREADS=4 torchrun --standalone --nproc_per_node=4 batch_eval.py duration_s=8 dataset=vggsound model=small_16k num_workers=8
11
- ```
12
-
13
- You may need to update the data paths in `config/eval_data/base.yaml`.
14
- More configuration options can be found in `config/base_config.yaml` and `config/eval_config.yaml`.
15
-
16
- ## Precomputed Results
17
-
18
- Precomputed results for VGGSound, AudioCaps, and MovieGen are available here: https://huggingface.co/datasets/hkchengrex/MMAudio-precomputed-results
19
-
20
- ## Obtaining Quantitative Metrics
21
-
22
- Our evaluation code is available here: https://github.com/hkchengrex/av-benchmark
 
docs/MODELS.md DELETED
@@ -1,50 +0,0 @@
1
- # Pretrained models
2
-
3
- The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
4
- The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
5
-
6
- | Model | Download link | File size |
7
- | -------- | ------- | ------- |
8
- | Flow prediction network, small 16kHz | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_small_16k.pth" download="mmaudio_small_16k.pth">mmaudio_small_16k.pth</a> | 601M |
9
- | Flow prediction network, small 44.1kHz | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_small_44k.pth" download="mmaudio_small_44k.pth">mmaudio_small_44k.pth</a> | 601M |
10
- | Flow prediction network, medium 44.1kHz | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_medium_44k.pth" download="mmaudio_medium_44k.pth">mmaudio_medium_44k.pth</a> | 2.4G |
11
- | Flow prediction network, large 44.1kHz | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_large_44k.pth" download="mmaudio_large_44k.pth">mmaudio_large_44k.pth</a> | 3.9G |
12
- | Flow prediction network, large 44.1kHz, v2 **(recommended)** | <a href="https://huggingface.co/hkchengrex/MMAudio/resolve/main/weights/mmaudio_large_44k_v2.pth" download="mmaudio_large_44k_v2.pth">mmaudio_large_44k_v2.pth</a> | 3.9G |
13
- | 16kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-16.pth">v1-16.pth</a> | 655M |
14
- | 16kHz BigVGAN vocoder (from Make-An-Audio 2) |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/best_netG.pt">best_netG.pt</a> | 429M |
15
- | 44.1kHz VAE |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth">v1-44.pth</a> | 1.2G |
16
- | Synchformer visual encoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth">synchformer_state_dict.pth</a> | 907M |
17
-
18
- To run the model, you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP, CLIP will be downloaded automatically), a VAE, and a vocoder. VAEs and vocoders are specific to the sampling rate (16kHz or 44.1kHz) and not model sizes.
19
- The 44.1kHz vocoder will be downloaded automatically.
20
- The `_v2` model performs worse in benchmarking (e.g., in Fréchet distance), but, in my experience, generalizes better to new data.
21
-
22
- The expected directory structure (full):
23
-
24
- ```bash
25
- MMAudio
26
- ├── ext_weights
27
- │ ├── best_netG.pt
28
- │ ├── synchformer_state_dict.pth
29
- │ ├── v1-16.pth
30
- │ └── v1-44.pth
31
- ├── weights
32
- │ ├── mmaudio_small_16k.pth
33
- │ ├── mmaudio_small_44k.pth
34
- │ ├── mmaudio_medium_44k.pth
35
- │ ├── mmaudio_large_44k.pth
36
- │ └── mmaudio_large_44k_v2.pth
37
- └── ...
38
- ```
39
-
40
- The expected directory structure (minimal, for the recommended model only):
41
-
42
- ```bash
43
- MMAudio
44
- ├── ext_weights
45
- │ ├── synchformer_state_dict.pth
46
- │ └── v1-44.pth
47
- ├── weights
48
- │ └── mmaudio_large_44k_v2.pth
49
- └── ...
50
- ```
 
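The deleted MODELS.md points at MD5 checksums in `mmaudio/utils/download_utils.py` but does not show how to verify a manually downloaded checkpoint. A generic sketch; the path is just the recommended model from the table above:

```python
import hashlib
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB checkpoints never sit in memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Compare the printed value against the checksum listed in download_utils.py.
print(md5sum(Path('weights/mmaudio_large_44k_v2.pth')))
```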
docs/TRAINING.md DELETED
@@ -1,184 +0,0 @@
1
- # Training
2
-
3
- ## Overview
4
-
5
- We have put a large emphasis on making training as fast as possible.
6
- Consequently, some pre-processing steps are required.
7
-
8
- Namely, before starting any training, we
9
-
10
- 1. Obtain training data as videos, audios, and captions.
11
- 2. Encode training audios into spectrograms and then with VAE into mean/std
12
- 3. Extract CLIP and synchronization features from videos
13
- 4. Extract CLIP features from text (captions)
14
- 5. Encode all extracted features into [MemoryMappedTensors](https://pytorch.org/tensordict/main/reference/generated/tensordict.MemoryMappedTensor.html) with [TensorDict](https://pytorch.org/tensordict/main/reference/tensordict.html)
15
-
16
- **NOTE:** for maximum training speed (e.g., when training the base model with 2*H100s), you would need around 3~5 GB/s of random read speed. Spinning disks would not be able to catch up and most consumer-grade SSDs would struggle. In my experience, the best bet is to have a large enough system memory such that the OS can cache the data. This way, the data is read from RAM instead of disk.
17
-
18
- The current training script does not support `_v2` training.
19
-
20
- ## Recommended Hardware Configuration
21
-
22
- These are what I recommend for a smooth and efficient training experience. These are not minimum requirements.
23
-
24
- - Single-node machine. We did not implement multi-node training
25
- - GPUs: for the small model, two 80G-H100s or above; for the large model, eight 80G-H100s or above
26
- - System memory: for 16kHz training, 600GB+; for 44kHz training, 700GB+
27
- - Storage: >2TB of fast NVMe storage. If you have enough system memory, OS caching will help and the storage does not need to be as fast.
28
-
29
- ## Prerequisites
30
-
31
- 1. Install [av-benchmark](https://github.com/hkchengrex/av-benchmark). We use this library to automatically evaluate on the validation set during training, and on the test set after training.
32
- 2. Extract features for evaluation using [av-benchmark](https://github.com/hkchengrex/av-benchmark) for the validation and test set as a [validation cache](https://github.com/hkchengrex/MMAudio/blob/34bf089fdd2e457cd5ef33be96c0e1c8a0412476/config/data/base.yaml#L38) and a [test cache](https://github.com/hkchengrex/MMAudio/blob/34bf089fdd2e457cd5ef33be96c0e1c8a0412476/config/data/base.yaml#L31). You can also download the precomputed evaluation cache [here](https://huggingface.co/datasets/hkchengrex/MMAudio-precomputed-results/tree/main).
33
-
34
- 3. You will need ffmpeg to extract frames from videos. Note that `torchaudio` imposes a maximum version limit (`ffmpeg<7`). You can install it as follows:
35
-
36
- ```bash
37
- conda install -c conda-forge 'ffmpeg<7'
38
- ```
39
-
40
- 4. Download the training datasets. We used [VGGSound](https://arxiv.org/abs/2004.14368), [AudioCaps](https://audiocaps.github.io/), [WavCaps](https://arxiv.org/abs/2303.17395), and [Clotho](https://arxiv.org/abs/1910.09387) (paper to be updated). Note that the audio files in the huggingface release of WavCaps have been downsampled to 32kHz. To the best of our ability, we located the original (high-sampling rate) audio files and used them instead to prevent artifacts during 44.1kHz training. We did not use the "SoundBible" portion of WavCaps, since it is a small set with many short audio unsuitable for our training.
41
-
42
- 5. Download the corresponding VAE (`v1-16.pth` for 16kHz training, and `v1-44.pth` for 44.1kHz training), vocoder models (`best_netG.pt` for 16kHz training; the vocoder for 44.1kHz training will be downloaded automatically), the [empty string encoding](https://github.com/hkchengrex/MMAudio/releases/download/v0.1/empty_string.pth), and Synchformer weights from [MODELS.md](https://github.com/hkchengrex/MMAudio/blob/main/docs/MODELS.md); place them in `ext_weights/`.
43
-
44
- ### Helpful links for downloading the datasets
45
-
46
- We cannot redistribute the datasets for copyright reasons, but we do find some links helpful and they might be helpful to you as well.
47
-
48
- - https://huggingface.co/datasets/Meranti/CLAP_freesound
49
- - https://huggingface.co/datasets/agkphysics/AudioSet
50
- - https://sound-effects.bbcrewind.co.uk/
51
-
52
- For certain sources of VGGSound, you might notice desynchronization between the audio and the video. This happens because the video keyframes do not always align with the start of the audio, and what happens during playback is player-dependent. We used PyTorch's decoder, which handles these cases correctly.
53
-
54
- ## Preparing Audio-Video-Text Features
55
-
56
- We have prepared some example data in `training/example_videos`.
57
- `training/extract_video_training_latents.py` extracts audio, video, and text features and save them as a `TensorDict` with a `.tsv` file containing metadata to `output_dir`.
58
-
59
- To run this script, use the `torchrun` utility:
60
-
61
- ```bash
62
- torchrun --standalone training/extract_video_training_latents.py
63
- ```
64
-
65
- You can run this script with multiple GPUs (with `--nproc_per_node=<n>` after `--standalone` and before the script name) to speed up extraction.
66
- Modify the definitions near the top of the script to switch between 16kHz/44.1kHz extraction.
67
- Change the data path definitions in `data_cfg` if necessary.
68
-
69
- Arguments:
70
-
71
- - `latent_dir` -- where intermediate latent outputs are saved. It is safe to delete this directory afterwards.
72
- - `output_dir` -- where TensorDict and the metadata file are saved.
73
-
74
- Outputs produced in `output_dir`:
75
-
76
- 1. A directory named `vgg-{split}` (i.e., in the TensorDict format), containing
77
- a. `mean.memmap` mean values predicted by the VAE encoder (number of videos X sequence length X channel size)
78
- b. `std.memmap` standard deviation values predicted by the VAE encoder (number of videos X sequence length X channel size)
79
- c. `text_features.memmap` text features extracted from CLIP (number of videos X 77 (sequence length) X 1024)
80
- d. `clip_features.memmap` clip features extracted from CLIP (number of videos X 64 (8 fps) X 1024)
81
- e. `sync_features.memmap` synchronization features extracted from Synchformer (number of videos X 192 (24 fps) X 768)
82
- f. `meta.json` that contains the metadata for the above memory mappings
83
- 2. A tab-separated values file named `vgg-{split}.tsv` that contains two columns: `id` containing video file names without extension, and `label` containing corresponding text labels (i.e., captions)
84
-
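
The cache above can be sanity-checked with a short sketch; `TensorDict.load_memmap` is the same call the training datasets use, while the split name and output path below are assumptions.

```python
# Minimal sketch: open the memory-mapped features and the metadata tsv
# produced by extract_video_training_latents.py. Paths are placeholders.
from pathlib import Path

import pandas as pd
from tensordict import TensorDict

output_dir = Path('output/example_video_features')  # your `output_dir`
split = 'train'                                      # assumed split name

td = TensorDict.load_memmap(output_dir / f'vgg-{split}')   # lazy, memory-mapped
meta = pd.read_csv(output_dir / f'vgg-{split}.tsv', sep='\t')

print(td['mean'].shape)            # (num_videos, latent_seq_len, channels)
print(td['clip_features'].shape)   # (num_videos, 64, 1024)
print(td['sync_features'].shape)   # (num_videos, 192, 768)
print(meta.columns.tolist())       # ['id', 'label']
```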
85
- ## Preparing Audio-Text Features
86
-
87
- We have prepared some example data in `training/example_audios`.
88
-
89
- 1. Run `training/partition_clips.py` to partition each audio file into clips (we only find start and end points; the partitioned audio is not written to disk, to save space)
90
- 2. Run `training/extract_audio_training_latents.py` to extract each clip's audio and text features and save them as a `TensorDict` with a `.tsv` file containing metadata to `output_dir`.
91
-
92
- ### Partitioning the audio files
93
-
94
- Run
95
-
96
- ```bash
97
- python training/partition_clips.py
98
- ```
99
-
100
- Arguments:
101
-
102
- - `data_dir` -- path to a directory containing the audio files (`.flac` or `.wav`)
103
- - `output_dir` -- path to the output `.csv` file
104
- - `start` -- optional; useful when you need to run multiple processes to speed up processing -- this defines the beginning of the chunk to be processed
105
- - `end` -- optional; useful when you need to run multiple processes to speed up processing -- this defines the end of the chunk to be processed
106
-
107
- ### Extracting audio and text features
108
-
109
- Run
110
-
111
- ```bash
112
- torchrun --standalone training/extract_audio_training_latents.py
113
- ```
114
-
115
- You can run this with multiple GPUs (with `--nproc_per_node=<n>`) to speed up extraction.
116
- Modify the definitions near the top of the script to switch between 16kHz/44.1kHz extraction.
117
-
118
- Arguments:
119
-
120
- - `data_dir` -- path to a directory containing the audio files (`.flac` or `.wav`), same as the previous step
121
- `captions_tsv` -- path to the captions file, a tab-separated values (tsv) file with at least the columns `id` and `caption` (a minimal example follows this list)
122
- - `clips_tsv` -- path to the clips file, generated in the last step
123
- - `latent_dir` -- where intermediate latent outputs are saved. It is safe to delete this directory afterwards.
124
- - `output_dir` -- where TensorDict and the metadata file are saved.
125
-
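
For concreteness, a minimal `captions_tsv` in the expected shape can be written as follows; the ids and captions are made-up placeholders.

```python
# Minimal sketch: a captions tsv with the two required columns.
import pandas as pd

captions = pd.DataFrame([
    {'id': 'audio_0001', 'caption': 'rain falling on a tin roof'},
    {'id': 'audio_0002', 'caption': 'a dog barking in the distance'},
])
captions.to_csv('captions.tsv', sep='\t', index=False)
```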
126
- Outputs produced in `output_dir`:
127
-
128
- 1. A directory named `{basename(output_dir)}` (i.e., in the TensorDict format), containing
129
- a. `mean.memmap` mean values predicted by the VAE encoder (number of audios X sequence length X channel size)
130
- b. `std.memmap` standard deviation values predicted by the VAE encoder (number of audios X sequence length X channel size)
131
- c. `text_features.memmap` text features extracted from CLIP (number of audios X 77 (sequence length) X 1024)
132
- d. `meta.json` that contains the metadata for the above memory mappings
133
- 2. A tab-separated values file named `{basename(output_dir)}.tsv` that contains two columns: `id` containing audio file names without extension, and `label` containing corresponding text labels (i.e., captions)
134
-
135
- ### Reference tsv files (with overlaps removed as mentioned in the paper)
136
-
137
- The reference tsv files can be found [here](https://github.com/hkchengrex/MMAudio/releases/tag/v0.1).
138
-
139
- Note that these reference tsv files are the **outputs** of `extract_audio_training_latents.py`, which means the `id` column might contain duplicate entries (one per clip). You can still use them as the `captions_tsv` input, though -- the script handles duplicates gracefully.
140
- Among these reference tsv files, `audioset_sl.tsv`, `bbcsound.tsv`, and `freesound.tsv` are subsets that are part of WavCaps. These subsets might be smaller than the original datasets.
141
- The Clotho data contains both the development set and the validation set.
142
-
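
If you prefer a deduplicated captions file before reusing one of these references (the script tolerates the duplicates either way), keeping one row per `id` is enough; the file name below is just one of the reference tsv files mentioned above.

```python
# Minimal sketch: keep a single row per audio id in a reference tsv.
import pandas as pd

df = pd.read_csv('freesound.tsv', sep='\t', dtype={'id': str})
df = df.drop_duplicates(subset='id', keep='first')
df.to_csv('freesound_dedup.tsv', sep='\t', index=False)
```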
143
- **Update (Mar 9, 2025)**:
144
- We have uploaded a corrected set of reference tsv files. The previous tsv files contained some (<1%) corrupted captions (i.e., a mismatch between audio and caption; see https://github.com/hkchengrex/MMAudio/issues/56). The tsv files for VGGSound are unaffected. The cause of this error is unknown, and I cannot reproduce it in the latest version of the code. Our pre-trained models were trained with the **uncorrected** tsv files. For future training, I recommend using the corrected tsv files.
145
-
146
- The error statistics are as follows:
147
-
148
- AudioCaps: (170/43824), 0.39%
149
- - Freesound: (1670/180636), 0.92%
150
- - AudioSet: (290/100776), 0.29%
151
- - BBCSound: (3/29975), 0.01%
152
- - Clotho: (8/24332), 0.03%
153
-
154
- ## Training on Extracted Features
155
-
156
- We use Distributed Data Parallel (DDP) for training.
157
- First, specify the data path in `config/data/base.yaml`. If you used the default parameters in the scripts above to extract features for the example data, the `Example_video` and `Example_audio` items should already be correct.
158
-
159
- To run training on the example data, use the following command:
160
-
161
- ```bash
162
- OMP_NUM_THREADS=4 torchrun --standalone --nproc_per_node=1 train.py exp_id=debug compile=False debug=True example_train=True batch_size=1
163
- ```
164
-
165
- This will not train a useful model, but it will check if everything is set up correctly.
166
-
167
- For full training on the base model with two GPUs, use the following command:
168
-
169
- ```bash
170
- OMP_NUM_THREADS=4 torchrun --standalone --nproc_per_node=2 train.py exp_id=exp_1 model=small_16k
171
- ```
172
-
173
- Any outputs from training will be stored in `output/<exp_id>`.
174
-
175
- More configuration options can be found in `config/base_config.yaml` and `config/train_config.yaml`.
176
- For the medium and large models, set `vgg_oversample_rate` to `3` to reduce overfitting.
177
-
178
- ## Checkpoints
179
-
180
- Model checkpoints, including optimizer states and the latest EMA weights, are available here: https://huggingface.co/hkchengrex/MMAudio
181
-
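
For reference, the released inference weights can be loaded following the same pattern as the repository's demo code; the config key below is one of the published options, and loading on CPU is just a conservative default.

```python
# Minimal sketch: fetch a released checkpoint and load its weights.
import torch

from mmaudio.eval_utils import all_model_cfg
from mmaudio.model.networks import MMAudio, get_my_mmaudio

cfg = all_model_cfg['large_44k_v2']
cfg.download_if_needed()  # downloads the weights from Hugging Face if missing

net: MMAudio = get_my_mmaudio(cfg.model_name).eval()
net.load_weights(torch.load(cfg.model_path, map_location='cpu', weights_only=True))
```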
182
- ---
183
-
184
- Godspeed!
 
docs/index.html CHANGED
@@ -40,7 +40,7 @@
40
  <br>
41
  <div class="row text-center" style="font-size:28px">
42
  <div class="col">
43
- CVPR 2025
44
  </div>
45
  </div>
46
  <br>
@@ -83,21 +83,19 @@
83
  <br>
84
 
85
  <div class="h-100 row text-center justify-content-md-center" style="font-size:20px;">
86
- <div class="col-sm-2">
87
- <a href="https://arxiv.org/abs/2412.15322">[Paper]</a>
88
- </div>
89
- <div class="col-sm-2">
90
- <a href="https://github.com/hkchengrex/MMAudio">[Code]</a>
91
- </div>
92
  <div class="col-sm-3">
93
- <a href="https://huggingface.co/spaces/hkchengrex/MMAudio">[Huggingface Demo]</a>
94
- </div>
95
- <div class="col-sm-2">
96
- <a href="https://colab.research.google.com/drive/1TAaXCY2-kPk4xE4PwKB3EqFbSnkUuzZ8?usp=sharing">[Colab Demo]</a>
97
  </div>
98
  <div class="col-sm-3">
99
- <a href="https://replicate.com/zsxkib/mmaudio">[Replicate Demo]</a>
100
  </div>
 
 
 
 
101
  </div>
102
 
103
  <br>
 
40
  <br>
41
  <div class="row text-center" style="font-size:28px">
42
  <div class="col">
43
+ arXiv 2024
44
  </div>
45
  </div>
46
  <br>
 
83
  <br>
84
 
85
  <div class="h-100 row text-center justify-content-md-center" style="font-size:20px;">
86
+ <!-- <div class="col-sm-2">
87
+ <a href="https://arxiv.org/abs/2310.12982">[arXiv]</a>
88
+ </div> -->
 
 
 
89
  <div class="col-sm-3">
90
+ <a href="">[Paper (being prepared)]</a>
 
 
 
91
  </div>
92
  <div class="col-sm-3">
93
+ <a href="https://github.com/hkchengrex/MMAudio">[Code]</a>
94
  </div>
95
+ <!-- <div class="col-sm-2">
96
+ <a
97
+ href="https://colab.research.google.com/drive/1yo43XTbjxuWA7XgCUO9qxAi7wBI6HzvP?usp=sharing">[Colab]</a>
98
+ </div> -->
99
  </div>
100
 
101
  <br>
gradio_demo.py DELETED
@@ -1,343 +0,0 @@
1
- import gc
2
- import logging
3
- from argparse import ArgumentParser
4
- from datetime import datetime
5
- from fractions import Fraction
6
- from pathlib import Path
7
-
8
- import gradio as gr
9
- import torch
10
- import torchaudio
11
-
12
- from mmaudio.eval_utils import (ModelConfig, VideoInfo, all_model_cfg, generate, load_image,
13
- load_video, make_video, setup_eval_logging)
14
- from mmaudio.model.flow_matching import FlowMatching
15
- from mmaudio.model.networks import MMAudio, get_my_mmaudio
16
- from mmaudio.model.sequence_config import SequenceConfig
17
- from mmaudio.model.utils.features_utils import FeaturesUtils
18
-
19
- torch.backends.cuda.matmul.allow_tf32 = True
20
- torch.backends.cudnn.allow_tf32 = True
21
-
22
- log = logging.getLogger()
23
-
24
- device = 'cpu'
25
- if torch.cuda.is_available():
26
- device = 'cuda'
27
- elif torch.backends.mps.is_available():
28
- device = 'mps'
29
- else:
30
- log.warning('CUDA/MPS are not available, running on CPU')
31
- dtype = torch.bfloat16
32
-
33
- model: ModelConfig = all_model_cfg['large_44k_v2']
34
- model.download_if_needed()
35
- output_dir = Path('./output/gradio')
36
-
37
- setup_eval_logging()
38
-
39
-
40
- def get_model() -> tuple[MMAudio, FeaturesUtils, SequenceConfig]:
41
- seq_cfg = model.seq_cfg
42
-
43
- net: MMAudio = get_my_mmaudio(model.model_name).to(device, dtype).eval()
44
- net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True))
45
- log.info(f'Loaded weights from {model.model_path}')
46
-
47
- feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
48
- synchformer_ckpt=model.synchformer_ckpt,
49
- enable_conditions=True,
50
- mode=model.mode,
51
- bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
52
- need_vae_encoder=False)
53
- feature_utils = feature_utils.to(device, dtype).eval()
54
-
55
- return net, feature_utils, seq_cfg
56
-
57
-
58
- net, feature_utils, seq_cfg = get_model()
59
-
60
-
61
- @torch.inference_mode()
62
- def video_to_audio(video: gr.Video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
63
- cfg_strength: float, duration: float):
64
-
65
- rng = torch.Generator(device=device)
66
- if seed >= 0:
67
- rng.manual_seed(seed)
68
- else:
69
- rng.seed()
70
- fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
71
-
72
- video_info = load_video(video, duration)
73
- clip_frames = video_info.clip_frames
74
- sync_frames = video_info.sync_frames
75
- duration = video_info.duration_sec
76
- clip_frames = clip_frames.unsqueeze(0)
77
- sync_frames = sync_frames.unsqueeze(0)
78
- seq_cfg.duration = duration
79
- net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
80
-
81
- audios = generate(clip_frames,
82
- sync_frames, [prompt],
83
- negative_text=[negative_prompt],
84
- feature_utils=feature_utils,
85
- net=net,
86
- fm=fm,
87
- rng=rng,
88
- cfg_strength=cfg_strength)
89
- audio = audios.float().cpu()[0]
90
-
91
- current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
92
- output_dir.mkdir(exist_ok=True, parents=True)
93
- video_save_path = output_dir / f'{current_time_string}.mp4'
94
- make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
95
- gc.collect()
96
- return video_save_path
97
-
98
-
99
- @torch.inference_mode()
100
- def image_to_audio(image: gr.Image, prompt: str, negative_prompt: str, seed: int, num_steps: int,
101
- cfg_strength: float, duration: float):
102
-
103
- rng = torch.Generator(device=device)
104
- if seed >= 0:
105
- rng.manual_seed(seed)
106
- else:
107
- rng.seed()
108
- fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
109
-
110
- image_info = load_image(image)
111
- clip_frames = image_info.clip_frames
112
- sync_frames = image_info.sync_frames
113
- clip_frames = clip_frames.unsqueeze(0)
114
- sync_frames = sync_frames.unsqueeze(0)
115
- seq_cfg.duration = duration
116
- net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
117
-
118
- audios = generate(clip_frames,
119
- sync_frames, [prompt],
120
- negative_text=[negative_prompt],
121
- feature_utils=feature_utils,
122
- net=net,
123
- fm=fm,
124
- rng=rng,
125
- cfg_strength=cfg_strength,
126
- image_input=True)
127
- audio = audios.float().cpu()[0]
128
-
129
- current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
130
- output_dir.mkdir(exist_ok=True, parents=True)
131
- video_save_path = output_dir / f'{current_time_string}.mp4'
132
- video_info = VideoInfo.from_image_info(image_info, duration, fps=Fraction(1))
133
- make_video(video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
134
- gc.collect()
135
- return video_save_path
136
-
137
-
138
- @torch.inference_mode()
139
- def text_to_audio(prompt: str, negative_prompt: str, seed: int, num_steps: int, cfg_strength: float,
140
- duration: float):
141
-
142
- rng = torch.Generator(device=device)
143
- if seed >= 0:
144
- rng.manual_seed(seed)
145
- else:
146
- rng.seed()
147
- fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=num_steps)
148
-
149
- clip_frames = sync_frames = None
150
- seq_cfg.duration = duration
151
- net.update_seq_lengths(seq_cfg.latent_seq_len, seq_cfg.clip_seq_len, seq_cfg.sync_seq_len)
152
-
153
- audios = generate(clip_frames,
154
- sync_frames, [prompt],
155
- negative_text=[negative_prompt],
156
- feature_utils=feature_utils,
157
- net=net,
158
- fm=fm,
159
- rng=rng,
160
- cfg_strength=cfg_strength)
161
- audio = audios.float().cpu()[0]
162
-
163
- current_time_string = datetime.now().strftime('%Y%m%d_%H%M%S')
164
- output_dir.mkdir(exist_ok=True, parents=True)
165
- audio_save_path = output_dir / f'{current_time_string}.flac'
166
- torchaudio.save(audio_save_path, audio, seq_cfg.sampling_rate)
167
- gc.collect()
168
- return audio_save_path
169
-
170
-
171
- video_to_audio_tab = gr.Interface(
172
- fn=video_to_audio,
173
- description="""
174
- Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
175
- Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
176
-
177
- NOTE: It takes longer to process high-resolution videos (>384 px on the shorter side).
178
- Doing so does not improve results.
179
- """,
180
- inputs=[
181
- gr.Video(),
182
- gr.Text(label='Prompt'),
183
- gr.Text(label='Negative prompt', value='music'),
184
- gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
185
- gr.Number(label='Num steps', value=25, precision=0, minimum=1),
186
- gr.Number(label='Guidance Strength', value=4.5, minimum=1),
187
- gr.Number(label='Duration (sec)', value=8, minimum=1),
188
- ],
189
- outputs='playable_video',
190
- cache_examples=False,
191
- title='MMAudio — Video-to-Audio Synthesis',
192
- examples=[
193
- [
194
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_beach.mp4',
195
- 'waves, seagulls',
196
- '',
197
- 0,
198
- 25,
199
- 4.5,
200
- 10,
201
- ],
202
- [
203
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_serpent.mp4',
204
- '',
205
- 'music',
206
- 0,
207
- 25,
208
- 4.5,
209
- 10,
210
- ],
211
- [
212
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_seahorse.mp4',
213
- 'bubbles',
214
- '',
215
- 0,
216
- 25,
217
- 4.5,
218
- 10,
219
- ],
220
- [
221
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_india.mp4',
222
- 'Indian holy music',
223
- '',
224
- 0,
225
- 25,
226
- 4.5,
227
- 10,
228
- ],
229
- [
230
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_galloping.mp4',
231
- 'galloping',
232
- '',
233
- 0,
234
- 25,
235
- 4.5,
236
- 10,
237
- ],
238
- [
239
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_kraken.mp4',
240
- 'waves, storm',
241
- '',
242
- 0,
243
- 25,
244
- 4.5,
245
- 10,
246
- ],
247
- [
248
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/mochi_storm.mp4',
249
- 'storm',
250
- '',
251
- 0,
252
- 25,
253
- 4.5,
254
- 10,
255
- ],
256
- [
257
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_spring.mp4',
258
- '',
259
- '',
260
- 0,
261
- 25,
262
- 4.5,
263
- 10,
264
- ],
265
- [
266
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_typing.mp4',
267
- 'typing',
268
- '',
269
- 0,
270
- 25,
271
- 4.5,
272
- 10,
273
- ],
274
- [
275
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/hunyuan_wake_up.mp4',
276
- '',
277
- '',
278
- 0,
279
- 25,
280
- 4.5,
281
- 10,
282
- ],
283
- [
284
- 'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_nyc.mp4',
285
- '',
286
- '',
287
- 0,
288
- 25,
289
- 4.5,
290
- 10,
291
- ],
292
- ])
293
-
294
- text_to_audio_tab = gr.Interface(
295
- fn=text_to_audio,
296
- description="""
297
- Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
298
- Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
299
- """,
300
- inputs=[
301
- gr.Text(label='Prompt'),
302
- gr.Text(label='Negative prompt'),
303
- gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
304
- gr.Number(label='Num steps', value=25, precision=0, minimum=1),
305
- gr.Number(label='Guidance Strength', value=4.5, minimum=1),
306
- gr.Number(label='Duration (sec)', value=8, minimum=1),
307
- ],
308
- outputs='audio',
309
- cache_examples=False,
310
- title='MMAudio — Text-to-Audio Synthesis',
311
- )
312
-
313
- image_to_audio_tab = gr.Interface(
314
- fn=image_to_audio,
315
- description="""
316
- Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
317
- Code: <a href="https://github.com/hkchengrex/MMAudio">https://github.com/hkchengrex/MMAudio</a><br>
318
-
319
- NOTE: It takes longer to process high-resolution images (>384 px on the shorter side).
320
- Doing so does not improve results.
321
- """,
322
- inputs=[
323
- gr.Image(type='filepath'),
324
- gr.Text(label='Prompt'),
325
- gr.Text(label='Negative prompt'),
326
- gr.Number(label='Seed (-1: random)', value=-1, precision=0, minimum=-1),
327
- gr.Number(label='Num steps', value=25, precision=0, minimum=1),
328
- gr.Number(label='Guidance Strength', value=4.5, minimum=1),
329
- gr.Number(label='Duration (sec)', value=8, minimum=1),
330
- ],
331
- outputs='playable_video',
332
- cache_examples=False,
333
- title='MMAudio — Image-to-Audio Synthesis (experimental)',
334
- )
335
-
336
- if __name__ == "__main__":
337
- parser = ArgumentParser()
338
- parser.add_argument('--port', type=int, default=7860)
339
- args = parser.parse_args()
340
-
341
- gr.TabbedInterface([video_to_audio_tab, text_to_audio_tab, image_to_audio_tab],
342
- ['Video-to-Audio', 'Text-to-Audio', 'Image-to-Audio (experimental)']).launch(
343
- server_port=args.port, allowed_paths=[output_dir])
 
mmaudio/__pycache__/__init__.cpython-310.pyc DELETED
Binary file (187 Bytes)
 
mmaudio/__pycache__/__init__.cpython-38.pyc DELETED
Binary file (185 Bytes)
 
mmaudio/__pycache__/eval_utils.cpython-310.pyc DELETED
Binary file (7.07 kB)
 
mmaudio/__pycache__/eval_utils.cpython-38.pyc DELETED
Binary file (7.03 kB)
 
mmaudio/data/__pycache__/__init__.cpython-310.pyc DELETED
Binary file (192 Bytes)
 
mmaudio/data/__pycache__/__init__.cpython-38.pyc DELETED
Binary file (190 Bytes)
 
mmaudio/data/__pycache__/av_utils.cpython-310.pyc DELETED
Binary file (4.91 kB)
 
mmaudio/data/__pycache__/av_utils.cpython-38.pyc DELETED
Binary file (4.89 kB)
 
mmaudio/data/av_utils.py CHANGED
@@ -1,7 +1,7 @@
1
  from dataclasses import dataclass
2
  from fractions import Fraction
3
  from pathlib import Path
4
- from typing import Optional, List, Tuple
5
 
6
  import av
7
  import numpy as np
@@ -15,7 +15,7 @@ class VideoInfo:
15
  fps: Fraction
16
  clip_frames: torch.Tensor
17
  sync_frames: torch.Tensor
18
- all_frames: Optional[List[np.ndarray]]
19
 
20
  @property
21
  def height(self):
@@ -25,35 +25,9 @@ class VideoInfo:
25
  def width(self):
26
  return self.all_frames[0].shape[1]
27
 
28
- @classmethod
29
- def from_image_info(cls, image_info: 'ImageInfo', duration_sec: float,
30
- fps: Fraction) -> 'VideoInfo':
31
- num_frames = int(duration_sec * fps)
32
- all_frames = [image_info.original_frame] * num_frames
33
- return cls(duration_sec=duration_sec,
34
- fps=fps,
35
- clip_frames=image_info.clip_frames,
36
- sync_frames=image_info.sync_frames,
37
- all_frames=all_frames)
38
 
39
-
40
- @dataclass
41
- class ImageInfo:
42
- clip_frames: torch.Tensor
43
- sync_frames: torch.Tensor
44
- original_frame: Optional[np.ndarray]
45
-
46
- @property
47
- def height(self):
48
- return self.original_frame.shape[0]
49
-
50
- @property
51
- def width(self):
52
- return self.original_frame.shape[1]
53
-
54
-
55
- def read_frames(video_path: Path, list_of_fps: List[float], start_sec: float, end_sec: float,
56
- need_all_frames: bool) -> Tuple[List[np.ndarray], List[np.ndarray], Fraction]:
57
  output_frames = [[] for _ in list_of_fps]
58
  next_frame_time_for_each_fps = [0.0 for _ in list_of_fps]
59
  time_delta_for_each_fps = [1 / fps for fps in list_of_fps]
 
1
  from dataclasses import dataclass
2
  from fractions import Fraction
3
  from pathlib import Path
4
+ from typing import Optional
5
 
6
  import av
7
  import numpy as np
 
15
  fps: Fraction
16
  clip_frames: torch.Tensor
17
  sync_frames: torch.Tensor
18
+ all_frames: Optional[list[np.ndarray]]
19
 
20
  @property
21
  def height(self):
 
25
  def width(self):
26
  return self.all_frames[0].shape[1]
27
 
 
 
 
 
 
 
 
 
 
 
28
 
29
+ def read_frames(video_path: Path, list_of_fps: list[float], start_sec: float, end_sec: float,
30
+ need_all_frames: bool) -> tuple[list[np.ndarray], list[np.ndarray], Fraction]:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
  output_frames = [[] for _ in list_of_fps]
32
  next_frame_time_for_each_fps = [0.0 for _ in list_of_fps]
33
  time_delta_for_each_fps = [1 / fps for fps in list_of_fps]
mmaudio/data/data_setup.py DELETED
@@ -1,174 +0,0 @@
1
- import logging
2
- import random
3
-
4
- import numpy as np
5
- import torch
6
- from omegaconf import DictConfig
7
- from torch.utils.data import DataLoader, Dataset
8
- from torch.utils.data.dataloader import default_collate
9
- from torch.utils.data.distributed import DistributedSampler
10
-
11
- from mmaudio.data.eval.audiocaps import AudioCapsData
12
- from mmaudio.data.eval.video_dataset import MovieGen, VGGSound
13
- from mmaudio.data.extracted_audio import ExtractedAudio
14
- from mmaudio.data.extracted_vgg import ExtractedVGG
15
- from mmaudio.data.mm_dataset import MultiModalDataset
16
- from mmaudio.utils.dist_utils import local_rank
17
-
18
- log = logging.getLogger()
19
-
20
-
21
- # Re-seed randomness every time we start a worker
22
- def worker_init_fn(worker_id: int):
23
- worker_seed = torch.initial_seed() % (2**31) + worker_id + local_rank * 1000
24
- np.random.seed(worker_seed)
25
- random.seed(worker_seed)
26
- log.debug(f'Worker {worker_id} re-seeded with seed {worker_seed} in rank {local_rank}')
27
-
28
-
29
- def load_vgg_data(cfg: DictConfig, data_cfg: DictConfig) -> Dataset:
30
- dataset = ExtractedVGG(tsv_path=data_cfg.tsv,
31
- data_dim=cfg.data_dim,
32
- premade_mmap_dir=data_cfg.memmap_dir)
33
-
34
- return dataset
35
-
36
-
37
- def load_audio_data(cfg: DictConfig, data_cfg: DictConfig) -> Dataset:
38
- dataset = ExtractedAudio(tsv_path=data_cfg.tsv,
39
- data_dim=cfg.data_dim,
40
- premade_mmap_dir=data_cfg.memmap_dir)
41
-
42
- return dataset
43
-
44
-
45
- def setup_training_datasets(cfg: DictConfig) -> tuple[Dataset, DistributedSampler, DataLoader]:
46
- if cfg.mini_train:
47
- vgg = load_vgg_data(cfg, cfg.data.ExtractedVGG_val)
48
- audiocaps = load_audio_data(cfg, cfg.data.AudioCaps)
49
- dataset = MultiModalDataset([vgg], [audiocaps])
50
- if cfg.example_train:
51
- video = load_vgg_data(cfg, cfg.data.Example_video)
52
- audio = load_audio_data(cfg, cfg.data.Example_audio)
53
- dataset = MultiModalDataset([video], [audio])
54
- else:
55
- # load the largest one first
56
- freesound = load_audio_data(cfg, cfg.data.FreeSound)
57
- vgg = load_vgg_data(cfg, cfg.data.ExtractedVGG)
58
- audiocaps = load_audio_data(cfg, cfg.data.AudioCaps)
59
- audioset_sl = load_audio_data(cfg, cfg.data.AudioSetSL)
60
- bbcsound = load_audio_data(cfg, cfg.data.BBCSound)
61
- clotho = load_audio_data(cfg, cfg.data.Clotho)
62
- dataset = MultiModalDataset([vgg] * cfg.vgg_oversample_rate,
63
- [audiocaps, audioset_sl, bbcsound, freesound, clotho])
64
-
65
- batch_size = cfg.batch_size
66
- num_workers = cfg.num_workers
67
- pin_memory = cfg.pin_memory
68
- sampler, loader = construct_loader(dataset,
69
- batch_size,
70
- num_workers,
71
- shuffle=True,
72
- drop_last=True,
73
- pin_memory=pin_memory)
74
-
75
- return dataset, sampler, loader
76
-
77
-
78
- def setup_test_datasets(cfg):
79
- dataset = load_vgg_data(cfg, cfg.data.ExtractedVGG_test)
80
-
81
- batch_size = cfg.batch_size
82
- num_workers = cfg.num_workers
83
- pin_memory = cfg.pin_memory
84
- sampler, loader = construct_loader(dataset,
85
- batch_size,
86
- num_workers,
87
- shuffle=False,
88
- drop_last=False,
89
- pin_memory=pin_memory)
90
-
91
- return dataset, sampler, loader
92
-
93
-
94
- def setup_val_datasets(cfg: DictConfig) -> tuple[Dataset, DataLoader, DataLoader]:
95
- if cfg.example_train:
96
- dataset = load_vgg_data(cfg, cfg.data.Example_video)
97
- else:
98
- dataset = load_vgg_data(cfg, cfg.data.ExtractedVGG_val)
99
-
100
- val_batch_size = cfg.batch_size
101
- val_eval_batch_size = cfg.eval_batch_size
102
- num_workers = cfg.num_workers
103
- pin_memory = cfg.pin_memory
104
- _, val_loader = construct_loader(dataset,
105
- val_batch_size,
106
- num_workers,
107
- shuffle=False,
108
- drop_last=False,
109
- pin_memory=pin_memory)
110
- _, eval_loader = construct_loader(dataset,
111
- val_eval_batch_size,
112
- num_workers,
113
- shuffle=False,
114
- drop_last=False,
115
- pin_memory=pin_memory)
116
-
117
- return dataset, val_loader, eval_loader
118
-
119
-
120
- def setup_eval_dataset(dataset_name: str, cfg: DictConfig) -> tuple[Dataset, DataLoader]:
121
- if dataset_name.startswith('audiocaps_full'):
122
- dataset = AudioCapsData(cfg.eval_data.AudioCaps_full.audio_path,
123
- cfg.eval_data.AudioCaps_full.csv_path)
124
- elif dataset_name.startswith('audiocaps'):
125
- dataset = AudioCapsData(cfg.eval_data.AudioCaps.audio_path,
126
- cfg.eval_data.AudioCaps.csv_path)
127
- elif dataset_name.startswith('moviegen'):
128
- dataset = MovieGen(cfg.eval_data.MovieGen.video_path,
129
- cfg.eval_data.MovieGen.jsonl_path,
130
- duration_sec=cfg.duration_s)
131
- elif dataset_name.startswith('vggsound'):
132
- dataset = VGGSound(cfg.eval_data.VGGSound.video_path,
133
- cfg.eval_data.VGGSound.csv_path,
134
- duration_sec=cfg.duration_s)
135
- else:
136
- raise ValueError(f'Invalid dataset name: {dataset_name}')
137
-
138
- batch_size = cfg.batch_size
139
- num_workers = cfg.num_workers
140
- pin_memory = cfg.pin_memory
141
- _, loader = construct_loader(dataset,
142
- batch_size,
143
- num_workers,
144
- shuffle=False,
145
- drop_last=False,
146
- pin_memory=pin_memory,
147
- error_avoidance=True)
148
- return dataset, loader
149
-
150
-
151
- def error_avoidance_collate(batch):
152
- batch = list(filter(lambda x: x is not None, batch))
153
- return default_collate(batch)
154
-
155
-
156
- def construct_loader(dataset: Dataset,
157
- batch_size: int,
158
- num_workers: int,
159
- *,
160
- shuffle: bool = True,
161
- drop_last: bool = True,
162
- pin_memory: bool = False,
163
- error_avoidance: bool = False) -> tuple[DistributedSampler, DataLoader]:
164
- train_sampler = DistributedSampler(dataset, rank=local_rank, shuffle=shuffle)
165
- train_loader = DataLoader(dataset,
166
- batch_size,
167
- sampler=train_sampler,
168
- num_workers=num_workers,
169
- worker_init_fn=worker_init_fn,
170
- drop_last=drop_last,
171
- persistent_workers=num_workers > 0,
172
- pin_memory=pin_memory,
173
- collate_fn=error_avoidance_collate if error_avoidance else None)
174
- return train_sampler, train_loader
 
mmaudio/data/eval/__init__.py DELETED
File without changes
mmaudio/data/eval/audiocaps.py DELETED
@@ -1,39 +0,0 @@
1
- import logging
2
- import os
3
- from collections import defaultdict
4
- from pathlib import Path
5
- from typing import Union
6
-
7
- import pandas as pd
8
- import torch
9
- from torch.utils.data.dataset import Dataset
10
-
11
- log = logging.getLogger()
12
-
13
-
14
- class AudioCapsData(Dataset):
15
-
16
- def __init__(self, audio_path: Union[str, Path], csv_path: Union[str, Path]):
17
- df = pd.read_csv(csv_path).to_dict(orient='records')
18
-
19
- audio_files = sorted(os.listdir(audio_path))
20
- audio_files = set(
21
- [Path(f).stem for f in audio_files if f.endswith('.wav') or f.endswith('.flac')])
22
-
23
- self.data = []
24
- for row in df:
25
- self.data.append({
26
- 'name': row['name'],
27
- 'caption': row['caption'],
28
- })
29
-
30
- self.audio_path = Path(audio_path)
31
- self.csv_path = Path(csv_path)
32
-
33
- log.info(f'Found {len(self.data)} matching audio files in {self.audio_path}')
34
-
35
- def __getitem__(self, idx: int) -> torch.Tensor:
36
- return self.data[idx]
37
-
38
- def __len__(self):
39
- return len(self.data)
 
mmaudio/data/eval/moviegen.py DELETED
@@ -1,131 +0,0 @@
1
- import json
2
- import logging
3
- import os
4
- from pathlib import Path
5
- from typing import Union
6
-
7
- import torch
8
- from torch.utils.data.dataset import Dataset
9
- from torchvision.transforms import v2
10
- from torio.io import StreamingMediaDecoder
11
-
12
- from mmaudio.utils.dist_utils import local_rank
13
-
14
- log = logging.getLogger()
15
-
16
- _CLIP_SIZE = 384
17
- _CLIP_FPS = 8.0
18
-
19
- _SYNC_SIZE = 224
20
- _SYNC_FPS = 25.0
21
-
22
-
23
- class MovieGenData(Dataset):
24
-
25
- def __init__(
26
- self,
27
- video_root: Union[str, Path],
28
- sync_root: Union[str, Path],
29
- jsonl_root: Union[str, Path],
30
- *,
31
- duration_sec: float = 10.0,
32
- read_clip: bool = True,
33
- ):
34
- self.video_root = Path(video_root)
35
- self.sync_root = Path(sync_root)
36
- self.jsonl_root = Path(jsonl_root)
37
- self.read_clip = read_clip
38
-
39
- videos = sorted(os.listdir(self.video_root))
40
- videos = [v[:-4] for v in videos] # remove extensions
41
- self.captions = {}
42
-
43
- for v in videos:
44
- with open(self.jsonl_root / (v + '.jsonl')) as f:
45
- data = json.load(f)
46
- self.captions[v] = data['audio_prompt']
47
-
48
- if local_rank == 0:
49
- log.info(f'{len(videos)} videos found in {video_root}')
50
-
51
- self.duration_sec = duration_sec
52
-
53
- self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
54
- self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
55
-
56
- self.clip_augment = v2.Compose([
57
- v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
58
- v2.ToImage(),
59
- v2.ToDtype(torch.float32, scale=True),
60
- ])
61
-
62
- self.sync_augment = v2.Compose([
63
- v2.Resize((_SYNC_SIZE, _SYNC_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
64
- v2.CenterCrop(_SYNC_SIZE),
65
- v2.ToImage(),
66
- v2.ToDtype(torch.float32, scale=True),
67
- v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
68
- ])
69
-
70
- self.videos = videos
71
-
72
- def sample(self, idx: int) -> dict[str, torch.Tensor]:
73
- video_id = self.videos[idx]
74
- caption = self.captions[video_id]
75
-
76
- reader = StreamingMediaDecoder(self.video_root / (video_id + '.mp4'))
77
- reader.add_basic_video_stream(
78
- frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
79
- frame_rate=_CLIP_FPS,
80
- format='rgb24',
81
- )
82
- reader.add_basic_video_stream(
83
- frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
84
- frame_rate=_SYNC_FPS,
85
- format='rgb24',
86
- )
87
-
88
- reader.fill_buffer()
89
- data_chunk = reader.pop_chunks()
90
-
91
- clip_chunk = data_chunk[0]
92
- sync_chunk = data_chunk[1]
93
- if clip_chunk is None:
94
- raise RuntimeError(f'CLIP video returned None {video_id}')
95
- if clip_chunk.shape[0] < self.clip_expected_length:
96
- raise RuntimeError(f'CLIP video too short {video_id}')
97
-
98
- if sync_chunk is None:
99
- raise RuntimeError(f'Sync video returned None {video_id}')
100
- if sync_chunk.shape[0] < self.sync_expected_length:
101
- raise RuntimeError(f'Sync video too short {video_id}')
102
-
103
- # truncate the video
104
- clip_chunk = clip_chunk[:self.clip_expected_length]
105
- if clip_chunk.shape[0] != self.clip_expected_length:
106
- raise RuntimeError(f'CLIP video wrong length {video_id}, '
107
- f'expected {self.clip_expected_length}, '
108
- f'got {clip_chunk.shape[0]}')
109
- clip_chunk = self.clip_augment(clip_chunk)
110
-
111
- sync_chunk = sync_chunk[:self.sync_expected_length]
112
- if sync_chunk.shape[0] != self.sync_expected_length:
113
- raise RuntimeError(f'Sync video wrong length {video_id}, '
114
- f'expected {self.sync_expected_length}, '
115
- f'got {sync_chunk.shape[0]}')
116
- sync_chunk = self.sync_augment(sync_chunk)
117
-
118
- data = {
119
- 'name': video_id,
120
- 'caption': caption,
121
- 'clip_video': clip_chunk,
122
- 'sync_video': sync_chunk,
123
- }
124
-
125
- return data
126
-
127
- def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
128
- return self.sample(idx)
129
-
130
- def __len__(self):
131
- return len(self.captions)
 
mmaudio/data/eval/video_dataset.py DELETED
@@ -1,197 +0,0 @@
1
- import json
2
- import logging
3
- import os
4
- from pathlib import Path
5
- from typing import Union
6
-
7
- import pandas as pd
8
- import torch
9
- from torch.utils.data.dataset import Dataset
10
- from torchvision.transforms import v2
11
- from torio.io import StreamingMediaDecoder
12
-
13
- from mmaudio.utils.dist_utils import local_rank
14
-
15
- log = logging.getLogger()
16
-
17
- _CLIP_SIZE = 384
18
- _CLIP_FPS = 8.0
19
-
20
- _SYNC_SIZE = 224
21
- _SYNC_FPS = 25.0
22
-
23
-
24
- class VideoDataset(Dataset):
25
-
26
- def __init__(
27
- self,
28
- video_root: Union[str, Path],
29
- *,
30
- duration_sec: float = 8.0,
31
- ):
32
- self.video_root = Path(video_root)
33
-
34
- self.duration_sec = duration_sec
35
-
36
- self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
37
- self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
38
-
39
- self.clip_transform = v2.Compose([
40
- v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
41
- v2.ToImage(),
42
- v2.ToDtype(torch.float32, scale=True),
43
- ])
44
-
45
- self.sync_transform = v2.Compose([
46
- v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
47
- v2.CenterCrop(_SYNC_SIZE),
48
- v2.ToImage(),
49
- v2.ToDtype(torch.float32, scale=True),
50
- v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
51
- ])
52
-
53
- # to be implemented by subclasses
54
- self.captions = {}
55
- self.videos = sorted(list(self.captions.keys()))
56
-
57
- def sample(self, idx: int) -> dict[str, torch.Tensor]:
58
- video_id = self.videos[idx]
59
- caption = self.captions[video_id]
60
-
61
- reader = StreamingMediaDecoder(self.video_root / (video_id + '.mp4'))
62
- reader.add_basic_video_stream(
63
- frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
64
- frame_rate=_CLIP_FPS,
65
- format='rgb24',
66
- )
67
- reader.add_basic_video_stream(
68
- frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
69
- frame_rate=_SYNC_FPS,
70
- format='rgb24',
71
- )
72
-
73
- reader.fill_buffer()
74
- data_chunk = reader.pop_chunks()
75
-
76
- clip_chunk = data_chunk[0]
77
- sync_chunk = data_chunk[1]
78
- if clip_chunk is None:
79
- raise RuntimeError(f'CLIP video returned None {video_id}')
80
- if clip_chunk.shape[0] < self.clip_expected_length:
81
- raise RuntimeError(
82
- f'CLIP video too short {video_id}, expected {self.clip_expected_length}, got {clip_chunk.shape[0]}'
83
- )
84
-
85
- if sync_chunk is None:
86
- raise RuntimeError(f'Sync video returned None {video_id}')
87
- if sync_chunk.shape[0] < self.sync_expected_length:
88
- raise RuntimeError(
89
- f'Sync video too short {video_id}, expected {self.sync_expected_length}, got {sync_chunk.shape[0]}'
90
- )
91
-
92
- # truncate the video
93
- clip_chunk = clip_chunk[:self.clip_expected_length]
94
- if clip_chunk.shape[0] != self.clip_expected_length:
95
- raise RuntimeError(f'CLIP video wrong length {video_id}, '
96
- f'expected {self.clip_expected_length}, '
97
- f'got {clip_chunk.shape[0]}')
98
- clip_chunk = self.clip_transform(clip_chunk)
99
-
100
- sync_chunk = sync_chunk[:self.sync_expected_length]
101
- if sync_chunk.shape[0] != self.sync_expected_length:
102
- raise RuntimeError(f'Sync video wrong length {video_id}, '
103
- f'expected {self.sync_expected_length}, '
104
- f'got {sync_chunk.shape[0]}')
105
- sync_chunk = self.sync_transform(sync_chunk)
106
-
107
- data = {
108
- 'name': video_id,
109
- 'caption': caption,
110
- 'clip_video': clip_chunk,
111
- 'sync_video': sync_chunk,
112
- }
113
-
114
- return data
115
-
116
- def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
117
- try:
118
- return self.sample(idx)
119
- except Exception as e:
120
- log.error(f'Error loading video {self.videos[idx]}: {e}')
121
- return None
122
-
123
- def __len__(self):
124
- return len(self.captions)
125
-
126
-
127
- class VGGSound(VideoDataset):
128
-
129
- def __init__(
130
- self,
131
- video_root: Union[str, Path],
132
- csv_path: Union[str, Path],
133
- *,
134
- duration_sec: float = 8.0,
135
- ):
136
- super().__init__(video_root, duration_sec=duration_sec)
137
- self.video_root = Path(video_root)
138
- self.csv_path = Path(csv_path)
139
-
140
- videos = sorted(os.listdir(self.video_root))
141
- if local_rank == 0:
142
- log.info(f'{len(videos)} videos found in {video_root}')
143
- self.captions = {}
144
-
145
- df = pd.read_csv(csv_path, header=None, names=['id', 'sec', 'caption',
146
- 'split']).to_dict(orient='records')
147
-
148
- videos_no_found = []
149
- for row in df:
150
- if row['split'] == 'test':
151
- start_sec = int(row['sec'])
152
- video_id = str(row['id'])
153
- # this is how our videos are named
154
- video_name = f'{video_id}_{start_sec:06d}'
155
- if video_name + '.mp4' not in videos:
156
- videos_no_found.append(video_name)
157
- continue
158
-
159
- self.captions[video_name] = row['caption']
160
-
161
- if local_rank == 0:
162
- log.info(f'{len(videos)} videos found in {video_root}')
163
- log.info(f'{len(self.captions)} useable videos found')
164
- if videos_no_found:
165
- log.info(f'{len(videos_no_found)} found in {csv_path} but not in {video_root}')
166
- log.info(
167
- 'A small amount is expected, as not all videos are still available on YouTube')
168
-
169
- self.videos = sorted(list(self.captions.keys()))
170
-
171
-
172
- class MovieGen(VideoDataset):
173
-
174
- def __init__(
175
- self,
176
- video_root: Union[str, Path],
177
- jsonl_root: Union[str, Path],
178
- *,
179
- duration_sec: float = 10.0,
180
- ):
181
- super().__init__(video_root, duration_sec=duration_sec)
182
- self.video_root = Path(video_root)
183
- self.jsonl_root = Path(jsonl_root)
184
-
185
- videos = sorted(os.listdir(self.video_root))
186
- videos = [v[:-4] for v in videos] # remove extensions
187
- self.captions = {}
188
-
189
- for v in videos:
190
- with open(self.jsonl_root / (v + '.jsonl')) as f:
191
- data = json.load(f)
192
- self.captions[v] = data['audio_prompt']
193
-
194
- if local_rank == 0:
195
- log.info(f'{len(videos)} videos found in {video_root}')
196
-
197
- self.videos = videos
 
mmaudio/data/extracted_audio.py DELETED
@@ -1,88 +0,0 @@
1
- import logging
2
- from pathlib import Path
3
- from typing import Union
4
-
5
- import pandas as pd
6
- import torch
7
- from tensordict import TensorDict
8
- from torch.utils.data.dataset import Dataset
9
-
10
- from mmaudio.utils.dist_utils import local_rank
11
-
12
- log = logging.getLogger()
13
-
14
-
15
- class ExtractedAudio(Dataset):
16
-
17
- def __init__(
18
- self,
19
- tsv_path: Union[str, Path],
20
- *,
21
- premade_mmap_dir: Union[str, Path],
22
- data_dim: dict[str, int],
23
- ):
24
- super().__init__()
25
-
26
- self.data_dim = data_dim
27
- self.df_list = pd.read_csv(tsv_path, sep='\t').to_dict('records')
28
- self.ids = [str(d['id']) for d in self.df_list]
29
-
30
- log.info(f'Loading precomputed mmap from {premade_mmap_dir}')
31
- # load precomputed memory mapped tensors
32
- premade_mmap_dir = Path(premade_mmap_dir)
33
- td = TensorDict.load_memmap(premade_mmap_dir)
34
- log.info(f'Loaded precomputed mmap from {premade_mmap_dir}')
35
- self.mean = td['mean']
36
- self.std = td['std']
37
- self.text_features = td['text_features']
38
-
39
- log.info(f'Loaded {len(self)} samples from {premade_mmap_dir}.')
40
- log.info(f'Loaded mean: {self.mean.shape}.')
41
- log.info(f'Loaded std: {self.std.shape}.')
42
- log.info(f'Loaded text features: {self.text_features.shape}.')
43
-
44
- assert self.mean.shape[1] == self.data_dim['latent_seq_len'], \
45
- f'{self.mean.shape[1]} != {self.data_dim["latent_seq_len"]}'
46
- assert self.std.shape[1] == self.data_dim['latent_seq_len'], \
47
- f'{self.std.shape[1]} != {self.data_dim["latent_seq_len"]}'
48
-
49
- assert self.text_features.shape[1] == self.data_dim['text_seq_len'], \
50
- f'{self.text_features.shape[1]} != {self.data_dim["text_seq_len"]}'
51
- assert self.text_features.shape[-1] == self.data_dim['text_dim'], \
52
- f'{self.text_features.shape[-1]} != {self.data_dim["text_dim"]}'
53
-
54
- self.fake_clip_features = torch.zeros(self.data_dim['clip_seq_len'],
55
- self.data_dim['clip_dim'])
56
- self.fake_sync_features = torch.zeros(self.data_dim['sync_seq_len'],
57
- self.data_dim['sync_dim'])
58
- self.video_exist = torch.tensor(0, dtype=torch.bool)
59
- self.text_exist = torch.tensor(1, dtype=torch.bool)
60
-
61
- def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
62
- latents = self.mean
63
- return latents.mean(dim=(0, 1)), latents.std(dim=(0, 1))
64
-
65
- def get_memory_mapped_tensor(self) -> TensorDict:
66
- td = TensorDict({
67
- 'mean': self.mean,
68
- 'std': self.std,
69
- 'text_features': self.text_features,
70
- })
71
- return td
72
-
73
- def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
74
- data = {
75
- 'id': str(self.df_list[idx]['id']),
76
- 'a_mean': self.mean[idx],
77
- 'a_std': self.std[idx],
78
- 'clip_features': self.fake_clip_features,
79
- 'sync_features': self.fake_sync_features,
80
- 'text_features': self.text_features[idx],
81
- 'caption': self.df_list[idx]['caption'],
82
- 'video_exist': self.video_exist,
83
- 'text_exist': self.text_exist,
84
- }
85
- return data
86
-
87
- def __len__(self):
88
- return len(self.ids)
 
mmaudio/data/extracted_vgg.py DELETED
@@ -1,101 +0,0 @@
1
- import logging
2
- from pathlib import Path
3
- from typing import Union
4
-
5
- import pandas as pd
6
- import torch
7
- from tensordict import TensorDict
8
- from torch.utils.data.dataset import Dataset
9
-
10
- from mmaudio.utils.dist_utils import local_rank
11
-
12
- log = logging.getLogger()
13
-
14
-
15
- class ExtractedVGG(Dataset):
16
-
17
- def __init__(
18
- self,
19
- tsv_path: Union[str, Path],
20
- *,
21
- premade_mmap_dir: Union[str, Path],
22
- data_dim: dict[str, int],
23
- ):
24
- super().__init__()
25
-
26
- self.data_dim = data_dim
27
- self.df_list = pd.read_csv(tsv_path, sep='\t').to_dict('records')
28
- self.ids = [d['id'] for d in self.df_list]
29
-
30
- log.info(f'Loading precomputed mmap from {premade_mmap_dir}')
31
- # load precomputed memory mapped tensors
32
- premade_mmap_dir = Path(premade_mmap_dir)
33
- td = TensorDict.load_memmap(premade_mmap_dir)
34
- log.info(f'Loaded precomputed mmap from {premade_mmap_dir}')
35
- self.mean = td['mean']
36
- self.std = td['std']
37
- self.clip_features = td['clip_features']
38
- self.sync_features = td['sync_features']
39
- self.text_features = td['text_features']
40
-
41
- if local_rank == 0:
42
- log.info(f'Loaded {len(self)} samples.')
43
- log.info(f'Loaded mean: {self.mean.shape}.')
44
- log.info(f'Loaded std: {self.std.shape}.')
45
- log.info(f'Loaded clip_features: {self.clip_features.shape}.')
46
- log.info(f'Loaded sync_features: {self.sync_features.shape}.')
47
- log.info(f'Loaded text_features: {self.text_features.shape}.')
48
-
49
- assert self.mean.shape[1] == self.data_dim['latent_seq_len'], \
50
- f'{self.mean.shape[1]} != {self.data_dim["latent_seq_len"]}'
51
- assert self.std.shape[1] == self.data_dim['latent_seq_len'], \
52
- f'{self.std.shape[1]} != {self.data_dim["latent_seq_len"]}'
53
-
54
- assert self.clip_features.shape[1] == self.data_dim['clip_seq_len'], \
55
- f'{self.clip_features.shape[1]} != {self.data_dim["clip_seq_len"]}'
56
- assert self.sync_features.shape[1] == self.data_dim['sync_seq_len'], \
57
- f'{self.sync_features.shape[1]} != {self.data_dim["sync_seq_len"]}'
58
- assert self.text_features.shape[1] == self.data_dim['text_seq_len'], \
59
- f'{self.text_features.shape[1]} != {self.data_dim["text_seq_len"]}'
60
-
61
- assert self.clip_features.shape[-1] == self.data_dim['clip_dim'], \
62
- f'{self.clip_features.shape[-1]} != {self.data_dim["clip_dim"]}'
63
- assert self.sync_features.shape[-1] == self.data_dim['sync_dim'], \
64
- f'{self.sync_features.shape[-1]} != {self.data_dim["sync_dim"]}'
65
- assert self.text_features.shape[-1] == self.data_dim['text_dim'], \
66
- f'{self.text_features.shape[-1]} != {self.data_dim["text_dim"]}'
67
-
68
- self.video_exist = torch.tensor(1, dtype=torch.bool)
69
- self.text_exist = torch.tensor(1, dtype=torch.bool)
70
-
71
- def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
72
- latents = self.mean
73
- return latents.mean(dim=(0, 1)), latents.std(dim=(0, 1))
74
-
75
- def get_memory_mapped_tensor(self) -> TensorDict:
76
- td = TensorDict({
77
- 'mean': self.mean,
78
- 'std': self.std,
79
- 'clip_features': self.clip_features,
80
- 'sync_features': self.sync_features,
81
- 'text_features': self.text_features,
82
- })
83
- return td
84
-
85
- def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
86
- data = {
87
- 'id': self.df_list[idx]['id'],
88
- 'a_mean': self.mean[idx],
89
- 'a_std': self.std[idx],
90
- 'clip_features': self.clip_features[idx],
91
- 'sync_features': self.sync_features[idx],
92
- 'text_features': self.text_features[idx],
93
- 'caption': self.df_list[idx]['label'],
94
- 'video_exist': self.video_exist,
95
- 'text_exist': self.text_exist,
96
- }
97
-
98
- return data
99
-
100
- def __len__(self):
101
- return len(self.ids)
 
mmaudio/data/extraction/__init__.py DELETED
File without changes
mmaudio/data/extraction/vgg_sound.py DELETED
@@ -1,193 +0,0 @@
1
- import logging
2
- import os
3
- from pathlib import Path
4
- from typing import Optional, Union
5
-
6
- import pandas as pd
7
- import torch
8
- import torchaudio
9
- from torch.utils.data.dataset import Dataset
10
- from torchvision.transforms import v2
11
- from torio.io import StreamingMediaDecoder
12
-
13
- from mmaudio.utils.dist_utils import local_rank
14
-
15
- log = logging.getLogger()
16
-
17
- _CLIP_SIZE = 384
18
- _CLIP_FPS = 8.0
19
-
20
- _SYNC_SIZE = 224
21
- _SYNC_FPS = 25.0
22
-
23
-
24
- class VGGSound(Dataset):
25
-
26
- def __init__(
27
- self,
28
- root: Union[str, Path],
29
- *,
30
- tsv_path: Union[str, Path] = 'sets/vgg3-train.tsv',
31
- sample_rate: int = 16_000,
32
- duration_sec: float = 8.0,
33
- audio_samples: Optional[int] = None,
34
- normalize_audio: bool = False,
35
- ):
36
- self.root = Path(root)
37
- self.normalize_audio = normalize_audio
38
- if audio_samples is None:
39
- self.audio_samples = int(sample_rate * duration_sec)
40
- else:
41
- self.audio_samples = audio_samples
42
- effective_duration = audio_samples / sample_rate
43
- # make sure the duration is close enough, within 15ms
44
- assert abs(effective_duration - duration_sec) < 0.015, \
45
- f'audio_samples {audio_samples} does not match duration_sec {duration_sec}'
46
-
47
- videos = sorted(os.listdir(self.root))
48
- videos = set([Path(v).stem for v in videos]) # remove extensions
49
- self.labels = {}
50
- self.videos = []
51
- missing_videos = []
52
-
53
- # read the tsv for subset information
54
- df_list = pd.read_csv(tsv_path, sep='\t', dtype={'id': str}).to_dict('records')
55
- for record in df_list:
56
- id = record['id']
57
- label = record['label']
58
- if id in videos:
59
- self.labels[id] = label
60
- self.videos.append(id)
61
- else:
62
- missing_videos.append(id)
63
-
64
- if local_rank == 0:
65
- log.info(f'{len(videos)} videos found in {root}')
66
- log.info(f'{len(self.videos)} videos found in {tsv_path}')
67
- log.info(f'{len(missing_videos)} videos missing in {root}')
68
-
69
- self.sample_rate = sample_rate
70
- self.duration_sec = duration_sec
71
-
72
- self.expected_audio_length = audio_samples
73
- self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
74
- self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
75
-
76
- self.clip_transform = v2.Compose([
77
- v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
78
- v2.ToImage(),
79
- v2.ToDtype(torch.float32, scale=True),
80
- ])
81
-
82
- self.sync_transform = v2.Compose([
83
- v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
84
- v2.CenterCrop(_SYNC_SIZE),
85
- v2.ToImage(),
86
- v2.ToDtype(torch.float32, scale=True),
87
- v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
88
- ])
89
-
90
- self.resampler = {}
91
-
92
- def sample(self, idx: int) -> dict[str, torch.Tensor]:
93
- video_id = self.videos[idx]
94
- label = self.labels[video_id]
95
-
96
- reader = StreamingMediaDecoder(self.root / (video_id + '.mp4'))
97
- reader.add_basic_video_stream(
98
- frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
99
- frame_rate=_CLIP_FPS,
100
- format='rgb24',
101
- )
102
- reader.add_basic_video_stream(
103
- frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
104
- frame_rate=_SYNC_FPS,
105
- format='rgb24',
106
- )
107
- reader.add_basic_audio_stream(frames_per_chunk=2**30, )
108
-
109
- reader.fill_buffer()
110
- data_chunk = reader.pop_chunks()
111
-
112
- clip_chunk = data_chunk[0]
113
- sync_chunk = data_chunk[1]
114
- audio_chunk = data_chunk[2]
115
-
116
- if clip_chunk is None:
117
- raise RuntimeError(f'CLIP video returned None {video_id}')
118
- if clip_chunk.shape[0] < self.clip_expected_length:
119
- raise RuntimeError(
120
- f'CLIP video too short {video_id}, expected {self.clip_expected_length}, got {clip_chunk.shape[0]}'
121
- )
122
-
123
- if sync_chunk is None:
124
- raise RuntimeError(f'Sync video returned None {video_id}')
125
- if sync_chunk.shape[0] < self.sync_expected_length:
126
- raise RuntimeError(
127
- f'Sync video too short {video_id}, expected {self.sync_expected_length}, got {sync_chunk.shape[0]}'
128
- )
129
-
130
- # process audio
131
- sample_rate = int(reader.get_out_stream_info(2).sample_rate)
132
- audio_chunk = audio_chunk.transpose(0, 1)
133
- audio_chunk = audio_chunk.mean(dim=0) # mono
134
- if self.normalize_audio:
135
- abs_max = audio_chunk.abs().max()
136
- audio_chunk = audio_chunk / abs_max * 0.95
137
- if abs_max <= 1e-6:
138
- raise RuntimeError(f'Audio is silent {video_id}')
139
-
140
- # resample
141
- if sample_rate == self.sample_rate:
142
- audio_chunk = audio_chunk
143
- else:
144
- if sample_rate not in self.resampler:
145
- # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
146
- self.resampler[sample_rate] = torchaudio.transforms.Resample(
147
- sample_rate,
148
- self.sample_rate,
149
- lowpass_filter_width=64,
150
- rolloff=0.9475937167399596,
151
- resampling_method='sinc_interp_kaiser',
152
- beta=14.769656459379492,
153
- )
154
- audio_chunk = self.resampler[sample_rate](audio_chunk)
155
-
156
- if audio_chunk.shape[0] < self.expected_audio_length:
157
- raise RuntimeError(f'Audio too short {video_id}')
158
- audio_chunk = audio_chunk[:self.expected_audio_length]
159
-
160
- # truncate the video
161
- clip_chunk = clip_chunk[:self.clip_expected_length]
162
- if clip_chunk.shape[0] != self.clip_expected_length:
163
- raise RuntimeError(f'CLIP video wrong length {video_id}, '
164
- f'expected {self.clip_expected_length}, '
165
- f'got {clip_chunk.shape[0]}')
166
- clip_chunk = self.clip_transform(clip_chunk)
167
-
168
- sync_chunk = sync_chunk[:self.sync_expected_length]
169
- if sync_chunk.shape[0] != self.sync_expected_length:
170
- raise RuntimeError(f'Sync video wrong length {video_id}, '
171
- f'expected {self.sync_expected_length}, '
172
- f'got {sync_chunk.shape[0]}')
173
- sync_chunk = self.sync_transform(sync_chunk)
174
-
175
- data = {
176
- 'id': video_id,
177
- 'caption': label,
178
- 'audio': audio_chunk,
179
- 'clip_video': clip_chunk,
180
- 'sync_video': sync_chunk,
181
- }
182
-
183
- return data
184
-
185
- def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
186
- try:
187
- return self.sample(idx)
188
- except Exception as e:
189
- log.error(f'Error loading video {self.videos[idx]}: {e}')
190
- return None
191
-
192
- def __len__(self):
193
- return len(self.labels)
 
mmaudio/data/extraction/wav_dataset.py DELETED
@@ -1,132 +0,0 @@
-import logging
-import os
-from pathlib import Path
-from typing import Union
-
-import open_clip
-import pandas as pd
-import torch
-import torchaudio
-from torch.utils.data.dataset import Dataset
-
-log = logging.getLogger()
-
-
-class WavTextClipsDataset(Dataset):
-
-    def __init__(
-        self,
-        root: Union[str, Path],
-        *,
-        captions_tsv: Union[str, Path],
-        clips_tsv: Union[str, Path],
-        sample_rate: int,
-        num_samples: int,
-        normalize_audio: bool = False,
-        reject_silent: bool = False,
-        tokenizer_id: str = 'ViT-H-14-378-quickgelu',
-    ):
-        self.root = Path(root)
-        self.sample_rate = sample_rate
-        self.num_samples = num_samples
-        self.normalize_audio = normalize_audio
-        self.reject_silent = reject_silent
-        self.tokenizer = open_clip.get_tokenizer(tokenizer_id)
-
-        audios = sorted(os.listdir(self.root))
-        audios = set([
-            Path(audio).stem for audio in audios
-            if audio.endswith('.wav') or audio.endswith('.flac')
-        ])
-        self.captions = {}
-
-        # read the caption tsv
-        df_list = pd.read_csv(captions_tsv, sep='\t', dtype={'id': str}).to_dict('records')
-        for record in df_list:
-            id = record['id']
-            caption = record['caption']
-            self.captions[id] = caption
-
-        # read the clip tsv
-        df_list = pd.read_csv(clips_tsv, sep='\t', dtype={
-            'id': str,
-            'name': str
-        }).to_dict('records')
-        self.clips = []
-        for record in df_list:
-            record['id'] = record['id']
-            record['name'] = record['name']
-            id = record['id']
-            name = record['name']
-            if name not in self.captions:
-                log.warning(f'Audio {name} not found in {captions_tsv}')
-                continue
-            record['caption'] = self.captions[name]
-            self.clips.append(record)
-
-        log.info(f'Found {len(self.clips)} audio files in {self.root}')
-
-        self.resampler = {}
-
-    def __getitem__(self, idx: int) -> torch.Tensor:
-        try:
-            clip = self.clips[idx]
-            audio_name = clip['name']
-            audio_id = clip['id']
-            caption = clip['caption']
-            start_sample = clip['start_sample']
-            end_sample = clip['end_sample']
-
-            audio_path = self.root / f'{audio_name}.flac'
-            if not audio_path.exists():
-                audio_path = self.root / f'{audio_name}.wav'
-                assert audio_path.exists()
-
-            audio_chunk, sample_rate = torchaudio.load(audio_path)
-            audio_chunk = audio_chunk.mean(dim=0)  # mono
-            abs_max = audio_chunk.abs().max()
-            if self.normalize_audio:
-                audio_chunk = audio_chunk / abs_max * 0.95
-
-            if self.reject_silent and abs_max < 1e-6:
-                log.warning(f'Rejecting silent audio')
-                return None
-
-            audio_chunk = audio_chunk[start_sample:end_sample]
-
-            # resample
-            if sample_rate == self.sample_rate:
-                audio_chunk = audio_chunk
-            else:
-                if sample_rate not in self.resampler:
-                    # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
-                    self.resampler[sample_rate] = torchaudio.transforms.Resample(
-                        sample_rate,
-                        self.sample_rate,
-                        lowpass_filter_width=64,
-                        rolloff=0.9475937167399596,
-                        resampling_method='sinc_interp_kaiser',
-                        beta=14.769656459379492,
-                    )
-                audio_chunk = self.resampler[sample_rate](audio_chunk)
-
-            if audio_chunk.shape[0] < self.num_samples:
-                raise ValueError('Audio is too short')
-            audio_chunk = audio_chunk[:self.num_samples]
-
-            tokens = self.tokenizer([caption])[0]
-
-            output = {
-                'waveform': audio_chunk,
-                'id': audio_id,
-                'caption': caption,
-                'tokens': tokens,
-            }
-
-            return output
-        except Exception as e:
-            log.error(f'Error reading {audio_path}: {e}')
-            return None
-
-    def __len__(self):
-        return len(self.clips)
 
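Like the VGGSound loader above, the deleted `WavTextClipsDataset` returns `None` when a clip cannot be used (read error, rejected silent audio). A dataset that can yield `None` needs a collate function that drops those entries before batching; the following is a generic companion sketch, not code from this repository.

```python
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate


def collate_skip_none(batch):
    """Drop failed samples (None) before running the default collation."""
    batch = [sample for sample in batch if sample is not None]
    if not batch:
        return None  # the training loop should skip this iteration
    return default_collate(batch)


# Hypothetical usage with a dataset that may return None:
# loader = DataLoader(dataset, batch_size=16, num_workers=4, collate_fn=collate_skip_none)
```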
mmaudio/data/mm_dataset.py DELETED
@@ -1,45 +0,0 @@
-import bisect
-
-import torch
-from torch.utils.data.dataset import Dataset
-
-
-# modified from https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#ConcatDataset
-class MultiModalDataset(Dataset):
-    datasets: list[Dataset]
-    cumulative_sizes: list[int]
-
-    @staticmethod
-    def cumsum(sequence):
-        r, s = [], 0
-        for e in sequence:
-            l = len(e)
-            r.append(l + s)
-            s += l
-        return r
-
-    def __init__(self, video_datasets: list[Dataset], audio_datasets: list[Dataset]):
-        super().__init__()
-        self.video_datasets = list(video_datasets)
-        self.audio_datasets = list(audio_datasets)
-        self.datasets = self.video_datasets + self.audio_datasets
-
-        self.cumulative_sizes = self.cumsum(self.datasets)
-
-    def __len__(self):
-        return self.cumulative_sizes[-1]
-
-    def __getitem__(self, idx):
-        if idx < 0:
-            if -idx > len(self):
-                raise ValueError("absolute value of index should not exceed dataset length")
-            idx = len(self) + idx
-        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
-        if dataset_idx == 0:
-            sample_idx = idx
-        else:
-            sample_idx = idx - self.cumulative_sizes[dataset_idx - 1]
-        return self.datasets[dataset_idx][sample_idx]
-
-    def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
-        return self.video_datasets[0].compute_latent_stats()
 
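The deleted `MultiModalDataset` maps a flat index to a (dataset, local index) pair through cumulative sizes and `bisect_right`, the same scheme as PyTorch's `ConcatDataset`. A small illustration of the index arithmetic with made-up dataset sizes:

```python
import bisect

# Three hypothetical datasets of length 100, 150 and 150.
cumulative_sizes = [100, 250, 400]


def locate(idx: int) -> tuple[int, int]:
    """Return (dataset index, index within that dataset) for a global index."""
    dataset_idx = bisect.bisect_right(cumulative_sizes, idx)
    sample_idx = idx if dataset_idx == 0 else idx - cumulative_sizes[dataset_idx - 1]
    return dataset_idx, sample_idx


assert locate(0) == (0, 0)
assert locate(99) == (0, 99)
assert locate(100) == (1, 0)    # first item of the second dataset
assert locate(399) == (2, 149)  # last item overall
```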
mmaudio/data/utils.py DELETED
@@ -1,148 +0,0 @@
-import logging
-import os
-import random
-import tempfile
-from pathlib import Path
-from typing import Any, Optional, Union
-
-import torch
-import torch.distributed as dist
-from tensordict import MemoryMappedTensor
-from torch.utils.data import DataLoader
-from torch.utils.data.dataset import Dataset
-from tqdm import tqdm
-
-from mmaudio.utils.dist_utils import local_rank, world_size
-
-scratch_path = Path(os.environ['SLURM_SCRATCH'] if 'SLURM_SCRATCH' in os.environ else '/dev/shm')
-shm_path = Path('/dev/shm')
-
-log = logging.getLogger()
-
-
-def reseed(seed):
-    random.seed(seed)
-    torch.manual_seed(seed)
-
-
-def local_scatter_torch(obj: Optional[Any]):
-    if world_size == 1:
-        # Just one worker. Do nothing.
-        return obj
-
-    array = [obj] * world_size
-    target_array = [None]
-    if local_rank == 0:
-        dist.scatter_object_list(target_array, scatter_object_input_list=array, src=0)
-    else:
-        dist.scatter_object_list(target_array, scatter_object_input_list=None, src=0)
-    return target_array[0]
-
-
-class ShardDataset(Dataset):
-
-    def __init__(self, root):
-        self.root = root
-        self.shards = sorted(os.listdir(root))
-
-    def __len__(self):
-        return len(self.shards)
-
-    def __getitem__(self, idx):
-        return torch.load(os.path.join(self.root, self.shards[idx]), weights_only=True)
-
-
-def get_tmp_dir(in_memory: bool) -> Path:
-    return shm_path if in_memory else scratch_path
-
-
-def load_shards_and_share(data_path: Union[str, Path], ids: list[int],
-                          in_memory: bool) -> MemoryMappedTensor:
-    if local_rank == 0:
-        with tempfile.NamedTemporaryFile(prefix='shared-tensor-', dir=get_tmp_dir(in_memory)) as f:
-            log.info(f'Loading shards from {data_path} into {f.name}...')
-            data = load_shards(data_path, ids=ids, tmp_file_path=f.name)
-            data = share_tensor_to_all(data)
-            torch.distributed.barrier()
-            f.close()  # why does the context manager not close the file for me?
-    else:
-        log.info('Waiting for the data to be shared with me...')
-        data = share_tensor_to_all(None)
-        torch.distributed.barrier()
-
-    return data
-
-
-def load_shards(
-    data_path: Union[str, Path],
-    ids: list[int],
-    *,
-    tmp_file_path: str,
-) -> Union[torch.Tensor, dict[str, torch.Tensor]]:
-
-    id_set = set(ids)
-    shards = sorted(os.listdir(data_path))
-    log.info(f'Found {len(shards)} shards in {data_path}.')
-    first_shard = torch.load(os.path.join(data_path, shards[0]), weights_only=True)
-
-    log.info(f'Rank {local_rank} created file {tmp_file_path}')
-    first_item = next(iter(first_shard.values()))
-    log.info(f'First item shape: {first_item.shape}')
-    mm_tensor = MemoryMappedTensor.empty(shape=(len(ids), *first_item.shape),
-                                         dtype=torch.float32,
-                                         filename=tmp_file_path,
-                                         existsok=True)
-    total_count = 0
-    used_index = set()
-    id_indexing = {i: idx for idx, i in enumerate(ids)}
-    # faster with no workers; otherwise we need to set_sharing_strategy('file_system')
-    loader = DataLoader(ShardDataset(data_path), batch_size=1, num_workers=0)
-    for data in tqdm(loader, desc='Loading shards'):
-        for i, v in data.items():
-            if i not in id_set:
-                continue
-
-            # tensor_index = ids.index(i)
-            tensor_index = id_indexing[i]
-            if tensor_index in used_index:
-                raise ValueError(f'Duplicate id {i} found in {data_path}.')
-            used_index.add(tensor_index)
-            mm_tensor[tensor_index] = v
-            total_count += 1
-
-    assert total_count == len(ids), f'Expected {len(ids)} tensors, got {total_count}.'
-    log.info(f'Loaded {total_count} tensors from {data_path}.')
-
-    return mm_tensor
-
-
-def share_tensor_to_all(x: Optional[MemoryMappedTensor]) -> MemoryMappedTensor:
-    """
-    x: the tensor to be shared; None if local_rank != 0
-    return: the shared tensor
-    """
-
-    # there is no need to share your stuff with anyone if you are alone; must be in memory
-    if world_size == 1:
-        return x
-
-    if local_rank == 0:
-        assert x is not None, 'x must not be None if local_rank == 0'
-    else:
-        assert x is None, 'x must be None if local_rank != 0'
-
-    if local_rank == 0:
-        filename = x.filename
-        meta_information = (filename, x.shape, x.dtype)
-    else:
-        meta_information = None
-
-    filename, data_shape, data_type = local_scatter_torch(meta_information)
-    if local_rank == 0:
-        data = x
-    else:
-        data = MemoryMappedTensor.from_filename(filename=filename,
-                                                dtype=data_type,
-                                                shape=data_shape)
-
-    return data
 
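The deleted `utils.py` loads extracted features once on rank 0, writes them into a tensordict `MemoryMappedTensor`, and then scatters only the (filename, shape, dtype) triple so that the other ranks attach to the same backing file instead of holding their own copy. Here is a single-node sketch of that idea without the distributed plumbing; the `/dev/shm` location and toy shape are assumptions for illustration.

```python
import tempfile

import torch
from tensordict import MemoryMappedTensor

# "Rank 0": materialise the data into a file kept in shared memory.
tmp = tempfile.NamedTemporaryFile(prefix='shared-tensor-', dir='/dev/shm', delete=False)
mm = MemoryMappedTensor.empty(shape=(4, 8), dtype=torch.float32,
                              filename=tmp.name, existsok=True)
mm.copy_(torch.arange(32, dtype=torch.float32).reshape(4, 8))

# This metadata is all that would need to be scattered to the other ranks.
meta = (tmp.name, mm.shape, mm.dtype)

# "Other rank": re-attach to the same backing file from the metadata alone.
filename, shape, dtype = meta
shared = MemoryMappedTensor.from_filename(filename=filename, dtype=dtype, shape=shape)
assert torch.equal(shared, mm)
```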
mmaudio/eval_utils.py CHANGED
@@ -1,18 +1,16 @@
 import dataclasses
 import logging
 from pathlib import Path
-from typing import Optional, Tuple, List, Dict
+from typing import Optional
 
-import numpy as np
 import torch
 from colorlog import ColoredFormatter
-from PIL import Image
 from torchvision.transforms import v2
 
-from mmaudio.data.av_utils import ImageInfo, VideoInfo, read_frames, reencode_with_audio
+from mmaudio.data.av_utils import VideoInfo, read_frames, reencode_with_audio
 from mmaudio.model.flow_matching import FlowMatching
 from mmaudio.model.networks import MMAudio
-from mmaudio.model.sequence_config import CONFIG_16K, CONFIG_44K, SequenceConfig
+from mmaudio.model.sequence_config import (CONFIG_16K, CONFIG_44K, SequenceConfig)
 from mmaudio.model.utils.features_utils import FeaturesUtils
 from mmaudio.utils.download_utils import download_model_if_needed
 
@@ -26,7 +24,7 @@ class ModelConfig:
     vae_path: Path
     bigvgan_16k_path: Optional[Path]
     mode: str
-    synchformer_ckpt: Path = Path('./pretrained/v2a/mmaudio/ext_weights/synchformer_state_dict.pth')
+    synchformer_ckpt: Path = Path('./ext_weights/synchformer_state_dict.pth')
 
     @property
     def seq_cfg(self) -> SequenceConfig:
@@ -44,31 +42,31 @@
 
 
 small_16k = ModelConfig(model_name='small_16k',
-                        model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_small_16k.pth'),
-                        vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-16.pth'),
-                        bigvgan_16k_path=Path('./pretrained/v2a/mmaudio/ext_weights/best_netG.pt'),
+                        model_path=Path('./weights/mmaudio_small_16k.pth'),
+                        vae_path=Path('./ext_weights/v1-16.pth'),
+                        bigvgan_16k_path=Path('./ext_weights/best_netG.pt'),
                         mode='16k')
 small_44k = ModelConfig(model_name='small_44k',
-                        model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_small_44k.pth'),
-                        vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-44.pth'),
+                        model_path=Path('./weights/mmaudio_small_44k.pth'),
+                        vae_path=Path('./ext_weights/v1-44.pth'),
                         bigvgan_16k_path=None,
                         mode='44k')
 medium_44k = ModelConfig(model_name='medium_44k',
-                         model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_medium_44k.pth'),
-                         vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-44.pth'),
+                         model_path=Path('./weights/mmaudio_medium_44k.pth'),
+                         vae_path=Path('./ext_weights/v1-44.pth'),
                          bigvgan_16k_path=None,
                          mode='44k')
 large_44k = ModelConfig(model_name='large_44k',
-                        model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_large_44k.pth'),
-                        vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-44.pth'),
+                        model_path=Path('./weights/mmaudio_large_44k.pth'),
+                        vae_path=Path('./ext_weights/v1-44.pth'),
                         bigvgan_16k_path=None,
                         mode='44k')
 large_44k_v2 = ModelConfig(model_name='large_44k_v2',
-                           model_path=Path('./pretrained/v2a/mmaudio/weights/mmaudio_large_44k_v2.pth'),
-                           vae_path=Path('./pretrained/v2a/mmaudio/ext_weights/v1-44.pth'),
+                           model_path=Path('./weights/mmaudio_large_44k_v2.pth'),
+                           vae_path=Path('./ext_weights/v1-44.pth'),
                            bigvgan_16k_path=None,
                            mode='44k')
-all_model_cfg: Dict[str, ModelConfig] = {
+all_model_cfg: dict[str, ModelConfig] = {
     'small_16k': small_16k,
     'small_44k': small_44k,
     'medium_44k': medium_44k,
@@ -80,9 +78,9 @@ all_model_cfg: Dict[str, ModelConfig] = {
 def generate(
     clip_video: Optional[torch.Tensor],
     sync_video: Optional[torch.Tensor],
-    text: Optional[List[str]],
+    text: Optional[list[str]],
     *,
-    negative_text: Optional[List[str]] = None,
+    negative_text: Optional[list[str]] = None,
     feature_utils: FeaturesUtils,
     net: MMAudio,
    fm: FlowMatching,
@@ -90,7 +88,6 @@
     cfg_strength: float,
     clip_batch_size_multiplier: int = 40,
     sync_batch_size_multiplier: int = 40,
-    image_input: bool = False,
 ) -> torch.Tensor:
     device = feature_utils.device
     dtype = feature_utils.dtype
@@ -101,12 +98,10 @@
         clip_features = feature_utils.encode_video_with_clip(clip_video,
                                                              batch_size=bs *
                                                              clip_batch_size_multiplier)
-        if image_input:
-            clip_features = clip_features.expand(-1, net.clip_seq_len, -1)
     else:
         clip_features = net.get_empty_clip_sequence(bs)
 
-    if sync_video is not None and not image_input:
+    if sync_video is not None:
         sync_video = sync_video.to(device, dtype, non_blocking=True)
         sync_features = feature_utils.encode_video_with_sync(sync_video,
                                                              batch_size=bs *
@@ -144,7 +139,7 @@
     return audio
 
 
-LOGFORMAT = "[%(log_color)s%(levelname)-8s%(reset)s]: %(log_color)s%(message)s%(reset)s"
+LOGFORMAT = " %(log_color)s%(levelname)-8s%(reset)s | %(log_color)s%(message)s%(reset)s"
 
 
 def setup_eval_logging(log_level: int = logging.INFO):
@@ -158,14 +153,12 @@
     log.addHandler(stream)
 
 
-_CLIP_SIZE = 384
-_CLIP_FPS = 8.0
-
-_SYNC_SIZE = 224
-_SYNC_FPS = 25.0
-
-
 def load_video(video_path: Path, duration_sec: float, load_all_frames: bool = True) -> VideoInfo:
+    _CLIP_SIZE = 384
+    _CLIP_FPS = 8.0
+
+    _SYNC_SIZE = 224
+    _SYNC_FPS = 25.0
 
     clip_transform = v2.Compose([
         v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
@@ -220,36 +213,5 @@ def load_video(video_path: Path, duration_sec: float, load_all_frames: bool = Tr
     return video_info
 
 
-def load_image(image_path: Path) -> VideoInfo:
-    clip_transform = v2.Compose([
-        v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
-        v2.ToImage(),
-        v2.ToDtype(torch.float32, scale=True),
-    ])
-
-    sync_transform = v2.Compose([
-        v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
-        v2.CenterCrop(_SYNC_SIZE),
-        v2.ToImage(),
-        v2.ToDtype(torch.float32, scale=True),
-        v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
-    ])
-
-    frame = np.array(Image.open(image_path))
-
-    clip_chunk = torch.from_numpy(frame).unsqueeze(0).permute(0, 3, 1, 2)
-    sync_chunk = torch.from_numpy(frame).unsqueeze(0).permute(0, 3, 1, 2)
-
-    clip_frames = clip_transform(clip_chunk)
-    sync_frames = sync_transform(sync_chunk)
-
-    video_info = ImageInfo(
-        clip_frames=clip_frames,
-        sync_frames=sync_frames,
-        original_frame=frame,
-    )
-    return video_info
-
-
 def make_video(video_info: VideoInfo, output_path: Path, audio: torch.Tensor, sampling_rate: int):
     reencode_with_audio(video_info, output_path, audio, sampling_rate)
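Taken together, the hunks above point the default checkpoints back at `./weights/` and `./ext_weights/`, move the CLIP/Synchformer constants inside `load_video`, and drop still-image support (`load_image` and the `image_input` flag). The sketch below shows roughly how the trimmed-down API would be driven; it only mirrors the signatures visible in this diff, and the `rng` argument, the `FlowMatching` constructor values, and the externally built `net`/`feature_utils` objects are assumptions borrowed from the demo script rather than verified here.

```python
from pathlib import Path

import torch

from mmaudio.eval_utils import generate, load_video, make_video
from mmaudio.model.flow_matching import FlowMatching
from mmaudio.model.networks import MMAudio
from mmaudio.model.utils.features_utils import FeaturesUtils


@torch.inference_mode()
def video_to_audio(video_path: Path, prompt: str, *, net: MMAudio,
                   feature_utils: FeaturesUtils, duration_sec: float = 8.0,
                   cfg_strength: float = 4.5, seed: int = 42,
                   sampling_rate: int = 44100,
                   output_path: Path = Path('output.mp4')) -> None:
    # load_video already yields CLIP-sized and Synchformer-sized frame streams.
    video_info = load_video(video_path, duration_sec)
    clip_frames = video_info.clip_frames.unsqueeze(0)
    sync_frames = video_info.sync_frames.unsqueeze(0)

    fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=25)
    rng = torch.Generator(device=feature_utils.device).manual_seed(seed)

    # Note: no `image_input` flag any more; still images are no longer handled here.
    audio = generate(clip_frames, sync_frames, [prompt],
                     feature_utils=feature_utils, net=net, fm=fm,
                     rng=rng, cfg_strength=cfg_strength)

    make_video(video_info, output_path, audio.float().cpu()[0], sampling_rate=sampling_rate)
```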
mmaudio/ext/__pycache__/__init__.cpython-310.pyc DELETED
Binary file (191 Bytes)
 
mmaudio/ext/__pycache__/__init__.cpython-38.pyc DELETED
Binary file (189 Bytes)
 
mmaudio/ext/__pycache__/mel_converter.cpython-310.pyc DELETED
Binary file (2.87 kB)
 
mmaudio/ext/__pycache__/mel_converter.cpython-38.pyc DELETED
Binary file (2.84 kB)
 
mmaudio/ext/__pycache__/rotary_embeddings.cpython-310.pyc DELETED
Binary file (1.48 kB)
 
mmaudio/ext/__pycache__/rotary_embeddings.cpython-38.pyc DELETED
Binary file (1.45 kB)
 
mmaudio/ext/autoencoder/__pycache__/__init__.cpython-310.pyc DELETED
Binary file (256 Bytes)
 
mmaudio/ext/autoencoder/__pycache__/__init__.cpython-38.pyc DELETED
Binary file (254 Bytes)
 
mmaudio/ext/autoencoder/__pycache__/autoencoder.cpython-310.pyc DELETED
Binary file (2.14 kB)