diff --git a/README.md b/README.md
index 4fd9618dde607663fcd69b94b821a8603a243b8b..20cf8a7bc5b236d5e1064470df8d11d56bb8c752 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,5 @@
 ---
 title: DeepSound-V1
-emoji: ๐
 colorFrom: blue
 colorTo: indigo
 sdk: gradio
@@ -9,155 +8,160 @@ pinned: false
 ---
-# [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)
+
-[Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)
-University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation
+
+## [DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos](https://github.com/lym0302/DeepSound-V1)
-[[Paper (being prepared)]](https://hkchengrex.github.io/MMAudio) [[Project Page]](https://hkchengrex.github.io/MMAudio)
+
+
-**Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added. Code is subject to non-backward-compatible changes.**
+
 ## Highlight
-MMAudio generates synchronized audio given video and/or text inputs.
-Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.
-Moreover, a synchronization module aligns the generated audio with the video frames.
+DeepSound-V1 is a framework that enables audio generation from videos with initial step-by-step thinking, requiring no extra annotations, by leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM).
-
-## Results
+
+
 ## Installation
+```bash
+conda create -n deepsound-v1 python=3.10.16 -y
+conda activate deepsound-v1
+pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu120
+pip install flash-attn==2.5.8 --no-build-isolation
+pip install -e .
+pip install -r reqirments.txt
+```
+
-We have only tested this on Ubuntu.
+
-**Clone our repository:**
+
-```bash
-cd MMAudio
-pip install -e .
+
-(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
-
-**Pretrained models:**
-
-The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`
-
-| Model | Download link | File size |
-| -------- | ------- | ------- |
-| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
-| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
-| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
-| Flow prediction network, large 44.1kHz **(recommended)** | mmaudio_large_44k.pth | 3.9G |
-| 16kHz VAE | v1-16.pth | 655M |
-| 16kHz BigVGAN vocoder | best_netG.pt | 429M |
-| 44.1kHz VAE | v1-44.pth | 1.2G |
-| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
-
-The 44.1kHz vocoder will be downloaded automatically.
-
-The expected directory structure (full):
+
+
+
+
+
 ## Demo
-By default, these scripts use the `large_44k` model.
-In our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode) which should fit in most modern GPUs.
+### Pretrained models
+See [MODELS.md](docs/MODELS.md).
 ### Command-line interface
 With `demo.py`
+
 ```bash
-python demo.py --duration=8 --video=<path-to-video>
 ```
-- Example 1: Ice cracking with sharp snapping sound, and metal tool scraping against the ice surface.
-- Example 2: Rhythmic splashing and lapping of water.
-- Example 3: Shovel scrapes against dry earth.
-- (Failure case) Example 4: Creamy sound of mashed potatoes being scooped.
+
+
+ +
+
+
+> [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://github.com/DAMO-NLP-SG/Video-LLaMA)
+> Hang Zhang, Xin Li, Lidong Bing
+[[GitHub]](https://github.com/DAMO-NLP-SG/Video-LLaMA) [[Paper]](https://arxiv.org/abs/2306.02858)
+
+> [**VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding**](https://arxiv.org/abs/2311.16922)
+> Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
+[[GitHub]](https://github.com/DAMO-NLP-SG/VCD) [[Paper]](https://arxiv.org/abs/2311.16922)
+
+> [**The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio**](https://arxiv.org/abs/2410.12787)
+> Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
+[[GitHub]](https://github.com/DAMO-NLP-SG/CMM) [[Paper]](https://arxiv.org/abs/2410.12787)
+
+
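A note on the Installation block added in the diff above: it pins Python 3.10.16, a CUDA build of PyTorch, and flash-attn 2.5.8. The snippet below is an editor's sketch of a quick post-install sanity check, assuming only that those packages were installed as listed; it is not a script from the DeepSound-V1 repository.

```python
# Post-install sanity check (editor's sketch, not part of DeepSound-V1).
# Assumes only that torch and flash-attn were installed as in the README's
# Installation section.
import importlib.util

import torch


def check_environment() -> None:
    print(f"torch {torch.__version__} (CUDA build: {torch.version.cuda})")
    if torch.cuda.is_available():
        print(f"GPU detected: {torch.cuda.get_device_name(0)}")
    else:
        print("WARNING: CUDA not available; inference would fall back to CPU.")
    # flash-attn was installed with --no-build-isolation; confirm it is importable.
    if importlib.util.find_spec("flash_attn") is None:
        print("WARNING: flash_attn not importable; retry the flash-attn install step.")
    else:
        print("flash_attn import OK")


if __name__ == "__main__":
    check_environment()
```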
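The Highlight line added in the diff describes audio generation from videos with initial step-by-step thinking based on an MLLM's internal chain-of-thought. The sketch below only illustrates that kind of generate, judge, refine control flow; every function name and the sample path are hypothetical, assumed for illustration, and none of it reflects the actual DeepSound-V1 API or checkpoints.

```python
# Hypothetical illustration of a step-by-step (CoT-style) video-to-audio loop.
# generate_audio, judge_with_mllm, and refine_audio are stand-ins assumed for
# this sketch; DeepSound-V1's real components are not shown here.


def generate_audio(video_path: str) -> str:
    """Stand-in V2A call: pretend to produce a candidate audio file."""
    return video_path.rsplit(".", 1)[0] + ".flac"


def judge_with_mllm(video_path: str, audio_path: str) -> bool:
    """Stand-in MLLM chain-of-thought check on the candidate audio."""
    return True  # placeholder verdict


def refine_audio(audio_path: str) -> str:
    """Stand-in refinement step, e.g. regenerate or post-process the audio."""
    return audio_path


def step_by_step_v2a(video_path: str, max_rounds: int = 2) -> str:
    audio = generate_audio(video_path)          # step 1: draft audio from the video
    for _ in range(max_rounds):
        if judge_with_mllm(video_path, audio):  # step 2: MLLM verdict on the draft
            break                               # accepted, stop the loop
        audio = refine_audio(audio)             # step 3: refine and judge again
    return audio


if __name__ == "__main__":
    print(step_by_step_v2a("example.mp4"))
```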