lorocksUMD committed on
Commit f2b5019 · verified · 1 Parent(s): e1b5568

Delete DenseAV

DenseAV/.gitignore DELETED
@@ -1,5 +0,0 @@
1
- # Created by .ignore support plugin (hsz.mobi)
2
- results/attention/*
3
- results/features/*
4
-
5
- .env
DenseAV/LICENSE DELETED
@@ -1,22 +0,0 @@
1
- MIT License
2
-
3
- Copyright (c) Mark Hamilton. All rights reserved.
4
-
5
- Permission is hereby granted, free of charge, to any person obtaining a
6
- copy of this software and associated documentation files (the
7
- "Software"), to deal in the Software without restriction, including
8
- without limitation the rights to use, copy, modify, merge, publish,
9
- distribute, sublicense, and/or sell copies of the Software, and to
10
- permit persons to whom the Software is furnished to do so, subject to
11
- the following conditions:
12
-
13
- The above copyright notice and this permission notice shall be included
14
- in all copies or substantial portions of the Software.
15
-
16
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
17
- OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
DenseAV/README.md DELETED
@@ -1,172 +0,0 @@
1
- # Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
2
- ### CVPR 2024
3
-
4
-
5
- [![Website](https://img.shields.io/badge/DenseAV-%F0%9F%8C%90Website-purple?style=flat)](https://aka.ms/denseav) [![arXiv](https://img.shields.io/badge/arXiv-2406.05629-b31b1b.svg)](https://arxiv.org/abs/2406.05629) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mhamilton723/DenseAV/blob/main/demo.ipynb)
6
-
7
- [![Huggingface](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DenseAV-orange)](https://huggingface.co/spaces/mhamilton723/DenseAV)
8
-
9
- [//]: # ([![Huggingface](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Paper%20Page-orange)](https://huggingface.co/papers/2403.10516))
10
- [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/separating-the-chirp-from-the-chat-self/speech-prompted-semantic-segmentation-on)](https://paperswithcode.com/sota/speech-prompted-semantic-segmentation-on?p=separating-the-chirp-from-the-chat-self)
11
- [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/separating-the-chirp-from-the-chat-self/sound-prompted-semantic-segmentation-on)](https://paperswithcode.com/sota/sound-prompted-semantic-segmentation-on?p=separating-the-chirp-from-the-chat-self)
12
-
13
-
14
- [Mark Hamilton](https://mhamilton.net/),
15
- [Andrew Zisserman](https://www.robots.ox.ac.uk/~az/),
16
- [John R. Hershey](https://research.google/people/john-hershey/),
17
- [William T. Freeman](https://billf.mit.edu/about/bio)
18
-
19
- ![DenseAV Overview Graphic](https://mhamilton.net/images/hero_fig_black.jpg)
20
-
21
- **TL;DR**: Our model, DenseAV, learns the meaning of words and the location of sounds (visual grounding) without supervision or text.
22
-
23
- https://github.com/mhamilton723/DenseAV/assets/6456637/ba908ab5-9618-42f9-8d7a-30ecb009091f
24
-
25
-
26
- ## Contents
27
- <!--ts-->
28
- * [Install](#install)
29
- * [Model Zoo](#model-zoo)
30
- * [Getting Datasets](#getting-datasets)
31
- * [Evaluate Models](#evaluate-models)
32
- * [Train a Model](#train-a-model)
33
- * [Local Gradio Demo](#local-gradio-demo)
34
- * [Coming Soon](#coming-soon)
35
- * [Citation](#citation)
36
- * [Contact](#contact)
37
- <!--te-->
38
-
39
- ## Install
40
-
41
- To use DenseAV locally, clone the repository:
42
-
43
- ```shell script
44
- git clone https://github.com/mhamilton723/DenseAV.git
45
- cd DenseAV
46
- pip install -e .
47
- ```
48
-
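A quick sanity check, as a minimal sketch assuming the editable install above succeeded: the package should import cleanly, with the module name `denseav` matching the source tree shown later in this commit.

```python
# Minimal sketch: confirm the editable install succeeded.
import denseav
print(denseav.__file__)  # should point into the cloned DenseAV repository
```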
49
-
50
- ## Model Zoo
51
-
52
- For examples of pretrained model usage, please see our [Colab notebook](https://colab.research.google.com/github/mhamilton723/DenseAV/blob/main/demo.ipynb). We currently supply the following pretrained models:
53
-
54
- | Model Name | Checkpoint | Torch Hub Repository | Torch Hub Name |
55
- |-------------------------------|----------------------------------------------------------------------------------------------------------------------------------|----------------------|--------------------|
56
- | Sound | [Download](https://marhamilresearch4.blob.core.windows.net/denseav-public/hub/denseav_sound.ckpt) | mhamilton723/DenseAV | sound |
57
- | Language | [Download](https://marhamilresearch4.blob.core.windows.net/denseav-public/hub/denseav_language.ckpt) | mhamilton723/DenseAV | language |
58
- | Sound + Language (Two Headed) | [Download](https://marhamilresearch4.blob.core.windows.net/denseav-public/hub/denseav_2head.ckpt) | mhamilton723/DenseAV | sound_and_language |
59
-
60
- For example, to load the model trained on both sound and language:
61
-
62
- ```python
63
- model = torch.hub.load("mhamilton723/DenseAV", 'sound_and_language')
64
- ```
65
-
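A brief usage sketch: the Torch Hub entry points above return a standard PyTorch module, so device placement and eval mode follow ordinary PyTorch conventions; the full inference API is covered in the Colab notebook linked above, not here.

```python
import torch

# Load the two-headed checkpoint via Torch Hub (entry point name from the table above)
model = torch.hub.load("mhamilton723/DenseAV", "sound_and_language")

# Standard PyTorch inference preparation; not a DenseAV-specific API
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```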
66
- ### Load from HuggingFace
67
-
68
- ```python
69
- from denseav.train import LitAVAligner
70
-
71
- model1 = LitAVAligner.from_pretrained("mhamilton723/DenseAV-sound")
72
- model2 = LitAVAligner.from_pretrained("mhamilton723/DenseAV-language")
73
- model3 = LitAVAligner.from_pretrained("mhamilton723/DenseAV-sound-language")
74
- ```
75
-
76
-
77
- ## Getting Datasets
78
-
79
- Our code assumes that all data lives in a common directory on your system; in these examples we use `/path/to/your/data`. Our code will often reference this directory as the `data_root`.
80
-
81
- ### Speech and Sound Prompted ADE20K
82
-
83
- To download our new Speech and Sound prompted ADE20K Dataset:
84
-
85
- ```bash
86
- cd /path/to/your/data
87
- wget https://marhamilresearch4.blob.core.windows.net/denseav-public/datasets/ADE20KSoundPrompted.zip
88
- unzip ADE20KSoundPrompted.zip
89
- wget https://marhamilresearch4.blob.core.windows.net/denseav-public/datasets/ADE20KSpeechPrompted.zip
90
- unzip ADE20KSpeechPrompted.zip
91
- ```
92
-
93
- ### Places Audio
94
-
95
- First, download the Places Audio dataset from its [original source](https://groups.csail.mit.edu/sls/downloads/placesaudio/downloads.cgi).
96
-
97
- To run the code, the data will need to be processed into the following form:
98
-
99
- ```
100
- [Instructions coming soon]
101
- ```
102
-
103
- ### Audioset
104
-
105
- Because of copyright issues, we cannot make [Audioset](https://research.google.com/audioset/dataset/index.html) easily available for download.
106
- First download this dataset through appropriate means. [This other project](https://github.com/ktonal/audioset-downloader) appears to make this simple.
107
-
108
- To run the code, the data will need to be processed into the following form:
109
-
110
- ```
111
- [Instructions coming soon]
112
- ```
113
-
114
-
115
- ## Evaluate Models
116
-
117
- To evaluate a trained model, first install DenseAV for
118
- [local development](#install). Then run:
119
-
120
- ```shell
121
- cd denseav
122
- python evaluate.py
123
- ```
124
-
125
- After evaluation, see the results in tensorboard's hparams tab.
126
-
127
- ```shell
128
- cd ../logs/evaluate
129
- tensorboard --logdir .
130
- ```
131
-
132
- Then visit [http://localhost:6006](http://localhost:6006) and click on hparams to browse results. We report "advanced" speech metrics and "basic" sound metrics in our paper.
133
-
134
-
135
- ## Train a Model
136
-
137
- ```shell
138
- cd denseav
139
- python train.py
140
- ```
141
-
142
- ## Local Gradio Demo
143
-
144
- To run our [HuggingFace Spaces hosted DenseAV demo](https://huggingface.co/spaces/mhamilton723/DenseAV) locally, first install DenseAV for local development. Then run:
145
-
146
- ```shell
147
- python gradio_app.py
148
- ```
149
-
150
- Wait a few seconds for the demo to spin up, then navigate to [http://localhost:7860/](http://localhost:7860/) to view the demo.
151
-
152
-
153
- ## Coming Soon:
154
-
155
- - Bigger models!
156
-
157
- ## Citation
158
-
159
- ```
160
- @misc{hamilton2024separating,
161
- title={Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language},
162
- author={Mark Hamilton and Andrew Zisserman and John R. Hershey and William T. Freeman},
163
- year={2024},
164
- eprint={2406.05629},
165
- archivePrefix={arXiv},
166
- primaryClass={cs.CV}
167
- }
168
- ```
169
-
170
- ## Contact
171
-
172
- For feedback, questions, or press inquiries please contact [Mark Hamilton](mailto:[email protected])
DenseAV/__init__.py DELETED
File without changes
DenseAV/demo.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
DenseAV/denseav/__init__.py DELETED
File without changes
DenseAV/denseav/aggregators.py DELETED
@@ -1,517 +0,0 @@
1
- from abc import abstractmethod
2
-
3
- import math
4
- import torch
5
- import torch.nn as nn
6
- import torch.nn.functional as F
7
- from tqdm import tqdm
8
-
9
- from denseav.constants import *
10
-
11
-
12
- @torch.jit.script
13
- def masked_mean(x: torch.Tensor, mask: torch.Tensor, dim: int):
14
- mask = mask.to(x)
15
- return (x * mask).sum(dim, keepdim=True) / mask.sum(dim, keepdim=True).clamp_min(.001)
16
-
17
-
18
- @torch.jit.script
19
- def masked_max(x: torch.Tensor, mask: torch.Tensor, dim: int):
20
- mask = mask.to(torch.bool)
21
- eps = 1e7
22
- # eps = torch.finfo(x.dtype).max
23
- return (x - (~mask) * eps).max(dim, keepdim=True).values
24
-
25
-
26
- def masked_lse(x: torch.Tensor, mask: torch.Tensor, dim: int, temp):
27
- x = x.to(torch.float32)
28
- mask = mask.to(torch.float32)
29
- x_masked = (x - (1 - mask) * torch.finfo(x.dtype).max)
30
- return (torch.logsumexp(x_masked * temp, dim, keepdim=True) - torch.log(mask.sum(dim, keepdim=True))) / temp
31
-
32
-
33
- class BaseAggregator(torch.nn.Module):
34
-
35
- def __init__(self, nonneg_sim, mask_silence, num_heads, head_agg, use_cls):
36
- super().__init__()
37
-
38
- self.nonneg_sim = nonneg_sim
39
- self.mask_silence = mask_silence
40
- self.num_heads = num_heads
41
- self.head_agg = head_agg
42
- self.use_cls = use_cls
43
-
44
- @abstractmethod
45
- def _agg_sim(self, sim, mask):
46
- pass
47
-
48
- def prepare_sims(self, sim, mask, agg_sim, agg_heads):
49
- sim_size = sim.shape
50
- assert len(mask.shape) == 2
51
- assert len(sim_size) in {6, 7}, f"sim has wrong number of dimensions: {sim.shape}"
52
- pairwise = len(sim_size) == 6
53
-
54
- if self.mask_silence:
55
- mask = mask
56
- else:
57
- mask = torch.ones_like(mask)
58
-
59
- if self.nonneg_sim:
60
- sim = sim.clamp_min(0)
61
-
62
- if pairwise:
63
- head_dim = 1
64
- else:
65
- head_dim = 2
66
-
67
- if self.head_agg == "max_elementwise" and agg_heads:
68
- sim = sim.max(head_dim, keepdim=True).values
69
-
70
- if agg_sim:
71
- sim = self._agg_sim(sim, mask)
72
-
73
- if agg_heads:
74
- if self.head_agg == "sum" or self.head_agg == "max_elementwise":
75
- sim = sim.sum(head_dim)
76
- elif self.head_agg == "max":
77
- sim = sim.max(head_dim).values
78
- else:
79
- raise ValueError(f"Unknown head_agg: {self.head_agg}")
80
-
81
- return sim
82
-
83
- def _get_full_sims(self, preds, raw, agg_sim, agg_heads):
84
- if agg_sim or agg_heads or raw:
85
- assert (agg_sim or agg_heads) != raw, "Cannot have raw on at the same time as agg_sim or agg_heads"
86
-
87
- audio_feats = preds[AUDIO_FEATS]
88
- audio_mask = preds[AUDIO_MASK]
89
- image_feats = preds[IMAGE_FEATS]
90
-
91
- b1, c2, f, t1 = audio_feats.shape
92
- b2, t2 = audio_mask.shape
93
- d, c1, h, w = image_feats.shape
94
- assert b1 == b2 and c1 == c2 and t1 == t2
95
- assert c1 % self.num_heads == 0
96
- new_c = c1 // self.num_heads
97
- audio_feats = audio_feats.reshape(b1, self.num_heads, new_c, f, t1)
98
- image_feats = image_feats.reshape(d, self.num_heads, new_c, h, w)
99
- raw_sims = torch.einsum(
100
- "akcft,vkchw->avkhwft",
101
- audio_feats.to(torch.float32),
102
- image_feats.to(torch.float32))
103
-
104
- if self.use_cls:
105
- audio_cls = preds[AUDIO_CLS].reshape(b1, self.num_heads, new_c)
106
- image_cls = preds[IMAGE_CLS].reshape(d, self.num_heads, new_c)
107
- cls_sims = torch.einsum(
108
- "akc,vkc->avk",
109
- audio_cls.to(torch.float32),
110
- image_cls.to(torch.float32))
111
- raw_sims += cls_sims.reshape(b1, d, self.num_heads, 1, 1, 1, 1)
112
-
113
- if raw:
114
- return raw_sims
115
- else:
116
- return self.prepare_sims(raw_sims, audio_mask, agg_sim, agg_heads)
117
-
118
- def get_pairwise_sims(self, preds, raw, agg_sim, agg_heads):
119
- if agg_sim or agg_heads or raw:
120
- assert (agg_sim or agg_heads) != raw, "Cannot have raw on at the same time as agg_sim or agg_heads"
121
-
122
- audio_feats = preds[AUDIO_FEATS]
123
- audio_mask = preds[AUDIO_MASK]
124
- image_feats = preds[IMAGE_FEATS]
125
-
126
- a1, c1, f, t1 = audio_feats.shape
127
- a2, t2 = audio_mask.shape
128
-
129
- assert c1 % self.num_heads == 0
130
- new_c = c1 // self.num_heads
131
- audio_feats = audio_feats.reshape(a1, self.num_heads, new_c, f, t1)
132
-
133
- if len(image_feats.shape) == 5:
134
- print("Using similarity for video, should only be called during plotting")
135
- v, vt, c2, h, w = image_feats.shape
136
- image_feats = image_feats.reshape(v, vt, self.num_heads, new_c, h, w)
137
- raw_sims = torch.einsum(
138
- "bkcft,bskchw,bt->bskhwft",
139
- audio_feats.to(torch.float32),
140
- image_feats.to(torch.float32),
141
- audio_mask.to(torch.float32))
142
-
143
- if self.use_cls:
144
- audio_cls = preds[AUDIO_CLS].reshape(v, self.num_heads, new_c)
145
- image_cls = preds[IMAGE_CLS].reshape(v, vt, self.num_heads, new_c)
146
- cls_sims = torch.einsum(
147
- "bkc,bskc->bsk",
148
- audio_cls.to(torch.float32),
149
- image_cls.to(torch.float32))
150
- raw_sims += cls_sims.reshape(v, vt, self.num_heads, 1, 1, 1, 1)
151
-
152
-
153
- elif len(image_feats.shape) == 4:
154
- v, c2, h, w = image_feats.shape
155
- image_feats = image_feats.reshape(v, self.num_heads, new_c, h, w)
156
- raw_sims = torch.einsum(
157
- "bkcft,bkchw,bt->bkhwft",
158
- audio_feats.to(torch.float32),
159
- image_feats.to(torch.float32),
160
- audio_mask.to(torch.float32))
161
-
162
- if self.use_cls:
163
- audio_cls = preds[AUDIO_CLS].reshape(v, self.num_heads, new_c)
164
- image_cls = preds[IMAGE_CLS].reshape(v, self.num_heads, new_c)
165
- cls_sims = torch.einsum(
166
- "bkc,bkc->bk",
167
- audio_cls.to(torch.float32),
168
- image_cls.to(torch.float32))
169
- raw_sims += cls_sims.reshape(v, self.num_heads, 1, 1, 1, 1)
170
- else:
171
- raise ValueError(f"Improper image shape: {image_feats.shape}")
172
-
173
- assert a1 == a2 and c1 == c2 and t1 == t2
174
-
175
- if raw:
176
- return raw_sims
177
- else:
178
- return self.prepare_sims(raw_sims, audio_mask, agg_sim, agg_heads)
179
-
180
- def forward(self, preds, agg_heads):
181
- return self._get_full_sims(
182
- preds, raw=False, agg_sim=True, agg_heads=agg_heads)
183
-
184
- def forward_batched(self, preds, agg_heads, batch_size):
185
- new_preds = {k: v for k, v in preds.items()}
186
- big_image_feats = new_preds.pop(IMAGE_FEATS)
187
- if self.use_cls:
188
- big_image_cls = new_preds.pop(IMAGE_CLS)
189
-
190
- n = big_image_feats.shape[0]
191
- n_steps = math.ceil(n / batch_size)
192
- outputs = []
193
- for step in tqdm(range(n_steps), "Calculating Sim", leave=False):
194
- new_preds[IMAGE_FEATS] = big_image_feats[step * batch_size:(step + 1) * batch_size].cuda()
195
- if self.use_cls:
196
- new_preds[IMAGE_CLS] = big_image_cls[step * batch_size:(step + 1) * batch_size].cuda()
197
-
198
- sim = self.forward(new_preds, agg_heads=agg_heads)
199
- outputs.append(sim.cpu())
200
- return torch.cat(outputs, dim=1)
201
-
202
-
203
- class ImageThenAudioAggregator(BaseAggregator):
204
-
205
- def __init__(self, image_agg_type, audio_agg_type, nonneg_sim, mask_silence, num_heads, head_agg, use_cls):
206
- super().__init__(nonneg_sim, mask_silence, num_heads, head_agg, use_cls)
207
- if image_agg_type == "max":
208
- self.image_agg = lambda x, dim: x.max(dim=dim, keepdim=True).values
209
- elif image_agg_type == "avg":
210
- self.image_agg = lambda x, dim: x.mean(dim=dim, keepdim=True)
211
- else:
212
- raise ValueError(f"Unknown image_agg_type {image_agg_type}")
213
-
214
- if audio_agg_type == "max":
215
- self.time_agg = masked_max
216
- elif audio_agg_type == "avg":
217
- self.time_agg = masked_mean
218
- else:
219
- raise ValueError(f"Unknown audio_agg_type {audio_agg_type}")
220
-
221
- self.freq_agg = lambda x, dim: x.mean(dim=dim, keepdim=True)
222
-
223
- def _agg_sim(self, sim, mask):
224
- sim_shape = sim.shape
225
- new_mask_shape = [1] * len(sim_shape)
226
- new_mask_shape[0] = sim_shape[0]
227
- new_mask_shape[-1] = sim_shape[-1]
228
- mask = mask.reshape(new_mask_shape)
229
- sim = self.image_agg(sim, -3)
230
- sim = self.image_agg(sim, -4)
231
- sim = self.freq_agg(sim, -2)
232
- sim = self.time_agg(sim, mask, -1)
233
- return sim.squeeze(-1).squeeze(-1).squeeze(-1).squeeze(-1)
234
-
235
-
236
- class PairedAggregator(BaseAggregator):
237
-
238
- def __init__(self, nonneg_sim, mask_silence, num_heads, head_agg, use_cls):
239
- super().__init__(nonneg_sim, mask_silence, num_heads, head_agg, use_cls)
240
- self.image_agg_max = lambda x, dim: x.max(dim=dim, keepdim=True).values
241
- self.image_agg_mean = lambda x, dim: x.mean(dim=dim, keepdim=True)
242
-
243
- self.time_agg_max = masked_max
244
- self.time_agg_mean = masked_mean
245
-
246
- self.freq_agg = lambda x, dim: x.mean(dim=dim, keepdim=True)
247
-
248
- def _agg_sim(self, sim, mask):
249
- sim_shape = sim.shape
250
- new_mask_shape = [1] * len(sim_shape)
251
- new_mask_shape[0] = sim_shape[0]
252
- new_mask_shape[-1] = sim_shape[-1]
253
- mask = mask.reshape(new_mask_shape)
254
-
255
- sim_1 = self.image_agg_max(sim, -3)
256
- sim_1 = self.image_agg_max(sim_1, -4)
257
- sim_1 = self.freq_agg(sim_1, -2)
258
- sim_1 = self.time_agg_mean(sim_1, mask, -1)
259
-
260
- sim_2 = self.freq_agg(sim, -2)
261
- sim_2 = self.time_agg_max(sim_2, mask, -1)
262
- sim_2 = self.image_agg_mean(sim_2, -3)
263
- sim_2 = self.image_agg_mean(sim_2, -4)
264
-
265
- sim = 1 / 2 * (sim_1 + sim_2)
266
-
267
- return sim.squeeze(-1).squeeze(-1).squeeze(-1).squeeze(-1)
268
-
269
-
270
-
271
- class CAVMAEAggregator(BaseAggregator):
272
-
273
- def __init__(self, *args, **kwargs):
274
- super().__init__(False, False, 1, "sum", False)
275
-
276
- def _get_full_sims(self, preds, raw, agg_sim, agg_heads):
277
- if agg_sim:
278
- audio_feats = preds[AUDIO_FEATS]
279
- image_feats = preds[IMAGE_FEATS]
280
- pool_audio_feats = F.normalize(audio_feats.mean(dim=[-1, -2]), dim=1)
281
- pool_image_feats = F.normalize(image_feats.mean(dim=[-1, -2]), dim=1)
282
- sims = torch.einsum(
283
- "bc,dc->bd",
284
- pool_audio_feats.to(torch.float32),
285
- pool_image_feats.to(torch.float32))
286
- if agg_heads:
287
- return sims
288
- else:
289
- return sims.unsqueeze(-1)
290
-
291
- else:
292
- return BaseAggregator._get_full_sims(self, preds, raw, agg_sim, agg_heads)
293
-
294
- def get_pairwise_sims(self, preds, raw, agg_sim, agg_heads):
295
- if agg_sim:
296
- audio_feats = preds[AUDIO_FEATS]
297
- image_feats = preds[IMAGE_FEATS]
298
- pool_audio_feats = F.normalize(audio_feats.mean(dim=[-1, -2]), dim=1)
299
- pool_image_feats = F.normalize(image_feats.mean(dim=[-1, -2]), dim=1)
300
- sims = torch.einsum(
301
- "bc,bc->b",
302
- pool_audio_feats.to(torch.float32),
303
- pool_image_feats.to(torch.float32))
304
- if agg_heads:
305
- return sims
306
- else:
307
- return sims.unsqueeze(-1)
308
-
309
- else:
310
- return BaseAggregator.get_pairwise_sims(self, preds, raw, agg_sim, agg_heads)
311
-
312
-
313
- class ImageBindAggregator(BaseAggregator):
314
-
315
- def __init__(self, num_heads, *args, **kwargs):
316
- super().__init__(False, False, num_heads, "sum", False)
317
-
318
- def _get_full_sims(self, preds, raw, agg_sim, agg_heads):
319
- if agg_sim:
320
- sims = torch.einsum(
321
- "bc,dc->bd",
322
- preds[AUDIO_CLS].to(torch.float32),
323
- preds[IMAGE_CLS].to(torch.float32))
324
- if agg_heads:
325
- return sims
326
- else:
327
- sims = sims.unsqueeze(-1)
328
- return sims.repeat(*([1] * (sims.dim() - 1)), self.num_heads)
329
-
330
-
331
- else:
332
- return BaseAggregator._get_full_sims(self, preds, raw, agg_sim, agg_heads)
333
-
334
- def get_pairwise_sims(self, preds, raw, agg_sim, agg_heads):
335
- if agg_sim:
336
- sims = torch.einsum(
337
- "bc,dc->b",
338
- preds[AUDIO_CLS].to(torch.float32),
339
- preds[IMAGE_CLS].to(torch.float32))
340
- if agg_heads:
341
- return sims
342
- else:
343
- sims = sims.unsqueeze(-1)
344
- return sims.repeat(*([1] * (sims.dim() - 1)), self.num_heads)
345
-
346
- else:
347
- return BaseAggregator.get_pairwise_sims(self, preds, raw, agg_sim, agg_heads)
348
-
349
- def forward_batched(self, preds, agg_heads, batch_size):
350
- return self.forward(preds, agg_heads)
351
-
352
-
353
- class SimPool(nn.Module):
354
- def __init__(self, dim, num_heads=1, qkv_bias=False, qk_scale=None, gamma=None, use_beta=False):
355
- super().__init__()
356
- self.num_heads = num_heads
357
- head_dim = dim // num_heads
358
- self.scale = qk_scale or head_dim ** -0.5
359
-
360
- self.norm_patches = nn.LayerNorm(dim, eps=1e-6)
361
-
362
- self.wq = nn.Linear(dim, dim, bias=qkv_bias)
363
- self.wk = nn.Linear(dim, dim, bias=qkv_bias)
364
-
365
- if gamma is not None:
366
- self.gamma = torch.tensor([gamma])
367
- if use_beta:
368
- self.beta = nn.Parameter(torch.tensor([0.0]))
369
- self.eps = torch.tensor([1e-6])
370
-
371
- self.gamma = gamma
372
- self.use_beta = use_beta
373
-
374
- def prepare_input(self, x):
375
- if len(x.shape) == 3: # Transformer
376
- # Input tensor dimensions:
377
- # x: (B, N, d), where B is batch size, N are patch tokens, d is depth (channels)
378
- B, N, d = x.shape
379
- gap_cls = x.mean(-2) # (B, N, d) -> (B, d)
380
- gap_cls = gap_cls.unsqueeze(1) # (B, d) -> (B, 1, d)
381
- return gap_cls, x
382
- if len(x.shape) == 4: # CNN
383
- # Input tensor dimensions:
384
- # x: (B, d, H, W), where B is batch size, d is depth (channels), H is height, and W is width
385
- B, d, H, W = x.shape
386
- gap_cls = x.mean([-2, -1]) # (B, d, H, W) -> (B, d)
387
- x = x.reshape(B, d, H * W).permute(0, 2, 1) # (B, d, H, W) -> (B, d, H*W) -> (B, H*W, d)
388
- gap_cls = gap_cls.unsqueeze(1) # (B, d) -> (B, 1, d)
389
- return gap_cls, x
390
- else:
391
- raise ValueError(f"Unsupported number of dimensions in input tensor: {len(x.shape)}")
392
-
393
- def forward(self, x):
394
- self.eps = self.eps.to(x.device)
395
- # Prepare input tensor and perform GAP as initialization
396
- gap_cls, x = self.prepare_input(x)
397
-
398
- # Prepare queries (q), keys (k), and values (v)
399
- q, k, v = gap_cls, self.norm_patches(x), self.norm_patches(x)
400
-
401
- # Extract dimensions after normalization
402
- Bq, Nq, dq = q.shape
403
- Bk, Nk, dk = k.shape
404
- Bv, Nv, dv = v.shape
405
-
406
- # Check dimension consistency across batches and channels
407
- assert Bq == Bk == Bv
408
- assert dq == dk == dv
409
-
410
- # Apply linear transformation for queries and keys then reshape
411
- qq = self.wq(q).reshape(Bq, Nq, self.num_heads, dq // self.num_heads).permute(0, 2, 1,
412
- 3) # (Bq, Nq, dq) -> (B, num_heads, Nq, dq/num_heads)
413
- kk = self.wk(k).reshape(Bk, Nk, self.num_heads, dk // self.num_heads).permute(0, 2, 1,
414
- 3) # (Bk, Nk, dk) -> (B, num_heads, Nk, dk/num_heads)
415
-
416
- vv = v.reshape(Bv, Nv, self.num_heads, dv // self.num_heads).permute(0, 2, 1,
417
- 3) # (Bv, Nv, dv) -> (B, num_heads, Nv, dv/num_heads)
418
-
419
- # Compute attention scores
420
- attn = (qq @ kk.transpose(-2, -1)) * self.scale
421
- # Apply softmax for normalization
422
- attn = attn.softmax(dim=-1)
423
-
424
- # If gamma scaling is used
425
- if self.gamma is not None:
426
- # Apply gamma scaling on values and compute the weighted sum using attention scores
427
- x = torch.pow(attn @ torch.pow((vv - vv.min() + self.eps), self.gamma),
428
- 1 / self.gamma) # (B, num_heads, Nv, dv/num_heads) -> (B, 1, 1, d)
429
- # If use_beta, add a learnable translation
430
- if self.use_beta:
431
- x = x + self.beta
432
- else:
433
- # Compute the weighted sum using attention scores
434
- x = (attn @ vv).transpose(1, 2).reshape(Bq, Nq, dq)
435
-
436
- return x.squeeze()
437
-
438
-
439
-
440
- class SimPoolAggregator(BaseAggregator):
441
-
442
- def __init__(self, num_heads, dim, *args, **kwargs):
443
- super().__init__(False, False, num_heads, "sum", False)
444
- self.pool = SimPool(dim, gamma=1.25)
445
-
446
- def _get_full_sims(self, preds, raw, agg_sim, agg_heads):
447
- if agg_sim:
448
- device = self.pool.wq.weight.data.device
449
- pooled_audio = self.pool(preds[AUDIO_FEATS].to(torch.float32).to(device))
450
- pooled_image = self.pool(preds[IMAGE_FEATS].to(torch.float32).to(device))
451
-
452
- sims = torch.einsum(
453
- "bc,dc->bd",
454
- pooled_audio,
455
- pooled_image)
456
- if agg_heads:
457
- return sims
458
- else:
459
- sims = sims.unsqueeze(-1)
460
- return sims.repeat(*([1] * (sims.dim() - 1)), self.num_heads)
461
-
462
-
463
- else:
464
- return BaseAggregator._get_full_sims(self, preds, raw, agg_sim, agg_heads)
465
-
466
- def get_pairwise_sims(self, preds, raw, agg_sim, agg_heads):
467
- if agg_sim:
468
- device = self.pool.wq.weight.data.device
469
- pooled_audio = self.pool(preds[AUDIO_FEATS].to(torch.float32).to(device))
470
- pooled_image = self.pool(preds[IMAGE_FEATS].to(torch.float32).to(device))
471
-
472
- sims = torch.einsum(
473
- "bc,dc->b",
474
- pooled_audio,
475
- pooled_image)
476
- if agg_heads:
477
- return sims
478
- else:
479
- sims = sims.unsqueeze(-1)
480
- return sims.repeat(*([1] * (sims.dim() - 1)), self.num_heads)
481
-
482
- else:
483
- return BaseAggregator.get_pairwise_sims(self, preds, raw, agg_sim, agg_heads)
484
-
485
- def forward_batched(self, preds, agg_heads, batch_size):
486
- return self.forward(preds, agg_heads)
487
-
488
-
489
-
490
- def get_aggregator(sim_agg_type, nonneg_sim, mask_silence, num_heads, head_agg, use_cls, dim):
491
- shared_args = dict(
492
- nonneg_sim=nonneg_sim,
493
- mask_silence=mask_silence,
494
- num_heads=num_heads,
495
- head_agg=head_agg,
496
- use_cls=use_cls,
497
- )
498
-
499
- if sim_agg_type == "paired":
500
- agg1 = PairedAggregator(**shared_args)
501
- elif sim_agg_type == "misa":
502
- agg1 = ImageThenAudioAggregator("max", "avg", **shared_args)
503
- elif sim_agg_type == "mima":
504
- agg1 = ImageThenAudioAggregator("max", "max", **shared_args)
505
- elif sim_agg_type == "sisa":
506
- agg1 = ImageThenAudioAggregator("avg", "avg", **shared_args)
507
- elif sim_agg_type == "cavmae":
508
- agg1 = CAVMAEAggregator()
509
- elif sim_agg_type == "imagebind":
510
- agg1 = ImageBindAggregator(num_heads=shared_args["num_heads"])
511
- elif sim_agg_type == "simpool":
512
- agg1 = SimPoolAggregator(num_heads=shared_args["num_heads"], dim=dim)
513
- else:
514
- raise ValueError(f"Unknown loss_type {sim_agg_type}")
515
-
516
- return agg1
517
-
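A hedged usage sketch of the factory above, with illustrative tensor shapes (the dictionary keys come from `denseav/constants.py`; nothing here is taken from the repository's documentation):

```python
import torch
from denseav.aggregators import get_aggregator
from denseav.constants import AUDIO_FEATS, AUDIO_MASK, IMAGE_FEATS

# "misa": max over image locations, then masked average over audio time
agg = get_aggregator(
    sim_agg_type="misa",
    nonneg_sim=False,
    mask_silence=True,
    num_heads=1,
    head_agg="max_elementwise",
    use_cls=False,
    dim=None,  # only used by the "simpool" aggregator
)

# Toy features: 2 clips, 64 channels, 8 mel bins x 32 audio frames, 14x14 image patches
preds = {
    AUDIO_FEATS: torch.randn(2, 64, 8, 32),   # (batch, channels, freq, time)
    AUDIO_MASK: torch.ones(2, 32),            # 1 where audio is real, 0 where padded
    IMAGE_FEATS: torch.randn(2, 64, 14, 14),  # (batch, channels, height, width)
}

sims = agg(preds, agg_heads=True)  # (2, 2) audio-by-image similarity matrix
```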
DenseAV/denseav/aligners.py DELETED
@@ -1,300 +0,0 @@
1
- from functools import partial
2
-
3
- import torch
4
- import torch.nn.functional as F
5
- from torch.nn import ModuleList
6
-
7
- from denseav.featurizers.DINO import Block
8
-
9
-
10
- class ChannelNorm(torch.nn.Module):
11
-
12
- def __init__(self, dim, *args, **kwargs):
13
- super().__init__(*args, **kwargs)
14
- self.norm = torch.nn.LayerNorm(dim, eps=1e-4)
15
-
16
- def forward_spatial(self, x):
17
- return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
18
-
19
- def forward(self, x, cls):
20
- return self.forward_spatial(x), self.forward_cls(cls)
21
-
22
- def forward_cls(self, cls):
23
- if cls is not None:
24
- return self.norm(cls)
25
- else:
26
- return None
27
-
28
-
29
- def id_conv(dim, strength=.9):
30
- conv = torch.nn.Conv2d(dim, dim, 1, padding="same")
31
- start_w = conv.weight.data
32
- conv.weight.data = torch.nn.Parameter(
33
- torch.eye(dim, device=start_w.device).unsqueeze(-1).unsqueeze(-1) * strength + start_w * (1 - strength))
34
- conv.bias.data = torch.nn.Parameter(conv.bias.data * (1 - strength))
35
- return conv
36
-
37
-
38
- class LinearAligner(torch.nn.Module):
39
- def __init__(self, in_dim, out_dim, use_norm=True):
40
- super().__init__()
41
- self.in_dim = in_dim
42
- self.out_dim = out_dim
43
- if use_norm:
44
- self.norm = ChannelNorm(in_dim)
45
- else:
46
- self.norm = Identity2()
47
-
48
- if in_dim == out_dim:
49
- self.layer = id_conv(in_dim, 0)
50
- else:
51
- self.layer = torch.nn.Conv2d(in_dim, out_dim, kernel_size=1, stride=1)
52
-
53
- self.cls_layer = torch.nn.Linear(in_dim, out_dim)
54
-
55
- def forward(self, spatial, cls):
56
- norm_spatial, norm_cls = self.norm(spatial, cls)
57
-
58
- if cls is not None:
59
- aligned_cls = self.cls_layer(cls)
60
- else:
61
- aligned_cls = None
62
-
63
- return self.layer(norm_spatial), aligned_cls
64
-
65
- class IdLinearAligner(torch.nn.Module):
66
- def __init__(self, in_dim, out_dim):
67
- super().__init__()
68
- self.in_dim = in_dim
69
- self.out_dim = out_dim
70
- assert self.out_dim == self.in_dim
71
- self.layer = id_conv(in_dim, 1.0)
72
- def forward(self, spatial, cls):
73
- return self.layer(spatial), cls
74
-
75
-
76
- class FrequencyAvg(torch.nn.Module):
77
- def __init__(self):
78
- super().__init__()
79
-
80
- def forward(self, spatial, cls):
81
- return spatial.mean(2, keepdim=True), cls
82
-
83
-
84
- class LearnedTimePool(torch.nn.Module):
85
- def __init__(self, dim, width, maxpool):
86
- super().__init__()
87
- self.dim = dim
88
- self.width = width
89
- self.norm = ChannelNorm(dim)
90
- if maxpool:
91
- self.layer = torch.nn.Sequential(
92
- torch.nn.Conv2d(dim, dim, kernel_size=width, stride=1, padding="same"),
93
- torch.nn.MaxPool2d(kernel_size=(1, width), stride=(1, width))
94
- )
95
- else:
96
- self.layer = torch.nn.Conv2d(dim, dim, kernel_size=(1, width), stride=(1, width))
97
-
98
- def forward(self, spatial, cls):
99
- norm_spatial, norm_cls = self.norm(spatial, cls)
100
- return self.layer(norm_spatial), norm_cls
101
-
102
-
103
- class LearnedTimePool2(torch.nn.Module):
104
- def __init__(self, in_dim, out_dim, width, maxpool, use_cls_layer):
105
- super().__init__()
106
- self.in_dim = in_dim
107
- self.out_dim = out_dim
108
- self.width = width
109
-
110
- if maxpool:
111
- self.layer = torch.nn.Sequential(
112
- torch.nn.Conv2d(in_dim, out_dim, kernel_size=width, stride=1, padding="same"),
113
- torch.nn.MaxPool2d(kernel_size=(1, width), stride=(1, width))
114
- )
115
- else:
116
- self.layer = torch.nn.Conv2d(in_dim, out_dim, kernel_size=(1, width), stride=(1, width))
117
-
118
- self.use_cls_layer = use_cls_layer
119
- if use_cls_layer:
120
- self.cls_layer = torch.nn.Linear(in_dim, out_dim)
121
-
122
- def forward(self, spatial, cls):
123
-
124
- if cls is not None:
125
- if self.use_cls_layer:
126
- aligned_cls = self.cls_layer(cls)
127
- else:
128
- aligned_cls = cls
129
- else:
130
- aligned_cls = None
131
-
132
- return self.layer(spatial), aligned_cls
133
-
134
-
135
- class Sequential2(torch.nn.Module):
136
-
137
- def __init__(self, *modules):
138
- super().__init__()
139
- self.mod_list = ModuleList(modules)
140
-
141
- def forward(self, x, y):
142
- results = (x, y)
143
- for m in self.mod_list:
144
- results = m(*results)
145
- return results
146
-
147
-
148
- class ProgressiveGrowing(torch.nn.Module):
149
-
150
- def __init__(self, stages, phase_lengths):
151
- super().__init__()
152
- self.stages = torch.nn.ModuleList(stages)
153
- self.phase_lengths = torch.tensor(phase_lengths)
154
- assert len(self.phase_lengths) + 1 == len(self.stages)
155
- self.phase_boundaries = self.phase_lengths.cumsum(0)
156
- self.register_buffer('phase', torch.tensor([1]))
157
-
158
- def maybe_change_phase(self, global_step):
159
- needed_phase = (global_step >= self.phase_boundaries).to(torch.int64).sum().item() + 1
160
- if needed_phase != self.phase.item():
161
- print(f"Changing aligner phase to {needed_phase}")
162
- self.phase.copy_(torch.tensor([needed_phase]).to(self.phase.device))
163
- return True
164
- else:
165
- return False
166
-
167
- def parameters(self, recurse: bool = True):
168
- phase = self.phase.item()
169
- used_stages = self.stages[:phase]
170
- print(f"Progressive Growing at stage {phase}")
171
- all_params = []
172
- for stage in used_stages:
173
- all_params.extend(stage.parameters(recurse))
174
- return iter(all_params)
175
-
176
- def forward(self, spatial, cls):
177
- pipeline = Sequential2(*self.stages[:self.phase.item()])
178
- return pipeline(spatial, cls)
179
-
180
-
181
- class Identity2(torch.nn.Module):
182
-
183
- def __init__(self):
184
- super().__init__()
185
-
186
- def forward(self, x, y):
187
- return x, y
188
-
189
-
190
- class SelfAttentionAligner(torch.nn.Module):
191
-
192
- def __init__(self, dim):
193
- super().__init__()
194
- self.dim = dim
195
-
196
- self.num_heads = 6
197
- if dim % self.num_heads != 0:
198
- self.padding = self.num_heads - (dim % self.num_heads)
199
- else:
200
- self.padding = 0
201
-
202
- self.block = Block(
203
- dim + self.padding,
204
- num_heads=self.num_heads,
205
- mlp_ratio=4,
206
- qkv_bias=True,
207
- qk_scale=None,
208
- drop=0.0,
209
- attn_drop=0.0,
210
- drop_path=0.0,
211
- norm_layer=partial(torch.nn.LayerNorm, eps=1e-4))
212
-
213
- def forward(self, spatial, cls):
214
- padded_feats = F.pad(spatial, [0, 0, 0, 0, self.padding, 0])
215
-
216
- B, C, H, W = padded_feats.shape
217
- proj_feats = padded_feats.reshape(B, C, H * W).permute(0, 2, 1)
218
-
219
- if cls is not None:
220
- assert len(cls.shape) == 2
221
- padded_cls = F.pad(cls, [self.padding, 0])
222
- proj_feats = torch.cat([padded_cls.unsqueeze(1), proj_feats], dim=1)
223
-
224
- aligned_feat, attn, qkv = self.block(proj_feats, return_qkv=True)
225
-
226
- if cls is not None:
227
- aligned_cls = aligned_feat[:, 0, :]
228
- aligned_spatial = aligned_feat[:, 1:, :]
229
- else:
230
- aligned_cls = None
231
- aligned_spatial = aligned_feat
232
-
233
- aligned_spatial = aligned_spatial.reshape(B, H, W, self.dim + self.padding).permute(0, 3, 1, 2)
234
-
235
- aligned_spatial = aligned_spatial[:, self.padding:, :, :]
236
- if aligned_cls is not None:
237
- aligned_cls = aligned_cls[:, self.padding:]
238
-
239
- return aligned_spatial, aligned_cls
240
-
241
-
242
- def get_aligner(aligner_type, in_dim, out_dim, **kwargs):
243
- if aligner_type is None:
244
- return Identity2()
245
-
246
- if "prog" in aligner_type:
247
- phase_length = kwargs["phase_length"]
248
-
249
- if aligner_type == "image_linear":
250
- return LinearAligner(in_dim, out_dim)
251
- elif aligner_type == "image_idlinear":
252
- return IdLinearAligner(in_dim, out_dim)
253
- elif aligner_type == "image_linear_no_norm":
254
- return LinearAligner(in_dim, out_dim, use_norm=False)
255
- elif aligner_type == "image_id":
256
- return Identity2()
257
- elif aligner_type == "image_norm":
258
- return ChannelNorm(in_dim)
259
- elif aligner_type == "audio_linear":
260
- return Sequential2(
261
- LinearAligner(in_dim, out_dim),
262
- FrequencyAvg())
263
- elif aligner_type == "audio_sa":
264
- return Sequential2(
265
- LinearAligner(in_dim, out_dim),
266
- FrequencyAvg(),
267
- SelfAttentionAligner(out_dim)
268
- )
269
- elif aligner_type == "audio_sa_sa":
270
- return Sequential2(
271
- FrequencyAvg(),
272
- LinearAligner(in_dim, out_dim),
273
- SelfAttentionAligner(out_dim),
274
- SelfAttentionAligner(out_dim)
275
- )
276
- elif aligner_type == "audio_3_3_pool":
277
- return Sequential2(
278
- LinearAligner(in_dim, out_dim),
279
- FrequencyAvg(),
280
- LearnedTimePool(out_dim, 3, False),
281
- LearnedTimePool(out_dim, 3, False),
282
- )
283
- elif aligner_type == "audio_sa_3_3_pool":
284
- return Sequential2(
285
- LinearAligner(in_dim, out_dim),
286
- FrequencyAvg(),
287
- LearnedTimePool(out_dim, 3, False),
288
- LearnedTimePool(out_dim, 3, False),
289
- SelfAttentionAligner(out_dim)
290
- )
291
- elif aligner_type == "audio_sa_3_3_pool_2":
292
- return Sequential2(
293
- FrequencyAvg(),
294
- ChannelNorm(in_dim),
295
- LearnedTimePool2(in_dim, out_dim, 3, False, True),
296
- LearnedTimePool2(out_dim, out_dim, 3, False, False),
297
- SelfAttentionAligner(out_dim)
298
- )
299
- else:
300
- raise ValueError(f"Unknown aligner type {aligner_type}")
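A hedged sketch of the factory above, building the audio aligner named in `configs/av_align.yaml` (`audio_sa_3_3_pool_2`); the feature dimensions are illustrative, not taken from the repository:

```python
import torch
from denseav.aligners import get_aligner

# Hypothetical dims: 768-dim backbone features projected into a 384-dim shared space
aligner = get_aligner("audio_sa_3_3_pool_2", in_dim=768, out_dim=384)

# (batch, channels, freq, time) audio features; the pipeline averages over frequency
# and pools time twice by a factor of 3 (45 -> 15 -> 5 steps here)
spatial = torch.randn(2, 768, 8, 45)
aligned_spatial, aligned_cls = aligner(spatial, None)  # the CLS token is optional
print(aligned_spatial.shape)  # torch.Size([2, 384, 1, 5])
```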
DenseAV/denseav/configs/av_align.yaml DELETED
@@ -1,125 +0,0 @@
1
- # Model args
2
-
3
- code_dim: 384
4
- image_model_type: "dino8"
5
- image_model_token_type: "token"
6
- image_aligner_type: "image_linear"
7
- image_pool_width: 2
8
-
9
- audio_model_type: "hubert"
10
- audio_aligner_type: "audio_sa_3_3_pool_2"
11
- audio_pool_width: 1
12
-
13
- learn_audio_cls: True
14
-
15
- #code_dim: 1024
16
- #image_model_type: "imagebind"
17
- #image_model_token_type: "token"
18
- #image_aligner_type: "image_linear"
19
- #image_pool_width: 1
20
- #
21
- #audio_model_type: "imagebind"
22
- #audio_aligner_type: "audio_sa"
23
- #audio_pool_width: 1
24
- #
25
- #learn_audio_cls: False
26
-
27
- audio_lora: False
28
- audio_lora_rank: 8
29
- image_lora: True
30
- image_lora_rank: 8
31
-
32
-
33
- spatial_dropout: 0.0
34
- channel_dropout: 0.0
35
-
36
- quad_mixup: 0.1
37
- bg_mixup: 0.0
38
- patch_mixup: 0.0
39
- mixup_weight: 0.1
40
-
41
- sim_agg_type: "misa"
42
- sim_agg_heads: 1
43
- sim_use_cls: False
44
-
45
- cal_init: 1.0
46
- cal_balance_weight: 0.1
47
- nonneg_sim: False
48
- nonneg_pressure: 0.01
49
- silence_l1: 0.01
50
- silence_l2: 0.0
51
- tv_weight: 0.01
52
- specialization_weight: 0.05
53
- head_agg: "max_elementwise"
54
- disentangle_weight: 0.0
55
-
56
- norm_vectors: False
57
-
58
- neg_audio: true
59
- neg_audio_weight: 0.01
60
-
61
-
62
- pretrain_steps: 3000
63
- pretrain_lr: .5e-4
64
-
65
- # Loss args
66
- lr: .5e-4
67
- lr_warmup: 1000
68
-
69
- #lr_warmup: 100
70
-
71
- lr_schedule: ~
72
- lr_cycle_length: 50000
73
-
74
- optimizer: "adam"
75
- gradient_clipping: 10.0
76
- adaptive_clipping: True
77
- gather_tensors: True
78
- loss_type: "nce"
79
- loss_leak: 0.0
80
- loss_margin: 0.0
81
- mask_silence: true
82
- extra_audio_masking: true
83
- max_steps: 1000001
84
-
85
- finetune_image_model: False
86
- finetune_audio_model: True
87
-
88
- # Checkpointing args
89
- load_strict: true
90
- starting_weights: ~
91
- auto_resume: false
92
- grouping_name: "foo"
93
- resume_prefix: "imagebind_exp2"
94
-
95
- # Data Args
96
- #dataset_name: "sample-audio"
97
- dataset_name: "places-audio"
98
- #dataset_name: "mixed"
99
- #dataset_name: "audio-set-full"
100
- use_extra_val_sets: true
101
- batch_size: 10
102
- load_size: 224
103
- image_aug: true
104
- audio_aug: false
105
-
106
- audio_level: false
107
-
108
- memory_buffer_size: 0
109
-
110
- val_check_interval: 10000 #0
111
- use_cached_embs: false
112
- num_workers: 12
113
- num_gpus: 4
114
- num_sanity_val_steps: 0 #-1
115
- seed: 0
116
-
117
- # Env args
118
- output_root: '../'
119
- pytorch_data_dir: '/pytorch-data/'
120
- submitting_to_aml: false
121
-
122
- hydra:
123
- run:
124
- dir: "."
125
- output_subdir: ~
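A small hedged sketch: since training is driven by Hydra (note the `hydra:` block above), the file can also be inspected directly with OmegaConf, which Hydra depends on; the path assumes the repository layout shown in this commit.

```python
from omegaconf import OmegaConf

# Load and inspect the training config
cfg = OmegaConf.load("denseav/configs/av_align.yaml")
print(cfg.image_model_type, cfg.audio_model_type, cfg.code_dim)  # dino8 hubert 384
```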
DenseAV/denseav/constants.py DELETED
@@ -1,12 +0,0 @@
1
-
2
- IMAGE_INPUT = "frames"
3
- IMAGE_FEATS = "image_feats"
4
- IMAGE_CLS = "image_cls"
5
- IMAGE_MASK = "image_masks"
6
-
7
- AUDIO_FEATS = "audio_feats"
8
- AUDIO_CLS = "audio_cls"
9
- AUDIO_MASK = "audio_mask"
10
- AUDIO_POS_MASK = "audio_pos_mask"
11
-
12
- DATA_SOURCE = "source"
DenseAV/denseav/data/AVDatasets.py DELETED
@@ -1,1249 +0,0 @@
1
- import glob
2
- import os
3
- from abc import ABC, abstractmethod
4
- from glob import glob
5
- from os.path import join
6
- from pathlib import Path
7
- from typing import List, Set
8
-
9
- import audioread
10
- import numpy as np
11
- import pandas as pd
12
- import pytorch_lightning as pl
13
- import torch
14
- import torch.nn.functional as F
15
- import torchaudio
16
- import torchvision.transforms as T
17
- from PIL import Image
18
- from torch.utils.data import Dataset, DataLoader, default_collate, Subset, ConcatDataset
19
- from tqdm import tqdm
20
-
21
- from denseav.constants import AUDIO_MASK, AUDIO_POS_MASK, IMAGE_MASK, IMAGE_INPUT
22
- from denseav.data.make_tarballs import untar_all
23
- from denseav.shared import norm, prep_waveform
24
-
25
-
26
- def sample_choice(choices, probs):
27
- # Check that probabilities sum to 1 and are non-negative
28
- assert sum(probs) == 1, "Probabilities must sum to 1"
29
- assert all(p >= 0 for p in probs), "Probabilities cannot be negative"
30
-
31
- # Convert probs to a tensor
32
- probs_tensor = torch.tensor(probs)
33
-
34
- # Sample a choice according to the probabilities
35
- index = torch.multinomial(probs_tensor, 1).item()
36
-
37
- # Return the sampled choice
38
- return choices[index]
39
-
40
-
41
- def grid_frames(frames):
42
- top_row = torch.cat([frames[0], frames[1]], dim=2)
43
- bottom_row = torch.cat([frames[2], frames[3]], dim=2)
44
- return torch.cat([top_row, bottom_row], dim=3)
45
-
46
-
47
- def create_mixed_image(pos_frame, neg_frame, patch_size):
48
- # Step 1: Check that patch_size evenly divides the image dimensions
49
- b, c, h, w = pos_frame.shape
50
- assert h % patch_size == 0 and w % patch_size == 0, "Patch size must evenly divide image dimensions"
51
-
52
- # Step 2: Create a random binary mask with the same number of patches as the image
53
- mask = torch.randint(0, 2, (b, 1, h // patch_size, w // patch_size))
54
-
55
- # Step 3: Create a new image using patches from pos_frame and neg_frame according to the mask
56
- # Upscale the mask to the size of the image
57
- mask_upscaled = F.interpolate(mask.to(torch.float32), scale_factor=patch_size)
58
-
59
- # Use the mask to create a mixed frame
60
- mixed_frame = mask_upscaled * pos_frame + (1 - mask_upscaled) * neg_frame
61
-
62
- return mixed_frame, mask_upscaled
63
-
64
-
65
- class AVDataset(ABC, Dataset):
66
-
67
- @abstractmethod
68
- def _dataset_folder(self) -> str:
69
- pass
70
-
71
- @abstractmethod
72
- def _load_info(self, split) -> pd.DataFrame:
73
- """
74
- This function should return a dataframe with at least a column "id"
75
- @return:
76
- """
77
- pass
78
-
79
- @abstractmethod
80
- def _missing_threshold(self) -> float:
81
- pass
82
-
83
- @abstractmethod
84
- def default_target_length(self) -> int:
85
- pass
86
-
87
- def target_length(self):
88
- if self.override_target_length is not None:
89
- return self.override_target_length
90
- else:
91
- return self.default_target_length()
92
-
93
- def _frame_root(self) -> str:
94
- return join(self.root, "frames", self.split)
95
-
96
- def _video_root(self) -> str:
97
- return join(self.root, "videos", self.split)
98
-
99
- def _audio_root(self) -> str:
100
- return join(self.root, "audio", self.split)
101
-
102
- def _semseg_root(self) -> str:
103
- return join(self.root, "annotations", self.split)
104
-
105
- def _embed_root(self) -> str:
106
- return join(self.root, "embedding", self.audio_embed_model, self.split)
107
-
108
- def _label_root(self) -> str:
109
- return join(self.root, "pseudo-labels")
110
-
111
- def _hn_root(self) -> str:
112
- return join(self.root, "hard_negatives")
113
-
114
- def _all_video_files(self) -> Set[str]:
115
- return set(str(p) for p in Path(join(self._video_root())).rglob('*'))
116
-
117
- def _all_frame_files(self) -> Set[str]:
118
- return set(str(p) for p in Path(join(self._frame_root())).rglob('*'))
119
-
120
- def _all_audio_files(self) -> Set[str]:
121
- return set(str(p) for p in Path(join(self._audio_root())).rglob('*'))
122
-
123
- def _all_embed_files(self) -> Set[str]:
124
- return set(str(p) for p in Path(join(self._embed_root())).rglob('*'))
125
-
126
- def _get_frame_files(self, row) -> List[str]:
127
- return [self._frame_root() + "/" + row["id"] + f"_{i}.jpg" for i in range(self._expected_num_frames())]
128
-
129
- def _get_semseg_file(self, row) -> str:
130
- raise NotImplementedError("Class has not implemented _get_semseg_files")
131
-
132
- def _get_audio_file(self, row) -> str:
133
- return self._audio_root() + "/" + row["id"] + ".mp3"
134
-
135
- def _get_video_file(self, row) -> str:
136
- return self._video_root() + "/" + row["id"] + ".mp4"
137
-
138
- def _get_embed_file(self, row) -> str:
139
- return self._embed_root() + "/" + row["id"] + ".npz"
140
-
141
- def _add_files_to_metadata(self, df) -> pd.DataFrame:
142
- tqdm.pandas()
143
-
144
- if self.use_audio_embed:
145
- df["embed_file"] = df.progress_apply(self._get_embed_file, axis=1)
146
-
147
- if self.use_audio or self.use_spec:
148
- df["audio_file"] = df.progress_apply(self._get_audio_file, axis=1)
149
-
150
- if self.use_frames:
151
- df["frame_files"] = df.progress_apply(self._get_frame_files, axis=1)
152
-
153
- if self.use_semseg:
154
- df["semseg_file"] = df.progress_apply(self._get_semseg_file, axis=1)
155
-
156
- df = self._filter_valid_metadata(df)
157
-
158
- if self.use_hn:
159
- loaded = np.load(join(self._hn_root(), "original", f"{self.split}_hard_negatives.npz"))
160
- df["hn0"] = [t for t in torch.tensor(loaded["indices_0"])]
161
- df["hn1"] = [t for t in torch.tensor(loaded["indices_1"])]
162
-
163
- return df
164
-
165
- def _split_name(self, split):
166
- return split
167
-
168
- def _filter_valid_metadata(self, df: pd.DataFrame) -> pd.DataFrame:
169
-
170
- print("MY_DIR ", list(glob(join(self.root, "*"))))
171
- if self.use_audio_embed:
172
- missing_embed_files = set(df['embed_file']) - self.all_embed_files
173
- valid_audio = ~df['embed_file'].isin(missing_embed_files)
174
- print("ALL EMBED ", len(self.all_embed_files))
175
- elif self.use_audio or self.use_spec:
176
- missing_audio_files = set(df['audio_file']) - self.all_audio_files
177
- valid_audio = ~df['audio_file'].isin(missing_audio_files)
178
- print("ALL AUDIO ", len(self.all_audio_files))
179
-
180
- if self.use_frames:
181
- missing_frame_files = set(
182
- item for sublist in df['frame_files'].tolist() for item in sublist) - self.all_frame_files
183
- valid_frames = df['frame_files'].apply(lambda x: not any(file in missing_frame_files for file in x))
184
- print("ALL FRAMES ", len(self.all_frame_files))
185
- df["is_valid"] = valid_audio & valid_frames
186
- else:
187
- df["is_valid"] = valid_audio
188
-
189
- percent_missing = (1 - (df["is_valid"].sum() / len(df)))
190
-
191
- assert percent_missing <= self._missing_threshold(), \
192
- f"Too many missing files: %{round(percent_missing * 100.0, 2)}"
193
- assert len(df) > 0, "No files found"
194
- return df[df["is_valid"]]
195
-
196
- def __init__(
197
- self,
198
- root: str,
199
- split: str = "train",
200
- use_frames=False,
201
- frame_transform=None,
202
- use_audio=False,
203
- use_spec=False,
204
- use_audio_embed=False,
205
- use_hn=False,
206
- use_caption=False,
207
- use_semseg=False,
208
- neg_audio=False,
209
- use_davenet_spec=False,
210
- use_fnac_spec=False,
211
- n_label_frames=196,
212
- label_transform=None,
213
- audio_embed_model="hubert",
214
- n_frames=1,
215
- audio_transform=None,
216
- audio_aug=False,
217
- spec_transform=None,
218
- spec_mel_bins=128,
219
- spec_mean=-6.6268077,
220
- spec_std=5.358466,
221
- sample_rate=16000,
222
- override_target_length=None,
223
- use_tags=False,
224
- extra_audio_masking=False,
225
- audio_level=False,
226
- quad_mixup=0.0,
227
- bg_mixup=0.0,
228
- patch_mixup=0.0,
229
- patch_size=8,
230
- ):
231
- super(AVDataset).__init__()
232
- self.pytorch_data_dir = root
233
- self.split = self._split_name(split)
234
- self.root = join(root, self._dataset_folder())
235
- self.use_frames = use_frames
236
- self.frame_transform = frame_transform
237
- self.use_audio = use_audio
238
- self.use_spec = use_spec
239
- self.use_audio_embed = use_audio_embed
240
- self.use_davenet_spec = use_davenet_spec
241
- self.use_fnac_spec = use_fnac_spec
242
- self.use_hn = use_hn
243
- self.use_caption = use_caption
244
- self.label_transform = label_transform
245
- self.audio_embed_model = audio_embed_model
246
- self.audio_aug = audio_aug
247
- self.n_frames = n_frames
248
- self.audio_transform = audio_transform
249
- self.spec_transform = spec_transform
250
- self.spec_mel_bins = spec_mel_bins
251
- self.spec_mean = spec_mean
252
- self.spec_std = spec_std
253
- self.use_semseg = use_semseg
254
- self.override_target_length = override_target_length
255
- self.use_tags = use_tags
256
- self.extra_audio_masking = extra_audio_masking
257
- self.neg_audio = neg_audio
258
- self.audio_level = audio_level
259
-
260
- self.quad_mixup = quad_mixup
261
- self.bg_mixup = bg_mixup
262
- self.patch_mixup = patch_mixup
263
- self.patch_size = patch_size
264
-
265
- self.sample_rate = sample_rate
266
- self.n_label_frames = n_label_frames
267
-
268
- if self.use_audio_embed:
269
- self.all_embed_files = self._all_embed_files()
270
-
271
- if self.use_audio or self.use_spec:
272
- self.all_audio_files = self._all_audio_files()
273
-
274
- if self.use_frames:
275
- self.all_frame_files = self._all_frame_files()
276
-
277
- self.metadata = self._add_files_to_metadata(self._load_info(self.split))
278
-
279
- assert len(self.metadata) > 0
280
-
281
- def __len__(self):
282
- return len(self.metadata)
283
-
284
- @abstractmethod
285
- def _expected_num_frames(self) -> int:
286
- pass
287
-
288
- def get_audio_mask(self, real_length, padded_length, target_size):
289
- if not isinstance(real_length, torch.Tensor):
290
- real_length = torch.tensor(real_length)
291
- padded_length = torch.tensor(padded_length)
292
-
293
- n_frames = ((real_length / padded_length) * target_size).to(torch.int64)
294
- oh = F.one_hot(n_frames, num_classes=target_size + 1)
295
- if len(oh.shape) == 1:
296
- oh = oh.unsqueeze(0)
297
- return (1 - torch.cumsum(oh, dim=1))[:, :-1].to(torch.bool)
298
-
299
- def _base_get_item(self, item):
300
- id = self.metadata["id"].iloc[item]
301
- data_dict = {"metadata": {"id": id, "index": item}}
302
-
303
- if self.use_tags and "tags" in self.metadata:
304
- tags = torch.tensor(self.metadata["tags"].iloc[item])
305
- tag_oh = torch.zeros(self.num_tags, dtype=torch.float32)
306
- tag_oh[tags] += 1
307
- data_dict["tags"] = tag_oh
308
-
309
- if self.use_audio or self.use_spec:
310
- audio_file = self.metadata["audio_file"].iloc[item]
311
- data_dict["metadata"]["audio_file"] = audio_file
312
- loaded_waveform, obs_sr = torchaudio.load(audio_file)
313
- loaded_waveform = loaded_waveform[0]
314
-
315
- if self.neg_audio:
316
- neg_audio_file = self.metadata["audio_file"].iloc[torch.randint(0, len(self), size=(1,)).item()]
317
- data_dict["metadata"]["neg_audio_file"] = neg_audio_file
318
- neg_waveform, neg_obs_sr = torchaudio.load(neg_audio_file)
319
- neg_waveform = neg_waveform[0]
320
- else:
321
- neg_waveform, neg_obs_sr = None, None
322
-
323
- (waveform,
324
- spectrogram,
325
- audio_length,
326
- total_length,
327
- original_length,
328
- mask,
329
- pos_mask) = prep_waveform(
330
- loaded_waveform,
331
- obs_sr,
332
- self.target_length(),
333
- self.spec_mel_bins,
334
- self.spec_mean,
335
- self.spec_std,
336
- self.sample_rate,
337
- self.use_spec,
338
- False,
339
- self.extra_audio_masking,
340
- neg_waveform,
341
- neg_obs_sr,
342
- self.audio_level,
343
- self.audio_aug
344
- )
345
-
346
- if self.spec_transform is not None and spectrogram is not None:
347
- spectrogram = self.spec_transform(spectrogram)
348
-
349
- if self.audio_transform is not None:
350
- waveform = self.audio_transform(waveform)
351
-
352
- data_dict["audio"] = waveform
353
- data_dict[AUDIO_MASK] = mask
354
- data_dict[AUDIO_POS_MASK] = pos_mask
355
- data_dict["audio_length"] = audio_length
356
- data_dict["original_length"] = original_length
357
- data_dict["total_length"] = total_length
358
- if spectrogram is not None:
359
- data_dict["spec"] = spectrogram
360
-
361
- if mask.mean() < .04:
362
- return None
363
-
364
- if self.use_davenet_spec:
365
- from data.DavenetUtilities import davenet_load_audio
366
- audio_file = self.metadata["audio_file"].iloc[item]
367
- spec, n_frames = davenet_load_audio(audio_file)
368
- data_dict["davenet_spec"] = spec
369
-
370
- if self.use_fnac_spec:
371
- from featurizers.FNACAVL import load_spectrogram as fnac_load_spectrogram
372
- audio_file = self.metadata["audio_file"].iloc[item]
373
- data_dict["fnac_spec"] = fnac_load_spectrogram(audio_file, 3)
374
-
375
- if self.use_audio_embed:
376
- loaded = np.load(self.metadata["embed_file"].iloc[item])
377
- data_dict["audio_emb"] = loaded["feat"]
378
- data_dict["audio_length"] = loaded["audio_length"]
379
- data_dict["total_length"] = loaded["total_length"]
380
- data_dict["original_length"] = loaded["original_length"]
381
- data_dict[AUDIO_MASK] = self.get_audio_mask(
382
- data_dict["audio_length"],
383
- data_dict["total_length"],
384
- data_dict["audio_emb"].shape[-1]) \
385
- .squeeze().to(torch.float32)
386
- data_dict[AUDIO_POS_MASK] = data_dict[AUDIO_MASK].to(torch.float32)
387
-
388
- if self.use_frames:
389
-
390
- def get_frames(item):
391
- file_group = self.metadata["frame_files"].iloc[item]
392
- if self.n_frames is not None:
393
- selected_frames = torch.randperm(len(file_group))[:self.n_frames]
394
- file_group = [file_group[i] for i in selected_frames]
395
- data_dict["metadata"]["frame_files"] = file_group
396
- images = [Image.open(file).convert("RGB") for file in file_group]
397
-
398
- if self.frame_transform is not None:
399
- images = torch.cat([self.frame_transform(img).unsqueeze(0) for img in images], dim=0)
400
-
401
- return images, file_group
402
-
403
- no_mixup = 1.0 - (self.bg_mixup + self.quad_mixup + self.patch_mixup)
404
-
405
- mixup_type = sample_choice(
406
- ["quad", "bg", "patch", None],
407
- [self.quad_mixup, self.bg_mixup, self.patch_mixup, no_mixup]
408
- )
409
-
410
- if mixup_type == "quad":
411
- indices = [item] + torch.randint(0, len(self), size=(3,)).numpy().tolist()
412
- frames_and_files = [get_frames(i) for i in indices]
413
- file_group = frames_and_files[0][1]
414
- perm = torch.randperm(4)
415
- all_frames = [F.interpolate(frames_and_files[i][0], scale_factor=0.5, mode="bilinear") for i in
416
- perm]
417
- b, c, h, w = all_frames[0].shape
418
- indices = [indices[p] for p in perm]
419
- masks = [(torch.ones(b, 1, h, w) if index == item else torch.zeros(b, 1, h, w)) for index in
420
- indices]
421
-
422
- data_dict[IMAGE_INPUT] = grid_frames(all_frames)
423
- data_dict[IMAGE_MASK] = grid_frames(masks)
424
- elif mixup_type == "bg":
425
- neg_item = torch.randint(0, len(self), size=(1,)).item()
426
- neg_frame, _ = get_frames(neg_item)
427
- pos_frame, file_group = get_frames(item)
428
-
429
- b, c, h, w = neg_frame.shape
430
- neg_mask = torch.zeros(b, 1, h, w)
431
- pos_mask = torch.ones(b, 1, h, w)
432
-
433
- if torch.rand(1).item() > 0.5:
434
- bg_frame = neg_frame
435
- bg_mask = neg_mask
436
- fg_frame = F.interpolate(pos_frame, scale_factor=0.5, mode="bilinear")
437
- fg_mask = F.interpolate(pos_mask, scale_factor=0.5, mode="bilinear")
438
- else:
439
- bg_frame = pos_frame
440
- bg_mask = pos_mask
441
- fg_frame = F.interpolate(neg_frame, scale_factor=0.5, mode="bilinear")
442
- fg_mask = F.interpolate(neg_mask, scale_factor=0.5, mode="bilinear")
443
-
444
- start_h = torch.randint(0, h // 2, size=(1,))
445
- start_w = torch.randint(0, w // 2, size=(1,))
446
- bg_frame[:, :, start_h:start_h + fg_frame.shape[2], start_w:start_w + fg_frame.shape[3]] = fg_frame
447
- bg_mask[:, :, start_h:start_h + fg_frame.shape[2], start_w:start_w + fg_frame.shape[3]] = fg_mask
448
-
449
- data_dict["frames"] = bg_frame
450
- data_dict["image_masks"] = bg_mask
451
-
452
- elif mixup_type == "patch":
453
- neg_item = torch.randint(0, len(self), size=(1,)).item()
454
- neg_frame, _ = get_frames(neg_item)
455
- pos_frame, file_group = get_frames(item)
456
- frames, masks = create_mixed_image(pos_frame, neg_frame, self.patch_size)
457
- data_dict["frames"] = frames
458
- data_dict["image_masks"] = masks
459
-
460
- elif mixup_type is None:
461
- frames, file_group = get_frames(item)
462
-
463
- data_dict["frames"] = frames
464
- b, c, h, w = frames.shape
465
- data_dict["image_masks"] = torch.ones(b, 1, h, w)
466
- else:
467
- raise ValueError(f"Unknown mixup type {mixup_type}")
468
-
469
- if "original_length" in data_dict:
470
- if self._expected_num_frames() == 1:
471
- frame_nums = torch.tensor([0])
472
- else:
473
- frame_nums = torch.tensor([
474
- int(f.split("/")[-1].split("_")[-1].split(".")[0]) for f in file_group])
475
-
476
- data_dict["frame_nums"] = frame_nums
477
- frame_fracs = ((frame_nums + .5) / (self._expected_num_frames()))
478
- frame_position = (frame_fracs * data_dict["original_length"]) / data_dict["total_length"]
479
- data_dict["frame_position"] = frame_position
480
-
481
- if self.use_caption:
482
- if "word" in self.metadata:
483
- words = self.metadata["word"].iloc[item]
484
- start = self.metadata["start"].iloc[item]
485
- end = self.metadata["end"].iloc[item]
486
- if isinstance(words, float):
487
- words = [""]
488
- start = [0.0]
489
- end = [-1.0]
490
-
491
- data_dict["caption"] = {
492
- "words": words,
493
- "start": start,
494
- "end": end,
495
- }
496
- if "text" in self.metadata:
497
- data_dict["text"] = self.metadata["text"].iloc[item]
498
-
499
- if self.use_semseg:
500
- semseg_path = join(self._semseg_root(), self.metadata["semseg_file"].iloc[item])
501
- semseg = Image.open(semseg_path)
502
- if self.label_transform is not None:
503
- semseg = np.array(self.label_transform(semseg))
504
- data_dict["semseg"] = semseg
505
- data_dict["metadata"]["semseg_file"] = semseg_path
506
-
507
- # if hasattr(self, "num_classes"):
508
- # data_dict["num_pixels_per_class"] = F.one_hot(
509
- # torch.tensor(semseg).to(torch.int64), self.num_classes() + 1).sum(dim=[0, 1])
510
-
511
- return data_dict
512
-
513
- def __getitem__(self, item):
514
- try:
515
- data_dict = self._base_get_item(item)
516
- if self.use_hn:
517
- indices = torch.cat([self.metadata["hn0"].iloc[item], self.metadata["hn1"].iloc[item]], dim=0)
518
- neg_index = indices[torch.randint(0, indices.shape[0], (1,))]
519
- negative_dict = self._base_get_item(neg_index)
520
- data_dict["negatives"] = negative_dict
521
- return data_dict
522
- except (audioread.exceptions.NoBackendError, EOFError) as e:
523
- # raise e
524
- bad_path = self.metadata["audio_file"].iloc[item]
525
- print(e)
526
- print(f"Removing bad audio file {bad_path}")
527
- # os.remove(bad_path)
528
- return None
529
- except ValueError as e:
530
- # raise e
531
- bad_path = self.metadata["audio_file"].iloc[item]
532
- if "Input signal length=0" in str(e):
533
- print(e)
534
- print(f"Removing bad file {bad_path} due to input signal length=0")
535
- # os.remove(bad_path)
536
- return None
537
- except OSError as e:
538
- # raise e
539
- bad_paths = self.metadata["frame_files"].iloc[item]
540
- for bad_path in bad_paths:
541
- print(e)
542
- print(f"Removing bad frame file {bad_path}")
543
- return None
544
- except RuntimeError as e:
545
- # raise e
546
- bad_path = self.metadata["audio_file"].iloc[item]
547
- print(e)
548
- print(f"Removing bad audio file {bad_path}")
549
- # os.remove(bad_path)
550
- return None
551
-
552
-
553
- class PlacesAudio(AVDataset):
554
-
555
- def _load_info(self, split) -> pd.DataFrame:
556
- df = pd.read_json(join(os.path.dirname(self._audio_root()), "metadata", f"{split}.json"))
557
- df["id"] = df["data"].apply(lambda d: d["wav"][5:-4])
558
-
559
- if self.use_caption:
560
- if split == "train":
561
- word_df = pd.read_json(
562
- join(os.path.dirname(self._audio_root()), "metadata", f"word-alignment-{split}.json")
563
- )
564
- else:
565
- word_df = pd.read_csv(
566
- join(os.path.dirname(self._audio_root()), "metadata", f"word-alignment-{split}.csv")) \
567
- .groupby("id").aggregate(lambda g: list(g)).reset_index().drop("Unnamed: 0", axis=1)
568
- df = pd.merge(df, word_df, on="id", how="outer")
569
- return df
570
-
571
- def _missing_threshold(self) -> float:
572
- # return 0.0
573
- return 0.97 # TODO fix
574
-
575
- def _expected_num_frames(self):
576
- return 1
577
-
578
- def default_target_length(self) -> int:
579
- return 20
580
-
581
- def _frame_root(self) -> str:
582
- return join(os.path.dirname(self.root), "places_subset")
583
-
584
- def _audio_root(self) -> str:
585
- return join(self.root, "wavs")
586
-
587
- def _embed_root(self) -> str:
588
- return join(self.root, "embedding", self.audio_embed_model)
589
-
590
- def _dataset_folder(self) -> str:
591
- return "PlacesAudio_400k_distro"
592
-
593
- def _get_audio_file(self, row) -> str:
594
- return join(self._audio_root(), row["id"] + ".wav")
595
-
596
- def _get_frame_files(self, row) -> List[str]:
597
- return [join(self._frame_root(), row["data"]["image"])]
598
-
599
- def _get_embed_file(self, row) -> str:
600
- return join(self._embed_root(), row["id"] + ".npz")
601
-
602
-
603
- class AudioSet(AVDataset):
604
- def _expected_num_frames(self):
605
- return 10
606
-
607
- def default_target_length(self) -> int:
608
- return 20
609
-
610
- def _dataset_folder(self) -> str:
611
- return "audioset-raw"
612
-
613
- def _missing_threshold(self) -> float:
614
- if self.split == "val" or self.split == "test":
615
- return 0.02
616
- else:
617
- return 0.17
618
-
619
- def train_seg_file(self):
620
- return "unbalanced_train_segments.csv"
621
-
622
- def _load_info(self, split) -> pd.DataFrame:
623
- if split == "train":
624
- df = pd.read_csv(join(self.root, "metadata", self.train_seg_file()))
625
- elif split == "val" or split == "test":
626
- df = pd.read_csv(join(self.root, "metadata", "eval_segments_subset.csv"))
627
- else:
628
- raise ValueError(f"Unknown split {split}")
629
-
630
- labels = pd.read_csv(join(self.root, "metadata", "class_labels_indices.csv"))
631
- mid_to_index = dict(zip(labels["mid"], labels["index"]))
632
- df["tags"] = df["positive_labels"].apply(lambda l: [mid_to_index[e] for e in l.strip('"').split(",")])
633
-
634
- self.num_tags = max(*[i for k, i in mid_to_index.items()]) + 1
635
- df["id"] = df.apply(lambda r: f"{r.YTID}_{r.start_seconds}_{r.end_seconds}", axis=1)
636
- return df
637
-
638
- def _frame_root(self) -> str:
639
- return join(self.root, "frames")
640
-
641
- def _audio_root(self) -> str:
642
- return join(self.root, "audio")
643
-
644
- def _all_frame_files(self) -> Set[str]:
645
- frame_files = set()
646
-
647
- for entry in os.scandir(self._frame_root()):
648
- if entry.is_file():
649
- frame_files.add(entry.path)
650
- elif entry.is_dir():
651
- for subentry in os.scandir(entry.path):
652
- if subentry.is_file():
653
- frame_files.add(subentry.path)
654
-
655
- return frame_files
656
-
657
- def _all_audio_files(self) -> Set[str]:
658
- return set(entry.path for entry in os.scandir(self._audio_root()) if entry.is_file())
659
-
660
- def _all_embed_files(self) -> Set[str]:
661
- return set(entry.path for entry in os.scandir(self._embed_root()) if entry.is_file())
662
-
663
- def _embed_root(self) -> str:
664
- return join(self.root, "embedding", self.audio_embed_model)
665
-
666
- def prefix(self):
667
- return ""
668
-
669
- def _get_audio_file(self, row) -> str:
670
- return f"{self.root}/audio/{self.prefix()}{row.id}.mp3"
671
-
672
- def _get_frame_files(self, row) -> List[str]:
673
- return [f"{self.root}/frames/frame_{fn}/{self.prefix()}{row.id}.jpg" for fn in range(10)]
674
-
675
- def _get_embed_file(self, row) -> str:
676
- return f"{self.root}/embedding/{self.audio_embed_model}/{self.prefix()}{row.id}.npz"
677
-
678
-
679
- class AudioSetEval(AudioSet):
680
-
681
- def _dataset_folder(self) -> str:
682
- return "audioset-eval"
683
-
684
- def _get_frame_files(self, row) -> List[str]:
685
- base_path = f"{self.root}/frames/{self.prefix()}{row.id}_"
686
- return [base_path + f"{fn}.jpg" for fn in range(10)]
687
-
688
- def prefix(self):
689
- return ""
690
-
691
-
692
- class ADE20K(AVDataset):
693
-
694
- def _split_name(self, split):
695
- if split == "val":
696
- return "validation"
697
- elif split == "train":
698
- return "training"
699
- else:
700
- raise ValueError(f"Unknown split name {split}")
701
-
702
- def _load_info(self, split) -> pd.DataFrame:
703
- df = pd.read_json(join(self.root, "metadata_with_caption_dedup.json"))
704
- df["id"] = df["image"]
705
- df = df[df["image"].apply(lambda f: f.split("/")[0] == split)]
706
-
707
- if self.use_caption:
708
- df["word"] = df["caption"].apply(lambda c: c["words"])
709
- df["start"] = df["caption"].apply(lambda c: c["start"])
710
- df["end"] = df["caption"].apply(lambda c: c["end"])
711
- df["text"] = df["word"].apply(lambda l: " ".join(l))
712
- return df
713
-
714
- def _missing_threshold(self) -> float:
715
- return 0.03
716
-
717
- def _expected_num_frames(self):
718
- return 1
719
-
720
- def default_target_length(self) -> int:
721
- return 20
722
-
723
- def _dataset_folder(self) -> str:
724
- return "ADE20K"
725
-
726
- def _frame_root(self) -> str:
727
- return join(self.root, "frames")
728
-
729
- def _audio_root(self) -> str:
730
- return join(self.root, "audio")
731
-
732
- def _semseg_root(self) -> str:
733
- return join(self.root, "annotations")
734
-
735
- def _embed_root(self) -> str:
736
- return join(self.root, "embedding", self.audio_embed_model)
737
-
738
- def _get_audio_file(self, row) -> str:
739
- return join(self._audio_root(), row["audio"])
740
-
741
- def _get_frame_files(self, row) -> List[str]:
742
- return [join(self._frame_root(), row["image"])]
743
-
744
- def _get_semseg_file(self, row) -> str:
745
- return join(self._semseg_root(), row["seg"])
746
-
747
- def _get_embed_file(self, row) -> str:
748
- return join(self._embed_root(), row["image"].replace(".jpg", ".npz"))
749
-
750
- def num_classes(self):
751
- return 3662
752
-
753
-
754
- class ADE20KPromptedBase(AVDataset):
755
-
756
- def _expected_num_frames(self):
757
- return 1
758
-
759
- def default_target_length(self) -> int:
760
- return 20
761
-
762
- def _frame_root(self) -> str:
763
- return join(self.root, "frames")
764
-
765
- def _audio_root(self) -> str:
766
- return join(self.root, "audio")
767
-
768
- def _semseg_root(self) -> str:
769
- return join(self.root, "annotations")
770
-
771
- def _embed_root(self) -> str:
772
- return join(self.root, "embedding", self.audio_embed_model)
773
-
774
- def _get_frame_files(self, row) -> List[str]:
775
- return [join(self._frame_root(), row["image_location"])]
776
-
777
- def _get_semseg_file(self, row) -> str:
778
- return join(self._semseg_root(), row["image_location"].replace(".jpg", "_seg.png"))
779
-
780
- def _get_embed_file(self, row) -> str:
781
- return join(self._embed_root(), row["image_location"].replace(".jpg", ".npz"))
782
-
783
- def num_classes(self):
784
- return 3662
785
-
786
- def _missing_threshold(self) -> float:
787
- return 0.0
788
-
789
-
790
- class ADE20KSpeechPrompted(ADE20KPromptedBase):
791
-
792
- def _get_audio_file(self, row) -> str:
793
- return join(self._audio_root(), row["speech_prompt_file"].split("/")[-1])
794
-
795
- def _dataset_folder(self) -> str:
796
- return "ADE20KSpeechPrompted"
797
-
798
- def _audio_root(self) -> str:
799
- # return join(self.root, "audio-noise-10") # TODO Remove
800
- return join(self.root, "audio") # TODO Remove
801
-
802
- def _load_info(self, split) -> pd.DataFrame:
803
- df = pd.read_csv(join(self.root, "prompted_segmentation.csv"))
804
- df = df[df["speech_prompt_file"].apply(lambda s: isinstance(s, str))]
805
- df = df[df["ade_class_id"].apply(lambda id: id != 0)]
806
- df["id"] = df["image_location"]
807
- return df
808
-
809
-
810
- class ADE20KSoundPrompted(ADE20KPromptedBase):
811
-
812
- def _get_audio_file(self, row) -> str:
813
- return join(self._audio_root(), row["vggsound_file"].split("/")[-1])
814
-
815
- def _dataset_folder(self) -> str:
816
- return "ADE20KSoundPrompted"
817
-
818
- def _load_info(self, split) -> pd.DataFrame:
819
- df = pd.read_csv(join(self.root, "prompted_segmentation.csv"))
820
- df = df[df["vggsound_file"].apply(lambda s: isinstance(s, str))]
821
- df = df[df["ade_class_id"].apply(lambda id: id != 0)]
822
- df["id"] = df["image_location"]
823
- return df
824
-
825
-
826
- class PlacesAndAudioSet(Dataset):
827
-
828
- def __init__(self, **kwargs):
829
- self.ds1 = PlacesAudio(**kwargs, n_frames=1)
830
- self.ds2 = AudioSet(**kwargs, n_frames=1)
831
-
832
- def __len__(self):
833
- return len(self.ds1)
834
-
835
- def __getitem__(self, item):
836
- if torch.rand(1).item() > .5:
837
- d = self.ds2[torch.randint(0, len(self.ds2) - 1, size=(1,)).item()]
838
- if d is not None:
839
- d["source"] = 1
840
- else:
841
- d = self.ds1[item]
842
- if d is not None:
843
- d["source"] = 0
844
- return d
845
-
846
-
847
- class AVDataModule(pl.LightningDataModule):
848
- def __init__(self,
849
- dataset_name,
850
- load_size,
851
- image_aug,
852
- audio_aug,
853
- extra_audio_masking,
854
- audio_model_type,
855
- pytorch_data_dir,
856
- use_cached_embs,
857
- batch_size,
858
- num_workers,
859
- audio_level,
860
- neg_audio,
861
- data_for_plotting,
862
- use_original_val_set,
863
- use_extra_val_sets,
864
- quad_mixup,
865
- bg_mixup,
866
- patch_mixup,
867
- patch_size,
868
- **kwargs):
869
-
870
- super().__init__()
871
- self.dataset_name = dataset_name
872
- self.load_size = load_size
873
- self.image_aug = image_aug
874
- self.audio_aug = audio_aug
875
- self.extra_audio_masking = extra_audio_masking
876
- self.audio_model_type = audio_model_type
877
- self.pytorch_data_dir = pytorch_data_dir
878
- self.use_cached_embs = use_cached_embs
879
- self.batch_size = batch_size
880
- self.num_workers = num_workers
881
- self.data_for_plotting = data_for_plotting
882
- self.audio_level = audio_level
883
- self.neg_audio = neg_audio
884
-
885
- self.quad_mixup = quad_mixup
886
- self.bg_mixup = bg_mixup
887
- self.patch_mixup = patch_mixup
888
- self.patch_size = patch_size
889
-
890
- self.loader_args = dict(
891
- num_workers=self.num_workers,
892
- batch_size=self.batch_size,
893
- )
894
- self.save_hyperparameters()
895
- self.extra_args = kwargs
896
-
897
- self.use_original_val_set = use_original_val_set
898
- self.use_extra_val_sets = use_extra_val_sets
899
-
900
- def maybe_unpack(self, remove_source):
901
- targets = [
902
- (
903
- join(self.pytorch_data_dir, "audioset-subset", "frame_archives"),
904
- join(self.pytorch_data_dir, "audioset-subset", "frames"),
905
- 1
906
- ),
907
- (
908
- join(self.pytorch_data_dir, "audioset-raw", "frame_archives"),
909
- join(self.pytorch_data_dir, "audioset-raw", "frames"),
910
- 4
911
- ),
912
- (
913
- join(self.pytorch_data_dir, "audioset-raw", "audio_archives"),
914
- join(self.pytorch_data_dir, "audioset-raw", "audio"),
915
- 1
916
- ),
917
-
918
- ]
919
-
920
- for (archive_dir, target_dir, n_parts) in targets:
921
- if not os.path.exists(target_dir) and os.path.exists(archive_dir):
922
- print(f"Could not find {target_dir}, attempting to unpack archives")
923
- if os.path.exists(archive_dir):
924
- untar_all(archive_dir, target_dir, remove_source)
925
- else:
926
- raise RuntimeError(f"Could not find archive folder: {archive_dir}")
927
-
928
- def get_dataset_by_name(self, name, stage, data_for_plotting, n_frames=None):
929
-
930
- if name == "vggss":
931
- resize_op = T.Resize((self.load_size, self.load_size), Image.BILINEAR)
932
- else:
933
- resize_op = T.Resize(self.load_size, Image.BILINEAR)
934
-
935
- img_transform = T.Compose([
936
- resize_op,
937
- T.CenterCrop(self.load_size),
938
- T.ToTensor(),
939
- norm])
940
-
941
- if self.image_aug:
942
- train_img_transform = T.Compose([
943
- T.RandomResizedCrop(self.load_size),
944
- T.RandomHorizontalFlip(),
945
- T.ColorJitter(.2, .2, .2, .2),
946
- T.RandomGrayscale(),
947
- T.ToTensor(),
948
- norm])
949
- val_img_transform = img_transform
950
- else:
951
- train_img_transform = img_transform
952
- val_img_transform = img_transform
953
-
954
- if self.audio_aug:
955
- train_audio_aug = True
956
- val_audio_aug = False
957
- else:
958
- train_audio_aug = False
959
- val_audio_aug = False
960
-
961
- if self.audio_model_type == "hubert":
962
- from featurizers.Hubert import HubertAudioTransform
963
- audio_transform = HubertAudioTransform()
964
- else:
965
- audio_transform = None
966
-
967
- if self.audio_model_type == "passt":
968
- sample_rate = 32000
969
- else:
970
- sample_rate = 16000
971
-
972
- if not self.use_cached_embs:
973
- if self.audio_model_type == "hubert":
974
- self.extra_args["use_audio"] = True
975
- elif self.audio_model_type in {"audiomae", "audiomae-finetuned", "cavmae", "cavmae-mixed", "imagebind"}:
976
- self.extra_args["use_spec"] = True
977
- elif self.audio_model_type == "davenet":
978
- self.extra_args["use_audio"] = True
979
- self.extra_args["use_davenet_spec"] = True
980
- elif self.audio_model_type == "fnac":
981
- self.extra_args["use_audio"] = True
982
- self.extra_args["use_fnac_spec"] = True
983
- else:
984
- raise ValueError(f"Unknown audio model type {self.audio_model_type}")
985
-
986
- if self.audio_model_type == "cavmae" or self.audio_model_type == "cavmae-mixed":
987
- self.extra_args["spec_mean"] = -5.081
988
- self.extra_args["spec_std"] = 4.4849
989
- elif self.audio_model_type == "imagebind":
990
- self.extra_args["spec_mean"] = -4.268
991
- self.extra_args["spec_std"] = 9.138
992
-
993
- # if self.audio_model_type in {"audiomae", "audiomae-finetune", "cavmae"} \
994
- # and "override_target_length" not in self.extra_args:
995
- if "override_target_length" not in self.extra_args:
996
- self.extra_args["override_target_length"] = 10
997
-
998
- data_args = dict(
999
- root=self.pytorch_data_dir,
1000
- use_frames=True,
1001
- audio_transform=audio_transform,
1002
- sample_rate=sample_rate,
1003
- audio_level=self.audio_level,
1004
- **self.extra_args
1005
- )
1006
-
1007
- if n_frames is not None:
1008
- data_args["n_frames"] = n_frames
1009
-
1010
- train_args = dict(
1011
- frame_transform=train_img_transform,
1012
- extra_audio_masking=self.extra_audio_masking,
1013
- neg_audio=self.neg_audio,
1014
- quad_mixup=self.quad_mixup,
1015
- bg_mixup=self.bg_mixup,
1016
- patch_mixup=self.patch_mixup,
1017
- patch_size=self.patch_size,
1018
- audio_aug=train_audio_aug
1019
- )
1020
- val_args = dict(
1021
- frame_transform=val_img_transform,
1022
- audio_aug=val_audio_aug
1023
- )
1024
-
1025
- if data_for_plotting:
1026
- val_args["use_audio"] = True
1027
- val_args["use_spec"] = True
1028
-
1029
- if "ade" in name:
1030
- label_transform = T.Compose([
1031
- T.Resize(self.load_size, Image.NEAREST),
1032
- T.CenterCrop(self.load_size),
1033
- prep_ade_label
1034
- ])
1035
- else:
1036
- label_transform = T.Compose([
1037
- T.Resize(self.load_size, Image.NEAREST),
1038
- T.CenterCrop(self.load_size)
1039
- ])
1040
-
1041
- val_args["use_audio"] = True
1042
- val_args["label_transform"] = label_transform
1043
-
1044
- if name == "places-audio":
1045
- dataset_constructor = PlacesAudio
1046
- elif name == "mixed-full":
1047
- dataset_constructor = PlacesAndAudioSet
1048
- elif name == "audio-set-full":
1049
- dataset_constructor = AudioSet
1050
- elif name == "audio-set-eval":
1051
- dataset_constructor = AudioSetEval
1052
- elif name == "ade":
1053
- val_args["use_semseg"] = True
1054
- dataset_constructor = ADE20K
1055
- elif name == "ade-speech-prompted":
1056
- val_args["use_semseg"] = True
1057
- dataset_constructor = ADE20KSpeechPrompted
1058
- elif name == "ade-sound-prompted":
1059
- val_args["use_semseg"] = True
1060
- dataset_constructor = ADE20KSoundPrompted
1061
- else:
1062
- raise ValueError(f"Unknown dataset name {name}")
1063
-
1064
- data_args["use_audio_embed"] = self.use_cached_embs
1065
- data_args["audio_embed_model"] = self.audio_model_type
1066
-
1067
- if stage == "full":
1068
- val_dataset = dataset_constructor(split="val", **{**data_args, **val_args})
1069
- train_dataset = dataset_constructor(split="train", **{**data_args, **val_args})
1070
- return ConcatDataset([train_dataset, val_dataset])
1071
- elif stage == "fit":
1072
- return dataset_constructor(split="train", **{**data_args, **train_args})
1073
- elif stage == "validate":
1074
- return dataset_constructor(split="val", **{**data_args, **val_args})
1075
- else:
1076
- raise ValueError(f"Unknown stage: {stage}")
1077
-
1078
- def _maybe_subset(self, dataset, length):
1079
- if len(dataset) > length and self.dataset_name not in {"ade-sound-prompted", "ade-speech-prompted", "vggss"}:
1080
- print("Using a subset of validation data")
1081
- return Subset(dataset, generate_subset(len(dataset), length))
1082
- else:
1083
- print("Not using val subset")
1084
- return dataset
1085
-
1086
- def _make_val_datasets(self):
1087
- val_sets = []
1088
- if self.use_original_val_set:
1089
- val_sets.append(self._maybe_subset(self.get_dataset_by_name(
1090
- self.dataset_name, "validate", self.data_for_plotting), 1000))
1091
-
1092
- if self.use_extra_val_sets:
1093
- val_sets.append(self._maybe_subset(self.get_dataset_by_name(
1094
- "places-audio", "validate", self.data_for_plotting), 1000))
1095
- val_sets.append(self._maybe_subset(self.get_dataset_by_name(
1096
- "audio-set-eval", "validate", False, n_frames=1), 1000))
1097
- val_sets.append(self.get_dataset_by_name(
1098
- "ade-speech-prompted", "validate", True))
1099
- val_sets.append(self.get_dataset_by_name(
1100
- "ade-sound-prompted", "validate", self.data_for_plotting))
1101
-
1102
- return val_sets
1103
-
1104
- def setup(self, stage: str):
1105
- if stage == "full":
1106
- self.full_dataset = self.get_dataset_by_name(self.dataset_name, stage, self.data_for_plotting)
1107
- elif stage == "fit":
1108
- self.train_dataset = self.get_dataset_by_name(self.dataset_name, stage, self.data_for_plotting)
1109
- self.val_datasets = self._make_val_datasets()
1110
- elif stage == "validate":
1111
- self.val_datasets = self._make_val_datasets()
1112
- else:
1113
- raise ValueError(f"Unknown stage: {stage}")
1114
-
1115
- def train_dataloader(self):
1116
- return DataLoader(self.train_dataset, shuffle=True, **self.loader_args, collate_fn=custom_coallate)
1117
-
1118
- def subsampled_train_dataloader(self, k=5000):
1119
- if len(self.train_dataset) > k:
1120
- ds = Subset(self.train_dataset, generate_subset(len(self.train_dataset), k))
1121
- else:
1122
- ds = self.train_dataset
1123
-
1124
- return DataLoader(ds, shuffle=True, **self.loader_args, collate_fn=custom_coallate)
1125
-
1126
- def val_dataloader(self):
1127
- return [
1128
- DataLoader(dataset, shuffle=False, **self.loader_args, collate_fn=custom_coallate)
1129
- for dataset in self.val_datasets
1130
- ]
1131
-
1132
- def full_dataloader(self):
1133
- return DataLoader(self.full_dataset, shuffle=False, **self.loader_args, collate_fn=custom_coallate)
1134
-
1135
-
1136
- def generate_subset(n, batch, seed=0):
1137
- np.random.seed(seed)
1138
- return np.random.permutation(n)[:batch]
1139
-
1140
-
1141
- def prep_ade_label(img):
1142
- seg = np.array(img)
1143
- class_labels = (seg[:, :, 0] / 10).astype(np.int32) * 256 + (seg[:, :, 1].astype(np.int32))
1144
- return class_labels
1145
-
1146
-
1147
- def maybe_replace(e, not_none):
1148
- if e is not None:
1149
- return e
1150
- else:
1151
- print("Warning found a None in the dataset indicitive of a loading failure, replacing it with another item")
1152
- return not_none[0]
1153
-
1154
-
1155
- empty_caption = {
1156
- "words": [],
1157
- "start": [],
1158
- "end": [],
1159
- }
1160
-
1161
-
1162
- def custom_coallate(l):
1163
- if l is None:
1164
- return l
1165
-
1166
- not_none = [e for e in l if e is not None]
1167
- assert len(not_none) > 0
1168
-
1169
- l = [maybe_replace(e, not_none) for e in l]
1170
-
1171
- to_merge = {}
1172
-
1173
- def pop_or_default(dict, k, default):
1174
- if k in dict:
1175
- return dict.pop(k)
1176
- else:
1177
- print(f"WARNING: Could not find {k}, using {default}")
1178
- return default
1179
-
1180
- if "caption" in l[0]:
1181
- to_merge["caption"] = [pop_or_default(l[i], "caption", empty_caption) for i in range(len(l))]
1182
-
1183
- if "text" in l[0]:
1184
- to_merge["text"] = [pop_or_default(l[i], "text", "") for i in range(len(l))]
1185
-
1186
- result = default_collate(l)
1187
-
1188
- return {**result, **to_merge}
1189
-
1190
-
1191
- if __name__ == "__main__":
1192
-
1193
- from featurizers.Hubert import HubertAudioTransform
1194
-
1195
- pytorch_data_dir = "/pytorch-data"
1196
- dataset_constructor = PlacesAudio
1197
- split = "val"
1198
-
1199
- img_transform = T.Compose([
1200
- T.Resize(224, Image.BILINEAR),
1201
- T.CenterCrop(224),
1202
- T.ToTensor(),
1203
- norm])
1204
-
1205
- video_transform = T.Compose([
1206
- T.Resize(224, Image.BILINEAR),
1207
- T.CenterCrop(224),
1208
- norm])
1209
-
1210
- label_transform = T.Compose([
1211
- T.Resize(224, Image.NEAREST),
1212
- T.CenterCrop(224)
1213
- ])
1214
-
1215
- audio_transform = HubertAudioTransform()
1216
-
1217
- data_args = dict(
1218
- root=pytorch_data_dir,
1219
- frame_transform=img_transform,
1220
- use_frames=True,
1221
- use_spec=True,
1222
- use_audio=True,
1223
- use_caption=False,
1224
- use_semseg=False,
1225
- label_transform=label_transform,
1226
- audio_transform=audio_transform,
1227
- use_audio_embed=False,
1228
- audio_embed_model="audiomae",
1229
- extra_audio_masking=False,
1230
- neg_audio=False,
1231
- override_target_length=10,
1232
- audio_level=False,
1233
- quad_mixup=.3,
1234
- patch_mixup=.3,
1235
- bg_mixup=.3,
1236
- )
1237
-
1238
-
1239
- def return_datasets(dataset_constructor, split):
1240
- dataset = dataset_constructor(split=split, **data_args)
1241
- return dataset
1242
-
1243
-
1244
- train_ds = return_datasets(dataset_constructor, split)
1245
-
1246
- print(len(train_ds))
1247
- train_loader = DataLoader(train_ds, batch_size=1, shuffle=False, num_workers=36, collate_fn=custom_coallate)
1248
- for batch in tqdm(train_loader):
1249
- pass
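
The deleted `AVDatasets.py` above guards against corrupted media by returning `None` from `__getitem__` and back-filling those slots in `custom_coallate` before calling `default_collate`. A minimal, self-contained sketch of that pattern (the toy dataset and helper names below are illustrative, not part of the repo):

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataloader import default_collate


class FlakyDataset(Dataset):
    # Toy stand-in: every third item "fails", mimicking a corrupted audio or frame file.
    def __len__(self):
        return 12

    def __getitem__(self, idx):
        if idx % 3 == 2:
            return None
        return {"audio": torch.randn(16000), "index": idx}


def collate_skip_none(items):
    # Keep the good items and reuse the first good one in place of failures,
    # so the batch size stays constant (same idea as custom_coallate above).
    good = [x for x in items if x is not None]
    assert len(good) > 0, "every item in the batch failed to load"
    items = [x if x is not None else good[0] for x in items]
    return default_collate(items)


loader = DataLoader(FlakyDataset(), batch_size=4, collate_fn=collate_skip_none)
for batch in loader:
    print(batch["index"])  # failed slots repeat the first good index in the batch
```
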
 
DenseAV/denseav/data/__init__.py DELETED
File without changes
DenseAV/denseav/data/make_tarballs.py DELETED
@@ -1,108 +0,0 @@
1
- import glob
2
- import os
3
- import tarfile
4
- from glob import glob
5
- from io import BytesIO
6
- from os.path import join
7
-
8
- from torch.utils.data import Dataset, DataLoader
9
- from tqdm import tqdm
10
- from pathlib import Path
11
-
12
- from denseav.shared import batch
13
-
14
- import tempfile
15
- import shutil
16
-
17
-
18
- class Tarballer(Dataset):
19
-
20
- def __init__(self, source, target, n):
21
- source_path = Path(source)
22
- self.frames = [f.relative_to(source_path) for f in source_path.rglob('*') if f.is_file()]
23
- assert (len(self.frames) > 0)
24
- self.source = source
25
- self.target_dir = target
26
- self.batched = list(batch(self.frames, n))
27
- os.makedirs(self.target_dir, exist_ok=True)
28
-
29
- def __len__(self):
30
- return len(self.batched)
31
-
32
- def __getitem__(self, item):
33
- with tarfile.open(join(self.target_dir, f"{item}.tar"), "w") as tar:
34
- for relpath in self.batched[item]:
35
- abs_path = os.path.join(self.source, str(relpath)) # Convert to string here
36
- with open(abs_path, "rb") as file:
37
- file_content = file.read()
38
- info = tarfile.TarInfo(name=str(relpath)) # Convert to string here
39
- info.size = len(file_content)
40
- tar.addfile(info, fileobj=BytesIO(file_content))
41
-
42
- return 0
43
-
44
-
45
- class UnTarballer:
46
-
47
- def __init__(self, archive_dir, target_dir, remove_source=False):
48
- self.tarballs = sorted(glob(join(archive_dir, "*.tar")))
49
- self.target_dir = target_dir
50
- self.remove_source = remove_source # New flag to determine if source tarball should be removed
51
- os.makedirs(self.target_dir, exist_ok=True)
52
-
53
- def __len__(self):
54
- return len(self.tarballs)
55
-
56
- def __getitem__(self, item):
57
- with tarfile.open(self.tarballs[item], "r") as tar:
58
- # Create a unique temporary directory inside the target directory
59
- with tempfile.TemporaryDirectory(dir=self.target_dir) as tmpdirname:
60
- tar.extractall(tmpdirname) # Extract to the temporary directory
61
-
62
- # Move contents from temporary directory to final target directory
63
- for src_dir, dirs, files in os.walk(tmpdirname):
64
- dst_dir = src_dir.replace(tmpdirname, self.target_dir, 1)
65
- os.makedirs(dst_dir, exist_ok=True)
66
- for file_ in files:
67
- src_file = os.path.join(src_dir, file_)
68
- dst_file = os.path.join(dst_dir, file_)
69
- shutil.move(src_file, dst_file)
70
-
71
- # Remove the source tarball if the flag is set to True
72
- if self.remove_source:
73
- os.remove(self.tarballs[item])
74
-
75
- return 0
76
-
77
- def untar_all(archive_dir, target_dir, remove_source):
78
- loader = DataLoader(UnTarballer(archive_dir, target_dir, remove_source), num_workers=24)
79
- for _ in tqdm(loader):
80
- pass
81
-
82
-
83
- if __name__ == "__main__":
84
- # loader = DataLoader(Tarballer(
85
- # join("/pytorch-data", "audioset-raw", "audio"),
86
- # join("/pytorch-data", "audioset-raw", "audio_archives")
87
- # ), num_workers=24)
88
-
89
- # loader = DataLoader(Tarballer(
90
- # join("/pytorch-data", "audioset-raw", "frames"),
91
- # join("/pytorch-data", "audioset-raw", "frame_archives"),
92
- # 5000
93
- # ), num_workers=24)
94
-
95
- # loader = DataLoader(Tarballer(
96
- # join("/pytorch-data", "ADE20KLabels"),
97
- # join("/pytorch-data", "ADE20KLabelsAr"),
98
- # 100
99
- # ), num_workers=24)
100
- #
101
- # for _ in tqdm(loader):
102
- # pass
103
- #
104
- # #
105
- #
106
- untar_all(
107
- join("/pytorch-data", "audioset-raw", "frame_archives"),
108
- join("/pytorch-data", "audioset-raw", "frames_4"))
 
DenseAV/denseav/eval_utils.py DELETED
@@ -1,135 +0,0 @@
1
- import json
2
- from collections import defaultdict
3
-
4
- import matplotlib.pyplot as plt
5
- import numpy as np
6
- import torch
7
- import torch.nn.functional as F
8
- from torchmetrics.functional.classification import binary_average_precision
9
- from tqdm import tqdm
10
-
11
- from constants import *
12
- from denseav.shared import unnorm, remove_axes
13
-
14
-
15
- def prep_heatmap(sims, masks, h, w):
16
- masks = masks.to(torch.float32)
17
- hm = torch.einsum("bhwt,bt->bhw", sims, masks) / masks.sum(-1).reshape(-1, 1, 1)
18
- hm -= hm.min()
19
- hm /= hm.max()
20
- return F.interpolate(hm.unsqueeze(1), (h, w), mode="bilinear").squeeze(1)
21
-
22
-
23
- def iou(prediction, target):
24
- prediction = prediction > 0.0
25
- target = target > 0.5
26
- intersection = torch.logical_and(prediction, target).sum().float()
27
- union = torch.logical_or(prediction, target).sum().float()
28
- if union == 0:
29
- return 1.0
30
- return (intersection / union).item() # Convert to Python scalar
31
-
32
-
33
- def multi_iou(prediction, target, k=20):
34
- prediction = torch.tensor(prediction)
35
- target = torch.tensor(target)
36
- target = target > 0.5
37
-
38
- thresholds = torch.linspace(prediction.min(), prediction.max(), k)
39
- hard_pred = prediction.unsqueeze(0) > thresholds.reshape(k, 1, 1, 1, 1)
40
- target = torch.broadcast_to(target.unsqueeze(0), hard_pred.shape)
41
-
42
- # Calculate IoU for each threshold
43
- intersection = torch.logical_and(hard_pred, target).sum(dim=(1, 2, 3, 4)).float()
44
- union = torch.logical_or(hard_pred, target).sum(dim=(1, 2, 3, 4)).float()
45
- union = torch.where(union == 0, torch.tensor(1.0), union) # Avoid division by zero
46
- iou_scores = intersection / union
47
-
48
- # Find the best IoU and corresponding threshold
49
- best_iou, best_idx = torch.max(iou_scores, dim=0)
50
- # best_threshold = thresholds[best_idx]
51
- # print(best_threshold)
52
- return best_iou # , best_threshold.item()
53
-
54
-
55
- def get_paired_heatmaps(
56
- model,
57
- results,
58
- class_ids,
59
- timing,
60
- class_names=None):
61
- sims = model.sim_agg.get_pairwise_sims(
62
- results,
63
- raw=False,
64
- agg_sim=False,
65
- agg_heads=True
66
- ).squeeze(1).mean(-2)
67
-
68
- prompt_classes = torch.tensor(list(class_ids))
69
- gt = results["semseg"] == prompt_classes.reshape(-1, 1, 1)
70
- basic_masks = results[AUDIO_MASK] # BxT
71
- _, fullh, fullw = gt.shape
72
- basic_heatmaps = prep_heatmap(sims, basic_masks, fullh, fullw)
73
-
74
- if timing is not None:
75
- prompt_timing = np.array(list(timing))
76
- raw_timing = torch.tensor([json.loads(t) for t in prompt_timing])
77
- timing = torch.clone(raw_timing)
78
- timing[:, 0] -= .2
79
- timing[:, 1] += .2
80
- total_length = (results['total_length'] / 16000)[0]
81
- fracs = timing / total_length
82
- bounds = basic_masks.shape[1] * fracs
83
- bounds[:, 0] = bounds[:, 0].floor()
84
- bounds[:, 1] = bounds[:, 1].ceil()
85
- bounds = bounds.to(torch.int64)
86
- advanced_masks = (F.one_hot(bounds, basic_masks.shape[1]).cumsum(-1).sum(-2) == 1).to(basic_masks)
87
- advanced_heatmaps = prep_heatmap(sims, advanced_masks, fullh, fullw)
88
-
89
- metrics = defaultdict(list)
90
- unique_classes = torch.unique(prompt_classes)
91
-
92
- should_plot = class_names is not None
93
-
94
- if should_plot:
95
- prompt_names = np.array(list(class_names))
96
-
97
- for prompt_class in tqdm(unique_classes):
98
- subset = torch.where(prompt_classes == prompt_class)[0]
99
- gt_subset = gt[subset]
100
- basic_subset = basic_heatmaps[subset]
101
- metrics["basic_ap"].append(binary_average_precision(basic_subset.flatten(), gt_subset.flatten()))
102
- metrics["basic_iou"].append(multi_iou(basic_subset.flatten(), gt_subset.flatten()))
103
-
104
- if timing is not None:
105
- advanced_subset = advanced_heatmaps[subset]
106
- metrics["advanced_ap"].append(binary_average_precision(advanced_subset.flatten(), gt_subset.flatten()))
107
- metrics["advanced_iou"].append(multi_iou(advanced_subset.flatten(), gt_subset.flatten()))
108
-
109
- if should_plot:
110
- prompt_class_subset = prompt_classes[subset]
111
- name_subset = prompt_names[subset]
112
- print(prompt_class, name_subset, prompt_class_subset)
113
- n_imgs = min(len(subset), 5)
114
- if n_imgs > 1:
115
- fig, axes = plt.subplots(n_imgs, 5, figsize=(4 * 5, n_imgs * 3))
116
- frame_subset = unnorm(results[IMAGE_INPUT][subset].squeeze(1)).permute(0, 2, 3, 1)
117
- semseg_subset = results["semseg"][subset]
118
- for img_num in range(n_imgs):
119
- axes[img_num, 0].imshow(frame_subset[img_num])
120
- axes[img_num, 1].imshow(basic_subset[img_num])
121
- axes[img_num, 2].imshow(advanced_subset[img_num])
122
- axes[img_num, 3].imshow(gt_subset[img_num])
123
- axes[img_num, 4].imshow(semseg_subset[img_num], cmap="tab20", interpolation='none')
124
-
125
- axes[0, 0].set_title("Image")
126
- class_name = name_subset[0].split(",")[0]
127
- axes[0, 1].set_title(f"{class_name} Basic Heatmap")
128
- axes[0, 2].set_title(f"{class_name} Advanced Heatmap")
129
- axes[0, 3].set_title("True Mask")
130
- axes[0, 4].set_title("True Seg")
131
- remove_axes(axes)
132
- plt.tight_layout()
133
- plt.show()
134
-
135
- return metrics, unique_classes
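
`multi_iou` in the deleted `eval_utils.py` above sweeps `k` thresholds over a soft heatmap and keeps the best IoU against the binary mask, which avoids committing to a single arbitrary cutoff. A small self-contained check of that idea on toy tensors (the helper name here is mine, not the repo's):

```python
import torch


def best_threshold_iou(pred, target, k=20):
    # Binarize the prediction at k thresholds between its min and max,
    # and report the best IoU against the ground-truth mask.
    target = target > 0.5
    thresholds = torch.linspace(pred.min().item(), pred.max().item(), k)
    best = torch.tensor(0.0)
    for t in thresholds:
        hard = pred > t
        inter = torch.logical_and(hard, target).sum().float()
        union = torch.logical_or(hard, target).sum().float().clamp(min=1.0)
        best = torch.maximum(best, inter / union)
    return best


pred = torch.rand(1, 32, 32)       # soft heatmap in [0, 1]
target = (pred > 0.6).float()      # a mask the heatmap can recover at the right cutoff
print(best_threshold_iou(pred, target))  # close to 1.0
```
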
 
DenseAV/denseav/evaluate.py DELETED
@@ -1,87 +0,0 @@
1
- from os.path import join
2
- import hydra
3
- from omegaconf import DictConfig, OmegaConf
4
- from pytorch_lightning import Trainer
5
- from pytorch_lightning import seed_everything
6
- from pytorch_lightning.loggers import TensorBoardLogger
7
- from denseav.data.AVDatasets import AVDataModule
8
- from denseav.shared import load_trained_model
9
-
10
-
11
- @hydra.main(config_path="configs", config_name="av_align.yaml")
12
- def my_app(cfg: DictConfig) -> None:
13
- from saved_models import saved_model_dict
14
-
15
- seed_everything(0)
16
- print(OmegaConf.to_yaml(cfg))
17
-
18
- models_to_eval = [
19
- "denseav_language",
20
- "denseav_sound",
21
- ]
22
-
23
- checkpoint_dir = "../checkpoints"
24
- saved_models = saved_model_dict(checkpoint_dir)
25
- for model_name in models_to_eval:
26
- model_info = saved_models[model_name]
27
- extra_data_args = model_info["data_args"] if "data_args" in model_info else {}
28
- model_info["extra_args"]["output_root"] = "../"
29
- model_info["extra_args"]["neg_audio"] = False
30
- model_info["extra_args"]["image_mixup"] = 0.0
31
-
32
- model = load_trained_model(join(checkpoint_dir, model_info["chkpt_name"]), model_info["extra_args"])
33
- model.set_full_train(True)
34
-
35
- if model.image_model_type == "dinov2":
36
- load_size = cfg.load_size * 2
37
- else:
38
- load_size = cfg.load_size
39
-
40
- if model.image_model_type == "davenet":
41
- batch_size = cfg.batch_size // 2
42
- elif model.image_model_type == "imagebind":
43
- batch_size = cfg.batch_size
44
- else:
45
- batch_size = cfg.batch_size
46
-
47
- print(load_size)
48
-
49
- data_args = dict(
50
- dataset_name=cfg.dataset_name,
51
- load_size=load_size,
52
- image_aug=cfg.image_aug,
53
- audio_aug=cfg.audio_aug,
54
- audio_model_type=model.audio_model_type,
55
- pytorch_data_dir=cfg.pytorch_data_dir,
56
- use_cached_embs=model.use_cached_embs,
57
- batch_size=batch_size,
58
- num_workers=cfg.num_workers,
59
- extra_audio_masking=False,
60
- use_original_val_set=False,
61
- use_extra_val_sets=True,
62
- use_caption=True,
63
- data_for_plotting=False,
64
- n_frames=None,
65
- audio_level=False,
66
- neg_audio=False,
67
- quad_mixup=0.0,
68
- bg_mixup=0.0,
69
- patch_mixup=0.0,
70
- patch_size=8,
71
- )
72
- data_args = {**data_args, **extra_data_args}
73
-
74
- datamodule = AVDataModule(**data_args)
75
- log_dir = join(cfg.output_root, "logs", "evaluate", model_name)
76
- print(log_dir)
77
- tb_logger = TensorBoardLogger(log_dir, default_hp_metric=False)
78
- trainer = Trainer(
79
- accelerator='gpu',
80
- strategy="ddp",
81
- devices=cfg.num_gpus,
82
- logger=tb_logger)
83
- trainer.validate(model, datamodule)
84
-
85
-
86
- if __name__ == "__main__":
87
- my_app()
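
Stripped of the checkpoint and config bookkeeping, each iteration of the loop in `evaluate.py` above reduces to a standard Lightning validation run. A minimal, self-contained sketch with a dummy module standing in for the trained DenseAV model (nothing below comes from the repo):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.loggers import TensorBoardLogger


class TinyModule(pl.LightningModule):
    # Stand-in for the trained model: just logs one validation metric.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def validation_step(self, batch, batch_idx):
        (x,) = batch
        self.log("val_loss", self.layer(x).abs().mean())


val_loader = DataLoader(TensorDataset(torch.randn(32, 4)), batch_size=8)
logger = TensorBoardLogger("logs/evaluate_demo", default_hp_metric=False)
trainer = pl.Trainer(accelerator="cpu", devices=1, logger=logger)
trainer.validate(TinyModule(), dataloaders=val_loader)
```
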
 
DenseAV/denseav/featurizers/AudioMAE.py DELETED
@@ -1,570 +0,0 @@
1
- import math
2
- import os
3
- import warnings
4
- from functools import partial
5
-
6
- import numpy as np
7
- import torch
8
- import torch.nn as nn
9
- import torch.nn.functional as F
10
- import torchaudio
11
- from timm.models.layers import to_2tuple
12
- from torch.utils.data import Dataset
13
- from torchaudio.functional import resample
14
- import pickle
15
-
16
-
17
- def _no_grad_trunc_normal_(tensor, mean, std, a, b):
18
- # Cut & paste from PyTorch official master until it's in a few official releases - RW
19
- # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
20
- def norm_cdf(x):
21
- # Computes standard normal cumulative distribution function
22
- return (1. + math.erf(x / math.sqrt(2.))) / 2.
23
-
24
- if (mean < a - 2 * std) or (mean > b + 2 * std):
25
- warnings.warn("mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
26
- "The distribution of values may be incorrect.",
27
- stacklevel=2)
28
-
29
- with torch.no_grad():
30
- # Values are generated by using a truncated uniform distribution and
31
- # then using the inverse CDF for the normal distribution.
32
- # Get upper and lower cdf values
33
- l = norm_cdf((a - mean) / std)
34
- u = norm_cdf((b - mean) / std)
35
-
36
- # Uniformly fill tensor with values from [l, u], then translate to
37
- # [2l-1, 2u-1].
38
- tensor.uniform_(2 * l - 1, 2 * u - 1)
39
-
40
- # Use inverse cdf transform for normal distribution to get truncated
41
- # standard normal
42
- tensor.erfinv_()
43
-
44
- # Transform to proper mean, std
45
- tensor.mul_(std * math.sqrt(2.))
46
- tensor.add_(mean)
47
-
48
- # Clamp to ensure it's in the proper range
49
- tensor.clamp_(min=a, max=b)
50
- return tensor
51
-
52
-
53
- def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):
54
- # type: (Tensor, float, float, float, float) -> Tensor
55
- r"""Fills the input Tensor with values drawn from a truncated
56
- normal distribution. The values are effectively drawn from the
57
- normal distribution :math:`\mathcal{N}(\text{mean}, \text{std}^2)`
58
- with values outside :math:`[a, b]` redrawn until they are within
59
- the bounds. The method used for generating the random values works
60
- best when :math:`a \leq \text{mean} \leq b`.
61
- Args:
62
- tensor: an n-dimensional `torch.Tensor`
63
- mean: the mean of the normal distribution
64
- std: the standard deviation of the normal distribution
65
- a: the minimum cutoff value
66
- b: the maximum cutoff value
67
- Examples:
68
- >>> w = torch.empty(3, 5)
69
- >>> nn.init.trunc_normal_(w)
70
- """
71
- return _no_grad_trunc_normal_(tensor, mean, std, a, b)
72
-
73
-
74
- class Mlp(nn.Module):
75
- def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
76
- super().__init__()
77
- out_features = out_features or in_features
78
- hidden_features = hidden_features or in_features
79
- self.fc1 = nn.Linear(in_features, hidden_features)
80
- self.act = act_layer()
81
- self.fc2 = nn.Linear(hidden_features, out_features)
82
- self.drop = nn.Dropout(drop)
83
-
84
- def forward(self, x):
85
- x = self.fc1(x)
86
- x = self.act(x)
87
- x = self.drop(x)
88
- x = self.fc2(x)
89
- x = self.drop(x)
90
- return x
91
-
92
-
93
- class Attention(nn.Module):
94
- def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
95
- super().__init__()
96
- self.num_heads = num_heads
97
- head_dim = dim // num_heads
98
- # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
99
- self.scale = qk_scale or head_dim ** -0.5
100
-
101
- self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
102
- self.attn_drop = nn.Dropout(attn_drop)
103
- self.proj = nn.Linear(dim, dim)
104
- self.proj_drop = nn.Dropout(proj_drop)
105
-
106
- def forward(self, x):
107
- B, N, C = x.shape
108
- qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
109
- q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
110
-
111
- attn = (q @ k.transpose(-2, -1)) * self.scale
112
- attn = attn.softmax(dim=-1)
113
- attn = self.attn_drop(attn)
114
-
115
- x = (attn @ v).transpose(1, 2).reshape(B, N, C)
116
- x = self.proj(x)
117
- x = self.proj_drop(x)
118
- return x
119
-
120
-
121
- def drop_path(x, drop_prob: float = 0., training: bool = False):
122
- """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
123
-
124
- This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
125
- the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
126
- See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
127
- changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
128
- 'survival rate' as the argument.
129
-
130
- """
131
- if drop_prob == 0. or not training:
132
- return x
133
- keep_prob = 1 - drop_prob
134
- shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
135
- random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
136
- random_tensor.floor_() # binarize
137
- output = x.div(keep_prob) * random_tensor
138
- return output
139
-
140
-
141
- class DropPath(nn.Module):
142
- """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
143
- """
144
-
145
- def __init__(self, drop_prob=None):
146
- super(DropPath, self).__init__()
147
- self.drop_prob = drop_prob
148
-
149
- def forward(self, x):
150
- return drop_path(x, self.drop_prob, self.training)
151
-
152
-
153
- class Block(nn.Module):
154
-
155
- def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
156
- drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
157
- super().__init__()
158
- self.norm1 = norm_layer(dim)
159
- self.attn = Attention(
160
- dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
161
- # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
162
- self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
163
- self.norm2 = norm_layer(dim)
164
- mlp_hidden_dim = int(dim * mlp_ratio)
165
- self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
166
-
167
- def forward(self, x):
168
- x = x + self.drop_path(self.attn(self.norm1(x)))
169
- x = x + self.drop_path(self.mlp(self.norm2(x)))
170
- return x
171
-
172
-
173
- class PatchEmbed(nn.Module):
174
- """ Image to Patch Embedding
175
- """
176
-
177
- def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
178
- super().__init__()
179
- img_size = to_2tuple(img_size)
180
- patch_size = to_2tuple(patch_size)
181
- num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0])
182
- self.patch_hw = (img_size[1] // patch_size[1], img_size[0] // patch_size[0])
183
- self.img_size = img_size
184
- self.patch_size = patch_size
185
- self.num_patches = num_patches
186
-
187
- self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
188
-
189
- def forward(self, x):
190
- B, C, H, W = x.shape
191
- # FIXME look at relaxing size constraints
192
- # assert H == self.img_size[0] and W == self.img_size[1], \
193
- # f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
194
- x = self.proj(x).flatten(2).transpose(1, 2)
195
- return x
196
-
197
-
198
- class HybridEmbed(nn.Module):
199
- """ CNN Feature Map Embedding
200
- Extract feature map from CNN, flatten, project to embedding dim.
201
- """
202
-
203
- def __init__(self, backbone, img_size=224, feature_size=None, in_chans=3, embed_dim=768):
204
- super().__init__()
205
- assert isinstance(backbone, nn.Module)
206
- img_size = to_2tuple(img_size)
207
- self.img_size = img_size
208
- self.backbone = backbone
209
- if feature_size is None:
210
- with torch.no_grad():
211
- # FIXME this is hacky, but most reliable way of determining the exact dim of the output feature
212
- # map for all networks, the feature metadata has reliable channel and stride info, but using
213
- # stride to calc feature dim requires info about padding of each stage that isn't captured.
214
- training = backbone.training
215
- if training:
216
- backbone.eval()
217
- o = self.backbone(torch.zeros(1, in_chans, img_size[0], img_size[1]))[-1]
218
- feature_size = o.shape[-2:]
219
- feature_dim = o.shape[1]
220
- backbone.train(training)
221
- else:
222
- feature_size = to_2tuple(feature_size)
223
- feature_dim = self.backbone.feature_info.channels()[-1]
224
- self.num_patches = feature_size[0] * feature_size[1]
225
- self.proj = nn.Linear(feature_dim, embed_dim)
226
-
227
- def forward(self, x):
228
- x = self.backbone(x)[-1]
229
- x = x.flatten(2).transpose(1, 2)
230
- x = self.proj(x)
231
- return x
232
-
233
-
234
- class TimmVisionTransformer(nn.Module):
235
- """ Vision Transformer with support for patch or hybrid CNN input stage
236
- """
237
-
238
- def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dim=768, depth=12,
239
- num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
240
- drop_path_rate=0., hybrid_backbone=None, norm_layer=nn.LayerNorm):
241
- super().__init__()
242
- self.num_classes = num_classes
243
- self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
244
-
245
- if hybrid_backbone is not None:
246
- self.patch_embed = HybridEmbed(
247
- hybrid_backbone, img_size=img_size, in_chans=in_chans, embed_dim=embed_dim)
248
- else:
249
- self.patch_embed = PatchEmbed(
250
- img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
251
- num_patches = self.patch_embed.num_patches
252
-
253
- self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
254
- self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
255
- self.pos_drop = nn.Dropout(p=drop_rate)
256
-
257
- dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
258
- self.blocks = nn.ModuleList([
259
- Block(
260
- dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
261
- drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer)
262
- for i in range(depth)])
263
- self.norm = norm_layer(embed_dim)
264
-
265
- # NOTE as per official impl, we could have a pre-logits representation dense layer + tanh here
266
- # self.repr = nn.Linear(embed_dim, representation_size)
267
- # self.repr_act = nn.Tanh()
268
-
269
- # Classifier head
270
- self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
271
-
272
- trunc_normal_(self.pos_embed, std=.02)
273
- trunc_normal_(self.cls_token, std=.02)
274
- self.apply(self._init_weights)
275
-
276
- def _init_weights(self, m):
277
- if isinstance(m, nn.Linear):
278
- trunc_normal_(m.weight, std=.02)
279
- if isinstance(m, nn.Linear) and m.bias is not None:
280
- nn.init.constant_(m.bias, 0)
281
- elif isinstance(m, nn.LayerNorm):
282
- nn.init.constant_(m.bias, 0)
283
- nn.init.constant_(m.weight, 1.0)
284
-
285
- @torch.jit.ignore
286
- def no_weight_decay(self):
287
- return {'pos_embed', 'cls_token'}
288
-
289
- def get_classifier(self):
290
- return self.head
291
-
292
- def reset_classifier(self, num_classes, global_pool=''):
293
- self.num_classes = num_classes
294
- self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
295
-
296
- def forward_features(self, x):
297
- B = x.shape[0]
298
- x = self.patch_embed(x)
299
-
300
- cls_tokens = self.cls_token.expand(B, -1, -1) # stole cls_tokens impl from Phil Wang, thanks
301
- x = torch.cat((cls_tokens, x), dim=1)
302
- x = x + self.pos_embed
303
- x = self.pos_drop(x)
304
-
305
- for blk in self.blocks:
306
- x = blk(x)
307
-
308
- x = self.norm(x)
309
- return x[:, 0]
310
-
311
- def forward(self, x):
312
- x = self.forward_features(x)
313
- x = self.head(x)
314
- return x
315
-
316
-
317
- class VisionTransformer(TimmVisionTransformer):
318
- """ Vision Transformer with support for global average pooling
319
- """
320
-
321
- def __init__(self, **kwargs):
322
- super(VisionTransformer, self).__init__(**kwargs)
323
- norm_layer = kwargs['norm_layer']
324
- embed_dim = kwargs['embed_dim']
325
- self.fc_norm = norm_layer(embed_dim)
326
- del self.norm # remove the original norm
327
-
328
- def interpolate_pos_encoding(self, x, embed):
329
- new_patches = x.shape[1]
330
- old_patches = embed.shape[1]
331
-
332
- w = 8
333
- h = int(new_patches / w)
334
- if new_patches == old_patches:
335
- return embed
336
-
337
- dim = x.shape[-1]
338
- pos_embed = nn.functional.interpolate(
339
- embed.reshape(1, 64, 8, dim).permute(0, 3, 1, 2),
340
- size=(h, w),
341
- mode='bicubic',
342
- )
343
- pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
344
- return pos_embed
345
-
346
- def forward(self, x):
347
- B = x.shape[0]
348
- x = self.patch_embed(x)
349
-
350
- x = x + self.interpolate_pos_encoding(x, self.pos_embed[:, 1:, :])
351
-
352
- cls_token = self.cls_token + self.pos_embed[:, :1, :]
353
- cls_tokens = cls_token.expand(B, -1, -1) # stole cls_tokens impl from Phil Wang, thanks
354
- x = torch.cat((cls_tokens, x), dim=1)
355
- x = self.pos_drop(x)
356
-
357
- for blk in self.blocks:
358
- x = blk(x)
359
-
360
- # x = x[:, 1:, :].mean(dim=1) # global pool without cls token
361
- # outcome = self.fc_norm(x)
362
-
363
- return x[:, 1:, :].reshape(B, -1, 8, 768).permute(0, 3, 2, 1), x[:, 0]
364
-
365
-
366
- class NewPatchEmbed(nn.Module):
367
- """ Flexible Image to Patch Embedding
368
- """
369
-
370
- def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, stride=10):
371
- super().__init__()
372
- img_size = to_2tuple(img_size)
373
- patch_size = to_2tuple(patch_size)
374
- stride = to_2tuple(stride)
375
- self.img_size = img_size
376
- self.patch_size = patch_size
377
- self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=stride) # with overlapped patches
378
- _, _, h, w = self.get_output_shape(img_size) # n, emb_dim, h, w
379
- self.patch_hw = (h, w)
380
- self.num_patches = h * w
381
-
382
- def get_output_shape(self, img_size):
383
- # todo: don't be lazy..
384
- return self.proj(torch.randn(1, 1, img_size[0], img_size[1])).shape
385
-
386
- def forward(self, x):
387
- x = self.proj(x)
388
- x = x.flatten(2).transpose(1, 2)
389
- return x
390
-
391
-
392
- def pca(image_feats_list, dim=3, fit_pca=None):
393
- from sklearn.decomposition import PCA
394
-
395
- device = image_feats_list[0].device
396
-
397
- def flatten(tensor, target_size=None):
398
- if target_size is not None and fit_pca is None:
399
- tensor = F.interpolate(tensor, (target_size, target_size), mode="bilinear")
400
- B, C, H, W = tensor.shape
401
- return tensor.permute(1, 0, 2, 3).reshape(C, B * H * W).permute(1, 0).detach().cpu()
402
-
403
- if len(image_feats_list) > 1 and fit_pca is None:
404
- target_size = image_feats_list[0].shape[2]
405
- else:
406
- target_size = None
407
-
408
- flattened_feats = []
409
- for feats in image_feats_list:
410
- flattened_feats.append(flatten(feats, target_size))
411
- x = torch.cat(flattened_feats, dim=0)
412
-
413
- if fit_pca is None:
414
- fit_pca = PCA(n_components=dim, svd_solver="arpack").fit(np.nan_to_num(x.detach().numpy()))
415
-
416
- reduced_feats = []
417
- for feats in image_feats_list:
418
- x_red = torch.from_numpy(fit_pca.transform(flatten(feats)))
419
- x_red -= x_red.min(dim=0, keepdim=True).values
420
- x_red /= x_red.max(dim=0, keepdim=True).values
421
- B, C, H, W = feats.shape
422
- reduced_feats.append(x_red.reshape(B, H, W, dim).permute(0, 3, 1, 2).to(device))
423
-
424
- return reduced_feats, fit_pca
425
-
426
-
427
- class AudiosetDataset(Dataset):
428
- def __init__(self, audio_conf):
429
- self.audio_conf = audio_conf
430
- self.melbins = self.audio_conf.get('num_mel_bins')
431
- self.dataset = self.audio_conf.get('dataset')
432
- self.norm_mean = self.audio_conf.get('mean')
433
- self.norm_std = self.audio_conf.get('std')
434
-
435
- print('Dataset: {}, mean {:.3f} and std {:.3f}'.format(self.dataset, self.norm_mean, self.norm_std))
436
- print(f'size of dataset {self.__len__()}')
437
-
438
- def _wav2fbank(self, filename):
439
- sample_rate = 16000
440
- target_length = 10
441
- waveform, obs_sr = torchaudio.load(filename)
442
- waveform = waveform[0]
443
- if obs_sr != sample_rate:
444
- waveform = resample(waveform, obs_sr, sample_rate)
445
-
446
- original_length = waveform.shape[0]
447
- padding = target_length * sample_rate - original_length
448
-
449
- if padding > 0:
450
- m = torch.nn.ZeroPad2d((0, padding))
451
- waveform = m(waveform)
452
- else:
453
- waveform = waveform[:target_length * sample_rate]
454
-
455
-
456
- waveform = waveform - waveform.mean()
457
-
458
- # 498 128, 998, 128
459
- fbank = torchaudio.compliance.kaldi.fbank(
460
- waveform.unsqueeze(0),
461
- htk_compat=True,
462
- sample_frequency=sample_rate,
463
- use_energy=False,
464
- window_type='hanning',
465
- num_mel_bins=128,
466
- dither=0.0,
467
- frame_shift=10)
468
-
469
- normed_fbank = (fbank - self.norm_mean) / (self.norm_std * 2)
470
-
471
- return normed_fbank
472
-
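For reference, the front end above boils down to: resample to 16 kHz, pad or crop to a fixed 10-second window, subtract the mean, and compute a 128-bin Kaldi-style log-mel filterbank with a 10 ms frame shift (roughly 998 frames for 10 s of audio, as the comment in the code notes). A self-contained sketch of the same pipeline; the file path is a placeholder:

```python
import torch
import torchaudio

waveform, sr = torchaudio.load("example.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform[0], sr, 16000)

# pad or crop to exactly 10 seconds of 16 kHz audio
target_samples = 10 * 16000
waveform = torch.nn.functional.pad(waveform, (0, max(0, target_samples - waveform.shape[0])))[:target_samples]
waveform = waveform - waveform.mean()

fbank = torchaudio.compliance.kaldi.fbank(
    waveform.unsqueeze(0), htk_compat=True, sample_frequency=16000,
    use_energy=False, window_type='hanning', num_mel_bins=128,
    dither=0.0, frame_shift=10)
print(fbank.shape)  # roughly (998, 128): time frames x mel bins
```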
473
- def __getitem__(self, index):
474
- datum = {"wav": "../../samples/example.wav"}
475
- fbank = self._wav2fbank(datum['wav'])
476
- fbank = fbank.transpose(0, 1).unsqueeze(0) # 1, 128, 1024 (...,freq,time)
477
- fbank = torch.transpose(fbank.squeeze(), 0, 1) # time, freq
478
- # the output fbank shape is [time_frame_num, frequency_bins], e.g., [1024, 128]
479
- return fbank.unsqueeze(0)
480
-
481
- def __len__(self):
482
- return 1
483
-
484
-
485
- class AudioMAE(nn.Module):
486
-
487
- def __init__(self, output_path, finetuned):
488
- super().__init__()
489
- # build model
490
- model = VisionTransformer(
491
- patch_size=16,
492
- embed_dim=768,
493
- depth=12,
494
- num_heads=12,
495
- mlp_ratio=4,
496
- qkv_bias=True,
497
- norm_layer=partial(nn.LayerNorm, eps=1e-6),
498
- num_classes=527,
499
- drop_path_rate=0.1)
500
-
501
- img_size = (1024, 128) # 1024, 128
502
- emb_dim = 768
503
- model.patch_embed = NewPatchEmbed(
504
- img_size=img_size, patch_size=(16, 16), in_chans=1, embed_dim=emb_dim, stride=16)
505
- num_patches = model.patch_embed.num_patches
506
- model.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, emb_dim), requires_grad=False)
507
-
508
- if finetuned:
509
- fn = "audiomae_finetuned.pth"
510
- else:
511
- fn = "audiomae.pth"
512
-
513
- checkpoint = torch.load(os.path.join(output_path, 'models', fn), map_location='cpu')
514
-
515
- checkpoint_model = checkpoint['model']
516
- state_dict = model.state_dict()
517
- for k in ['head.weight', 'head.bias']:
518
- if k in checkpoint_model and checkpoint_model[k].shape != state_dict[k].shape:
519
- print(f"Removing key {k} from pretrained checkpoint")
520
- del checkpoint_model[k]
521
- msg = model.load_state_dict(checkpoint_model, strict=False)
522
- print(msg)
523
-
524
- model = model.eval()
525
- self.model = model
526
- self.config = dict(output_path=output_path, finetuned=finetuned)
527
-
528
- def forward(self, audio, include_cls):
529
- patch_tokens, cls_token = self.model(audio)
530
-
531
- if include_cls:
532
- return patch_tokens, cls_token
533
- else:
534
- return patch_tokens
535
-
536
-
537
- if __name__ == '__main__':
538
- import os
539
-
540
- device = torch.device("cuda:2")
541
-
542
- torch.manual_seed(0)
543
- np.random.seed(0)
544
-
545
- model = AudioMAE("../../", True).to(device)
546
-
547
- audio_conf_val = {
548
- 'num_mel_bins': 128,
549
- 'target_length': 1024,
550
- 'dataset': "audioset",
551
- 'mode': 'val',
552
- 'mean': -4.2677393,
553
- 'std': 4.5689974,
554
- }
555
-
556
- dataset = AudiosetDataset(audio_conf=audio_conf_val)
557
-
558
- batch = dataset[0].unsqueeze(0).to(device)
559
-
560
- embeddings = model(batch, include_cls=False)
561
-
562
- import matplotlib.pyplot as plt
563
-
564
- with torch.no_grad():
565
- [pca_feats], _ = pca([embeddings])
566
- plt.imshow(pca_feats.cpu().squeeze(0).permute(1, 2, 0))
567
- plt.show()
568
- print("here")
569
-
570
- print("here")
 
DenseAV/denseav/featurizers/CAVMAE.py DELETED
@@ -1,1082 +0,0 @@
1
- import random
2
-
3
- import numpy as np
4
- import timm
5
- import torch
6
- import torch.nn as nn
7
- import torch.nn.functional as F
8
- import torchaudio
9
- import torchvision.transforms as T
10
- from PIL import Image
11
- from timm.models.layers import to_2tuple, DropPath
12
- from timm.models.vision_transformer import Mlp, PatchEmbed, Block
13
- import os
14
-
15
-
16
- class Attention(nn.Module):
17
- def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
18
- super().__init__()
19
- self.num_heads = num_heads
20
- head_dim = dim // num_heads
21
- # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
22
- self.scale = qk_scale or head_dim ** -0.5
23
-
24
- self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
25
- self.attn_drop = nn.Dropout(attn_drop)
26
- self.proj = nn.Linear(dim, dim)
27
- self.proj_drop = nn.Dropout(proj_drop)
28
-
29
- def forward(self, x):
30
- B, N, C = x.shape
31
- qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
32
- q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
33
-
34
- attn = (q @ k.transpose(-2, -1)) * self.scale
35
- attn = attn.softmax(dim=-1)
36
- attn = self.attn_drop(attn)
37
-
38
- x = (attn @ v).transpose(1, 2).reshape(B, N, C)
39
- x = self.proj(x)
40
- x = self.proj_drop(x)
41
- return x
42
-
43
-
44
- def get_2d_sincos_pos_embed(embed_dim, grid_h_size, grid_w_size, cls_token=False):
45
- """
46
- grid_size: int of the grid height and width
47
- return:
48
- pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
49
- """
50
- grid_h = np.arange(grid_h_size, dtype=float)
51
- grid_w = np.arange(grid_w_size, dtype=float)
52
- grid = np.meshgrid(grid_w, grid_h) # here w goes first
53
- grid = np.stack(grid, axis=0)
54
-
55
- grid = grid.reshape([2, 1, grid_w_size, grid_h_size])
56
- pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
57
- if cls_token:
58
- pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)
59
- return pos_embed
60
-
61
-
62
- def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
63
- assert embed_dim % 2 == 0
64
-
65
- # use half of dimensions to encode grid_h
66
- emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
67
- emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
68
-
69
- emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
70
- return emb
71
-
72
-
73
- def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
74
- """
75
- embed_dim: output dimension for each position
76
- pos: a list of positions to be encoded: size (M,)
77
- out: (M, D)
78
- """
79
- assert embed_dim % 2 == 0
80
- omega = np.arange(embed_dim // 2, dtype=float)
81
- omega /= embed_dim / 2.
82
- omega = 1. / 10000 ** omega # (D/2,)
83
-
84
- pos = pos.reshape(-1) # (M,)
85
- out = np.einsum('m,d->md', pos, omega) # (M, D/2), outer product
86
-
87
- emb_sin = np.sin(out) # (M, D/2)
88
- emb_cos = np.cos(out) # (M, D/2)
89
-
90
- emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
91
- return emb
92
-
93
-
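The three helpers above construct the fixed (non-learnable) 2-D sin-cos positional embedding used for both the audio and visual patch grids. A minimal shape check, using the 8 x 64 grid that the audio branch uses for 1024-frame spectrograms (grid dimensions taken from initialize_weights below):

```python
import numpy as np

pos = get_2d_sincos_pos_embed(embed_dim=768, grid_h_size=8, grid_w_size=64, cls_token=False)
print(pos.shape)  # (512, 768): one deterministic row per audio patch

pos_cls = get_2d_sincos_pos_embed(embed_dim=768, grid_h_size=8, grid_w_size=64, cls_token=True)
print(pos_cls.shape)  # (513, 768): a zero row is prepended for the cls token
```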
94
- # --------------------------------------------------------
95
- # Interpolate position embeddings for high-resolution
96
- # References:
97
- # DeiT: https://github.com/facebookresearch/deit
98
- # --------------------------------------------------------
99
- def interpolate_pos_embed(model, checkpoint_model):
100
- if 'pos_embed' in checkpoint_model:
101
- pos_embed_checkpoint = checkpoint_model['pos_embed']
102
- embedding_size = pos_embed_checkpoint.shape[-1]
103
- num_patches = model.patch_embed.num_patches
104
- num_extra_tokens = model.pos_embed.shape[-2] - num_patches
105
- # height (== width) for the checkpoint position embedding
106
- orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)
107
- # height (== width) for the new position embedding
108
- new_size = int(num_patches ** 0.5)
109
- # class_token and dist_token are kept unchanged
110
- if orig_size != new_size:
111
- print("Position interpolate from %dx%d to %dx%d" % (orig_size, orig_size, new_size, new_size))
112
- extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
113
- # only the position tokens are interpolated
114
- pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
115
- pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
116
- pos_tokens = torch.nn.functional.interpolate(
117
- pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
118
- pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
119
- new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
120
- checkpoint_model['pos_embed'] = new_pos_embed
121
-
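interpolate_pos_embed follows the DeiT recipe referenced above: the positional tokens (everything after the extra cls/dist tokens) are reshaped back to their original square grid, bicubically resized to the grid implied by the model's patch count, and written back into the checkpoint before loading. A hedged usage sketch; the checkpoint path and the `model` instance are placeholders:

```python
import torch

checkpoint = torch.load("pretrained_vit.pth", map_location="cpu")  # placeholder path
checkpoint_model = checkpoint["model"]
interpolate_pos_embed(model, checkpoint_model)  # resizes checkpoint_model['pos_embed'] in place if the grids differ
msg = model.load_state_dict(checkpoint_model, strict=False)
print(msg)
```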
122
-
123
- class PatchEmbed(nn.Module):
124
- def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
125
- super().__init__()
126
-
127
- img_size = to_2tuple(img_size)
128
- patch_size = to_2tuple(patch_size)
129
- num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0])
130
- self.img_size = img_size
131
- self.patch_size = patch_size
132
- self.num_patches = num_patches
133
-
134
- self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
135
-
136
- def forward(self, x):
137
- x = self.proj(x).flatten(2).transpose(1, 2)
138
- return x
139
-
140
-
141
- class Block(nn.Module):
142
- def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
143
- drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
144
- super().__init__()
145
- self.norm1 = norm_layer(dim)
146
- self.norm1_a = norm_layer(dim)
147
- self.norm1_v = norm_layer(dim)
148
- self.attn = Attention(
149
- dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
150
- # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
151
- self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
152
- self.norm2 = norm_layer(dim)
153
- self.norm2_a = norm_layer(dim)
154
- self.norm2_v = norm_layer(dim)
155
- mlp_hidden_dim = int(dim * mlp_ratio)
156
- self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
157
-
158
- def forward(self, x, modality=None):
159
- if modality == None:
160
- x = x + self.drop_path(self.attn(self.norm1(x)))
161
- x = x + self.drop_path(self.mlp(self.norm2(x)))
162
- elif modality == 'a':
163
- x = x + self.drop_path(self.attn(self.norm1_a(x)))
164
- x = x + self.drop_path(self.mlp(self.norm2_a(x)))
165
- elif modality == 'v':
166
- x = x + self.drop_path(self.attn(self.norm1_v(x)))
167
- x = x + self.drop_path(self.mlp(self.norm2_v(x)))
168
- return x
169
-
170
-
171
- # our main proposed model, for pretraining only, for finetuning, use CAVMAEFT class
172
- class CAVMAE(nn.Module):
173
- """ CAV-MAE Model
174
- """
175
-
176
- def __init__(self, img_size=224, audio_length=1024, patch_size=16, in_chans=3,
177
- embed_dim=768, modality_specific_depth=11, num_heads=12,
178
- decoder_embed_dim=512, decoder_depth=8, decoder_num_heads=16,
179
- mlp_ratio=4., norm_layer=nn.LayerNorm, norm_pix_loss=False, tr_pos=False):
180
- super().__init__()
181
- print('A CAV-MAE Model')
182
- print('Use norm_pix_loss: ', norm_pix_loss)
183
- print('Learnable Positional Embedding: ', tr_pos)
184
-
185
- # the encoder part
186
- # override the timm package
187
- timm.models.vision_transformer.PatchEmbed = PatchEmbed
188
- timm.models.vision_transformer.Block = Block
189
-
190
- self.patch_embed_a = PatchEmbed(img_size, patch_size, 1, embed_dim)
191
- self.patch_embed_v = PatchEmbed(img_size, patch_size, in_chans, embed_dim)
192
-
193
- self.patch_embed_a.num_patches = int(audio_length * 128 / 256)
194
- print('Number of Audio Patches: {:d}, Visual Patches: {:d}'.format(self.patch_embed_a.num_patches,
195
- self.patch_embed_v.num_patches))
196
-
197
- self.modality_a = nn.Parameter(torch.zeros(1, 1, embed_dim))
198
- self.modality_v = nn.Parameter(torch.zeros(1, 1, embed_dim))
199
-
200
- self.pos_embed_a = nn.Parameter(torch.zeros(1, self.patch_embed_a.num_patches, embed_dim),
201
- requires_grad=tr_pos) # fixed sin-cos embedding
202
- self.pos_embed_v = nn.Parameter(torch.zeros(1, self.patch_embed_v.num_patches, embed_dim),
203
- requires_grad=tr_pos) # fixed sin-cos embedding
204
-
205
- # audio-branch
206
- self.blocks_a = nn.ModuleList(
207
- [Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in
208
- range(modality_specific_depth)])
209
- # visual-branch
210
- self.blocks_v = nn.ModuleList(
211
- [Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in
212
- range(modality_specific_depth)])
213
- # unified branch
214
- self.blocks_u = nn.ModuleList(
215
- [Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in
216
- range(12 - modality_specific_depth)])
217
-
218
- # independent normalization layer for audio, visual, and audio-visual
219
- self.norm_a, self.norm_v, self.norm = norm_layer(embed_dim), norm_layer(embed_dim), norm_layer(embed_dim)
220
-
221
- # the decoder part
222
- # Project to lower dimension for the decoder
223
- self.decoder_embed = nn.Linear(embed_dim, decoder_embed_dim, bias=True)
224
-
225
- # token used for masking
226
- self.mask_token = nn.Parameter(torch.zeros(1, 1, decoder_embed_dim))
227
-
228
- self.decoder_modality_a = nn.Parameter(torch.zeros(1, 1, decoder_embed_dim))
229
- self.decoder_modality_v = nn.Parameter(torch.zeros(1, 1, decoder_embed_dim))
230
-
231
- self.decoder_pos_embed_a = nn.Parameter(torch.zeros(1, self.patch_embed_a.num_patches, decoder_embed_dim),
232
- requires_grad=tr_pos) # fixed sin-cos embedding
233
- self.decoder_pos_embed_v = nn.Parameter(torch.zeros(1, self.patch_embed_v.num_patches, decoder_embed_dim),
234
- requires_grad=tr_pos) # fixed sin-cos embedding
235
-
236
- self.decoder_blocks = nn.ModuleList(
237
- [Block(decoder_embed_dim, decoder_num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer)
238
- for i in range(decoder_depth)])
239
-
240
- self.decoder_norm = norm_layer(decoder_embed_dim)
241
-
242
- # project channel is different for two modality, use two projection head
243
- self.decoder_pred_a = nn.Linear(decoder_embed_dim, patch_size ** 2 * 1, bias=True) # decoder to patch
244
- self.decoder_pred_v = nn.Linear(decoder_embed_dim, patch_size ** 2 * in_chans, bias=True) # decoder to patch
245
-
246
- self.norm_pix_loss = norm_pix_loss
247
-
248
- self.initialize_weights()
249
-
250
- print('Audio Positional Embedding Shape:', self.pos_embed_a.shape)
251
- print('Visual Positional Embedding Shape:', self.pos_embed_v.shape)
252
-
253
- def initialize_weights(self):
254
- # initialize (and freeze) pos_embed with a fixed sin-cos embedding; cls-token handling added manually
255
- pos_embed_a = get_2d_sincos_pos_embed(self.pos_embed_a.shape[-1], 8, int(self.patch_embed_a.num_patches / 8),
256
- cls_token=False)
257
- self.pos_embed_a.data.copy_(torch.from_numpy(pos_embed_a).float().unsqueeze(0))
258
-
259
- pos_embed_v = get_2d_sincos_pos_embed(self.pos_embed_v.shape[-1], int(self.patch_embed_v.num_patches ** .5),
260
- int(self.patch_embed_v.num_patches ** .5), cls_token=False)
261
- self.pos_embed_v.data.copy_(torch.from_numpy(pos_embed_v).float().unsqueeze(0))
262
-
263
- decoder_pos_embed_a = get_2d_sincos_pos_embed(self.decoder_pos_embed_a.shape[-1], 8,
264
- int(self.patch_embed_a.num_patches / 8), cls_token=False)
265
- self.decoder_pos_embed_a.data.copy_(torch.from_numpy(decoder_pos_embed_a).float().unsqueeze(0))
266
-
267
- decoder_pos_embed_v = get_2d_sincos_pos_embed(self.decoder_pos_embed_v.shape[-1],
268
- int(self.patch_embed_v.num_patches ** .5),
269
- int(self.patch_embed_v.num_patches ** .5), cls_token=False)
270
- self.decoder_pos_embed_v.data.copy_(torch.from_numpy(decoder_pos_embed_v).float().unsqueeze(0))
271
-
272
- # initialize patch_embed like nn.Linear (instead of nn.Conv2d)
273
- w = self.patch_embed_a.proj.weight.data
274
- torch.nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
275
- w = self.patch_embed_v.proj.weight.data
276
- torch.nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
277
-
278
- # timm's trunc_normal_(std=.02) is effectively normal_(std=0.02) as cutoff is too big (2.)
279
- torch.nn.init.normal_(self.modality_a, std=.02)
280
- torch.nn.init.normal_(self.modality_v, std=.02)
281
- torch.nn.init.normal_(self.decoder_modality_a, std=.02)
282
- torch.nn.init.normal_(self.decoder_modality_v, std=.02)
283
- torch.nn.init.normal_(self.mask_token, std=.02)
284
-
285
- # initialize nn.Linear and nn.LayerNorm
286
- self.apply(self._init_weights)
287
-
288
- def _init_weights(self, m):
289
- if isinstance(m, nn.Linear):
290
- # we use xavier_uniform following official JAX ViT:
291
- torch.nn.init.xavier_uniform_(m.weight)
292
- if isinstance(m, nn.Linear) and m.bias is not None:
293
- nn.init.constant_(m.bias, 0)
294
- elif isinstance(m, nn.LayerNorm):
295
- nn.init.constant_(m.bias, 0)
296
- nn.init.constant_(m.weight, 1.0)
297
-
298
- def patchify(self, imgs, c, h, w, p=16):
299
- """
300
- imgs: (N, 3, H, W)
301
- x: (N, L, patch_size**2 *3)
302
- """
303
- x = imgs.reshape(shape=(imgs.shape[0], c, h, p, w, p))
304
- x = torch.einsum('nchpwq->nhwpqc', x)
305
- x = x.reshape(shape=(imgs.shape[0], h * w, p ** 2 * c))
306
- return x
307
-
308
- def unpatchify(self, x, c, h, w, p=16):
309
- """
310
- x: (N, L, patch_size**2 *3)
311
- imgs: (N, 3, H, W)
312
- """
313
- assert h * w == x.shape[1]
314
-
315
- x = x.reshape(shape=(x.shape[0], h, w, p, p, c))
316
- x = torch.einsum('nhwpqc->nchpwq', x)
317
- imgs = x.reshape(shape=(x.shape[0], c, h * p, w * p))
318
- return imgs
319
-
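patchify and unpatchify are exact inverses: the first flattens non-overlapping p x p patches into the token sequence used as the MAE reconstruction target, the second folds such a sequence back into an image. A minimal round-trip check, assuming a constructed CAVMAE instance named `model` and the default 16-pixel patches:

```python
import torch

imgs = torch.randn(2, 3, 224, 224)  # (N, C, H, W)
tokens = model.patchify(imgs, c=3, h=224 // 16, w=224 // 16, p=16)
print(tokens.shape)  # (2, 196, 768): 14*14 patches, each with 16*16*3 values

recon = model.unpatchify(tokens, c=3, h=224 // 16, w=224 // 16, p=16)
print(torch.allclose(recon, imgs))  # True: the round trip is lossless
```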
320
- def random_masking_unstructured(self, x, mask_ratio):
321
- """
322
- Perform per-sample random masking by per-sample shuffling.
323
- Per-sample shuffling is done by argsort random noise.
324
- x: [N, L, D], sequence
325
- """
326
- N, L, D = x.shape # batch, length, dim
327
- len_keep = int(L * (1 - mask_ratio))
328
-
329
- noise = torch.rand(N, L, device=x.device) # noise in [0, 1]
330
-
331
- # sort noise for each sample
332
- ids_shuffle = torch.argsort(noise, dim=1) # ascend: small is keep, large is remove
333
- ids_restore = torch.argsort(ids_shuffle, dim=1)
334
-
335
- # keep the first subset
336
- ids_keep = ids_shuffle[:, :len_keep]
337
- x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))
338
-
339
- # generate the binary mask: 0 is keep, 1 is remove
340
- mask = torch.ones([N, L], device=x.device)
341
- mask[:, :len_keep] = 0
342
- # unshuffle to get the binary mask
343
- mask = torch.gather(mask, dim=1, index=ids_restore)
344
-
345
- return x_masked, mask, ids_restore
346
-
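The masking logic above is compact but worth unpacking: sorting uniform noise per sample yields a random permutation of token indices, the first len_keep entries are the kept tokens, and ids_restore is the inverse permutation that lets the decoder re-insert mask tokens at their original positions. The same trick in isolation, on a tiny dummy sequence:

```python
import torch

N, L, D, mask_ratio = 1, 8, 4, 0.75
x = torch.arange(N * L * D, dtype=torch.float32).reshape(N, L, D)
len_keep = int(L * (1 - mask_ratio))  # keep 2 of the 8 tokens

noise = torch.rand(N, L)
ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of token indices
ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

ids_keep = ids_shuffle[:, :len_keep]
x_masked = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
print(x_masked.shape)  # (1, 2, 4): only the kept tokens remain

mask = torch.ones(N, L)
mask[:, :len_keep] = 0
mask = torch.gather(mask, 1, ids_restore)  # 0 = kept, 1 = masked, in the original token order
print(mask)
```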
347
- def random_masking_structured(self, x, mask_ratio, t=64, f=8, mode='time'):
348
- """
349
- Perform per-sample random masking by per-sample shuffling.
350
- Per-sample shuffling is done by argsort random noise.
351
- x: [N, L, D], sequence
352
- """
353
- N, L, D = x.shape # batch, length, dim
354
- len_keep = int(L * (1 - mask_ratio))
355
-
356
- noise = torch.rand(N, L, device=x.device) # noise in [0, 1]
357
- assert L == f * t
358
- noise = noise.reshape(N, f, t) # the audio patch is in shape [f,t], not [t,f]
359
- if mode == 'time':
360
- for i in range(N):
361
- mask_t_list = random.sample(range(t), int(t * mask_ratio))
362
- for k in mask_t_list:
363
- noise[i, :, k] = 1.1 # large value will be removed
364
- elif mode == 'freq':
365
- for i in range(N):
366
- mask_f_list = random.sample(range(f), int(f * mask_ratio))
367
- for k in mask_f_list:
368
- noise[i, k, :] = 1.1 # large value will be removed
369
- elif mode == 'tf':
370
- for i in range(N):
371
- mask_t_list = random.sample(range(t), int(t * mask_ratio * 0.7))
372
- for k in mask_t_list:
373
- noise[i, :, k] = 1.1 # large value will be removed
374
- for i in range(N):
375
- mask_f_list = random.sample(range(f), int(f * mask_ratio * 0.7))
376
- for k in mask_f_list:
377
- noise[i, k, :] = 1.1 # large value will be removed
378
- noise = noise.reshape(N, L)
379
-
380
- # sort noise for each sample, only need to manipulate these two: ids_shuffle, ids_restore
381
- ids_shuffle = torch.argsort(noise, dim=1) # ascend: small is keep, large is remove
382
- ids_restore = torch.argsort(ids_shuffle, dim=1)
383
-
384
- # keep the first subset
385
- ids_keep = ids_shuffle[:, :len_keep]
386
- x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))
387
-
388
- # generate the binary mask: 0 is keep, 1 is remove
389
- mask = torch.ones([N, L], device=x.device)
390
- mask[:, :len_keep] = 0
391
- # unshuffle to get the binary mask
392
- mask = torch.gather(mask, dim=1, index=ids_restore)
393
-
394
- return x_masked, mask, ids_restore
395
-
396
- def forward_encoder(self, a, v, mask_ratio_a, mask_ratio_v, mask_mode='unstructured'):
397
- # embed patches
398
- a = a.unsqueeze(1)
399
- a = a.transpose(2, 3)
400
- a = self.patch_embed_a(a)
401
- a = a + self.pos_embed_a
402
- a = a + self.modality_a
403
-
404
- v = self.patch_embed_v(v)
405
- v = v + self.pos_embed_v
406
- v = v + self.modality_v
407
-
408
- # by default, we always use unstructured masking
409
- if mask_mode == 'unstructured':
410
- a, mask_a, ids_restore_a = self.random_masking_unstructured(a, mask_ratio_a)
411
- # in ablation study, we tried time/freq/tf masking. mode in ['freq', 'time', 'tf']
412
- else:
413
- a, mask_a, ids_restore_a = self.random_masking_structured(a, mask_ratio_a, t=64, f=8, mode=mask_mode)
414
-
415
- # visual branch always use unstructured masking
416
- v, mask_v, ids_restore_v = self.random_masking_unstructured(v, mask_ratio_v)
417
-
418
- # audio and visual stream, independent blocks
419
- for blk in self.blocks_a:
420
- a = blk(a)
421
-
422
- for blk in self.blocks_v:
423
- v = blk(v)
424
-
425
- x = torch.cat((a, v), dim=1)
426
-
427
- # unified stream, shared blocks_u, but independent normalization layers
428
- for blk in self.blocks_u:
429
- x = blk(x)
430
- x = self.norm(x)
431
-
432
- for blk in self.blocks_u:
433
- ca = blk(a, 'a')
434
- ca = self.norm_a(ca)
435
-
436
- for blk in self.blocks_u:
437
- cv = blk(v, 'v')
438
- cv = self.norm_v(cv)
439
-
440
- return x, mask_a, ids_restore_a, mask_v, ids_restore_v, ca, cv
441
-
442
- def forward_decoder(self, x, mask_a, ids_restore_a, mask_v, ids_restore_v):
443
-
444
- x = self.decoder_embed(x)
445
-
446
- # append mask tokens to sequence
447
- # mask_tokens_a has shape [B, #a_mask_token, mask_token_dim]; the number of masked tokens is read from mask_a[0]
- # (the first example of the batch), since every sample in the batch has the same number of masked tokens
448
- mask_tokens_a = self.mask_token.repeat(x.shape[0], int(mask_a[0].sum()), 1)
449
- a_ = torch.cat([x[:, :self.patch_embed_a.num_patches - int(mask_a[0].sum()), :], mask_tokens_a],
450
- dim=1) # no cls token
451
- a_ = torch.gather(a_, dim=1, index=ids_restore_a.unsqueeze(-1).repeat(1, 1, x.shape[2])) # unshuffle
452
-
453
- # similar for the visual modality
454
- mask_tokens_v = self.mask_token.repeat(x.shape[0], int(mask_v[0].sum()), 1)
455
- v_ = torch.cat([x[:, self.patch_embed_a.num_patches - int(mask_a[0].sum()):, :], mask_tokens_v],
456
- dim=1) # no cls token
457
- v_ = torch.gather(v_, dim=1, index=ids_restore_v.unsqueeze(-1).repeat(1, 1, x.shape[2])) # unshuffle
458
-
459
- # concatenate audio and visual tokens
460
- x = torch.cat([a_, v_], dim=1)
461
-
462
- decoder_pos_embed = torch.cat([self.decoder_pos_embed_a, self.decoder_pos_embed_v], dim=1)
463
- x = x + decoder_pos_embed
464
-
465
- # add modality indication tokens
466
- x[:, 0:self.patch_embed_a.num_patches, :] = x[:, 0:self.patch_embed_a.num_patches, :] + self.decoder_modality_a
467
- x[:, self.patch_embed_a.num_patches:, :] = x[:, self.patch_embed_a.num_patches:, :] + self.decoder_modality_v
468
-
469
- # apply Transformer blocks
470
- for blk in self.decoder_blocks:
471
- x = blk(x)
472
- x = self.decoder_norm(x)
473
-
474
- # predictor projection
475
- x_a = self.decoder_pred_a(x[:, :self.patch_embed_a.num_patches, :])
476
- x_v = self.decoder_pred_v(x[:, self.patch_embed_a.num_patches:, :])
477
-
478
- # return audio and video tokens
479
- return x_a, x_v
480
-
481
- def forward_contrastive(self, audio_rep, video_rep, bidirect_contrast=False):
482
- # calculate nce loss for mean-visual representation and mean-audio representation
483
-
484
- audio_rep = torch.nn.functional.normalize(audio_rep, dim=-1)
485
- video_rep = torch.nn.functional.normalize(video_rep, dim=-1)
486
-
487
- total = torch.mm(audio_rep, torch.transpose(video_rep, 0, 1)) / 0.05
488
-
489
- # by default we use single directional
490
- if bidirect_contrast == False:
491
- nce = -torch.mean(torch.diag(torch.nn.functional.log_softmax(total, dim=0)))
492
- c_acc = torch.sum(torch.eq(torch.argmax(torch.nn.functional.softmax(total, dim=0), dim=0),
493
- torch.arange(0, total.shape[0], device=audio_rep.device))) / total.shape[0]
494
- return nce, c_acc
495
- else:
496
- nce_1 = -torch.mean(torch.diag(torch.nn.functional.log_softmax(total, dim=0)))
497
- nce_2 = -torch.mean(torch.diag(torch.nn.functional.log_softmax(total.t(), dim=0)))
498
- c_acc_1 = torch.sum(torch.eq(torch.argmax(torch.nn.functional.softmax(total, dim=0), dim=0),
499
- torch.arange(0, total.shape[0], device=audio_rep.device))) / total.shape[0]
500
- c_acc_2 = torch.sum(torch.eq(torch.argmax(torch.nn.functional.softmax(total.t(), dim=0), dim=0),
501
- torch.arange(0, total.shape[0], device=audio_rep.device))) / total.shape[0]
502
- nce = (nce_1 + nce_2) / 2
503
- c_acc = (c_acc_1 + c_acc_2) / 2
504
- return nce, c_acc
505
-
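forward_contrastive is an InfoNCE loss over the batch: clip-level audio and visual embeddings are L2-normalized, their pairwise similarities (scaled by a 0.05 temperature) form a B x B matrix, and the loss pushes the matched diagonal entries to dominate their columns. The same computation on random features, for illustration only:

```python
import torch
import torch.nn.functional as F

B, D = 4, 768
audio_rep = F.normalize(torch.randn(B, D), dim=-1)
video_rep = F.normalize(torch.randn(B, D), dim=-1)

sim = audio_rep @ video_rep.t() / 0.05                # (B, B) similarities, temperature 0.05
nce = -torch.diag(F.log_softmax(sim, dim=0)).mean()   # matched pairs lie on the diagonal
acc = (sim.argmax(dim=0) == torch.arange(B)).float().mean()
print(nce.item(), acc.item())
```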
506
- def forward_mae_loss(self, input, pred, mask, modality):
507
- if modality == 'a':
508
- # for audio, need to adjust the shape
509
- input = input.unsqueeze(1)
510
- input = input.transpose(2, 3)
511
- target = self.patchify(input, 1, int(input.shape[2] / self.patch_embed_a.patch_size[0]),
512
- int(input.shape[3] / self.patch_embed_a.patch_size[1]), 16)
513
- elif modality == 'v':
514
- target = self.patchify(input, 3, int(input.shape[2] / self.patch_embed_v.patch_size[0]),
515
- int(input.shape[3] / self.patch_embed_v.patch_size[1]), 16)
516
-
517
- # patch-wise normalization might minorly improve the classification performance, but will make the model lose inpainting function
518
- if self.norm_pix_loss:
519
- mean = target.mean(dim=-1, keepdim=True)
520
- var = target.var(dim=-1, keepdim=True)
521
- target = (target - mean) / (var + 1.e-6) ** .5
522
-
523
- loss = (pred - target) ** 2
524
- loss = loss.mean(dim=-1) # [N, L], mean loss per patch
525
-
526
- loss = (loss * mask).sum() / mask.sum() # mean loss on removed patches
527
- return loss
528
-
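The reconstruction objective above is a mean-squared error in patch space, averaged only over the patches that were actually removed (the mask is 1 on removed patches and 0 on visible ones). The reduction on dummy tensors, as a sketch:

```python
import torch

N, L, P = 2, 512, 256  # batch, patches, values per patch
pred = torch.randn(N, L, P)
target = torch.randn(N, L, P)
mask = (torch.rand(N, L) > 0.25).float()  # 1 = masked / reconstructed, 0 = visible

loss = ((pred - target) ** 2).mean(dim=-1)  # per-patch MSE, shape (N, L)
loss = (loss * mask).sum() / mask.sum()     # average over masked patches only
print(loss.item())
```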
529
- def forward(self, audio, imgs, mask_ratio_a=0.75, mask_ratio_v=0.75, mae_loss_weight=1., contrast_loss_weight=0.01,
530
- mask_mode='unstructured'):
531
- # latent is used for reconstruction (mae), latent_c_{a,v} are used for contrastive learning
532
- latent, mask_a, ids_restore_a, mask_v, ids_restore_v, latent_c_a, latent_c_v = self.forward_encoder(audio, imgs,
533
- mask_ratio_a,
534
- mask_ratio_v,
535
- mask_mode=mask_mode)
536
- # if mae loss is used
537
- if mae_loss_weight != 0:
538
- pred_a, pred_v = self.forward_decoder(latent, mask_a, ids_restore_a, mask_v, ids_restore_v)
539
- loss_mae_a = self.forward_mae_loss(audio, pred_a, mask_a, 'a')
540
- loss_mae_v = self.forward_mae_loss(imgs, pred_v, mask_v, 'v')
541
- loss_mae = mae_loss_weight * (loss_mae_a + loss_mae_v)
542
- else:
543
- loss_mae_a, loss_mae_v, loss_mae = torch.tensor(0.0, device=audio.device), torch.tensor(0.0,
544
- device=audio.device), torch.tensor(
545
- 0.0, device=audio.device)
546
-
547
- # if contrastive loss is used
548
- if contrast_loss_weight != 0:
549
- # note this is single directional
550
- loss_c, c_acc = self.forward_contrastive(latent_c_a.mean(dim=1), latent_c_v.mean(dim=1))
551
- loss_c = contrast_loss_weight * loss_c
552
- else:
553
- loss_c, c_acc = torch.tensor(0.0, device=audio.device), torch.tensor(0.0, device=audio.device)
554
-
555
- loss = loss_mae + loss_c
556
-
557
- return loss, loss_mae, loss_mae_a, loss_mae_v, loss_c, mask_a, mask_v, c_acc
558
-
559
- # used only for inpainting, ignore if inpainting is not of interest
560
- def forward_inpaint(self, audio, imgs, mask_ratio_a=0.75, mask_ratio_v=0.75, mask_mode='unstructured'):
561
- latent, mask_a, ids_restore_a, mask_v, ids_restore_v, latent_c_a, latent_c_v = self.forward_encoder(audio, imgs,
562
- mask_ratio_a,
563
- mask_ratio_v,
564
- mask_mode=mask_mode)
565
- pred_a, pred_v = self.forward_decoder(latent, mask_a, ids_restore_a, mask_v, ids_restore_v) # [N, L, p*p*3]
566
- loss_pixel_a = self.forward_mae_loss(audio, pred_a, mask_a, 'a')
567
- loss_pixel_v = self.forward_mae_loss(imgs, pred_v, mask_v, 'v')
568
- return pred_a, pred_v, mask_a, mask_v, loss_pixel_a, loss_pixel_v
569
-
570
- # used for retrieval, ignore if retrieval is not of interest
571
- def forward_feat(self, a, v):
572
- # embed patches
573
- a = a.unsqueeze(1)
574
- a = a.transpose(2, 3)
575
- a = self.patch_embed_a(a)
576
- a = a + self.pos_embed_a
577
- a = a + self.modality_a
578
-
579
- v = self.patch_embed_v(v)
580
- v = v + self.pos_embed_v
581
- v = v + self.modality_v
582
-
583
- # the modality-specific stream
584
- for blk in self.blocks_a:
585
- a = blk(a)
586
-
587
- for blk in self.blocks_v:
588
- v = blk(v)
589
-
590
- # use modality specific normalization,
591
- for blk in self.blocks_u:
592
- a = blk(a, 'a')
593
- a = self.norm_a(a)
594
-
595
- for blk in self.blocks_u:
596
- v = blk(v, 'v')
597
- v = self.norm_v(v)
598
- return a, v
599
-
600
- def forward_audio(self, a):
601
- # embed patches
602
- a = a.unsqueeze(1)
603
- a = a.transpose(2, 3)
604
- a = self.patch_embed_a(a)
605
- a = a + self.pos_embed_a
606
- a = a + self.modality_a
607
-
608
- # the modality-specific stream
609
- for blk in self.blocks_a:
610
- a = blk(a)
611
-
612
- # use modality specific normalization,
613
- for blk in self.blocks_u:
614
- a = blk(a, 'a')
615
- a = self.norm_a(a)
616
-
617
- return a.reshape(a.shape[0], 128 // 16, 1024 // 16, 768).permute(0, 3, 1, 2)
618
-
619
- def forward_video(self, v):
620
- v = self.patch_embed_v(v)
621
- v = v + self.pos_embed_v
622
- v = v + self.modality_v
623
-
624
- for blk in self.blocks_v:
625
- v = blk(v)
626
-
627
- for blk in self.blocks_u:
628
- v = blk(v, 'v')
629
- v = self.norm_v(v)
630
- return v.reshape(v.shape[0], 224 // 16, 224 // 16, 768).permute(0, 3, 1, 2)
631
-
632
-
633
- # the finetuned CAV-MAE model
634
- class CAVMAEFT(nn.Module):
635
- def __init__(self, label_dim, img_size=224, audio_length=1024, patch_size=16, in_chans=3,
636
- embed_dim=768, modality_specific_depth=11, num_heads=12, mlp_ratio=4., norm_layer=nn.LayerNorm,
637
- norm_pix_loss=False, tr_pos=True):
638
- super().__init__()
639
- timm.models.vision_transformer.Block = Block
640
- print('Use norm_pix_loss: ', norm_pix_loss)
641
-
642
- timm.models.vision_transformer.PatchEmbed = PatchEmbed
643
- timm.models.vision_transformer.Block = Block
644
-
645
- self.patch_embed_a = PatchEmbed(img_size, patch_size, 1, embed_dim)
646
- self.patch_embed_v = PatchEmbed(img_size, patch_size, in_chans, embed_dim)
647
-
648
- self.patch_embed_a.num_patches = int(audio_length * 128 / 256)
649
- print('Number of Audio Patches: {:d}, Visual Patches: {:d}'.format(self.patch_embed_a.num_patches,
650
- self.patch_embed_v.num_patches))
651
-
652
- self.modality_a = nn.Parameter(torch.zeros(1, 1, embed_dim))
653
- self.modality_v = nn.Parameter(torch.zeros(1, 1, embed_dim))
654
-
655
- self.pos_embed_a = nn.Parameter(torch.zeros(1, self.patch_embed_a.num_patches, embed_dim),
656
- requires_grad=tr_pos) # fixed sin-cos embedding
657
- self.pos_embed_v = nn.Parameter(torch.zeros(1, self.patch_embed_v.num_patches, embed_dim),
658
- requires_grad=tr_pos) # fixed sin-cos embedding
659
-
660
- self.blocks_a = nn.ModuleList(
661
- [Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in
662
- range(modality_specific_depth)])
663
- self.blocks_v = nn.ModuleList(
664
- [Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in
665
- range(modality_specific_depth)])
666
- self.blocks_u = nn.ModuleList(
667
- [Block(embed_dim, num_heads, mlp_ratio, qkv_bias=True, qk_scale=None, norm_layer=norm_layer) for i in
668
- range(12 - modality_specific_depth)])
669
-
670
- self.norm_a = norm_layer(embed_dim)
671
- self.norm_v = norm_layer(embed_dim)
672
- self.norm = norm_layer(embed_dim)
673
-
674
- self.mlp_head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, label_dim))
675
-
676
- self.initialize_weights()
677
-
678
- print('Audio Positional Embedding Shape:', self.pos_embed_a.shape)
679
- print('Visual Positional Embedding Shape:', self.pos_embed_v.shape)
680
-
681
- def get_patch_num(self, input_shape, stride):
682
- test_input = torch.zeros(1, 1, input_shape[0], input_shape[1])
683
- test_proj = torch.nn.Conv2d(1, 4, kernel_size=(16, 16), stride=(stride, stride))
684
- test_output = test_proj(test_input)
685
- print(test_output.shape)
686
- return test_output.shape[2], test_output.shape[3], test_output.shape[2] * test_output.shape[3]
687
-
688
- def initialize_weights(self):
689
- pos_embed_a = get_2d_sincos_pos_embed(self.pos_embed_a.shape[-1], 8, int(self.patch_embed_a.num_patches / 8),
690
- cls_token=False)
691
- self.pos_embed_a.data.copy_(torch.from_numpy(pos_embed_a).float().unsqueeze(0))
692
-
693
- pos_embed_v = get_2d_sincos_pos_embed(self.pos_embed_v.shape[-1], int(self.patch_embed_v.num_patches ** .5),
694
- int(self.patch_embed_v.num_patches ** .5), cls_token=False)
695
- self.pos_embed_v.data.copy_(torch.from_numpy(pos_embed_v).float().unsqueeze(0))
696
-
697
- w = self.patch_embed_a.proj.weight.data
698
- torch.nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
699
- w = self.patch_embed_v.proj.weight.data
700
- torch.nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
701
-
702
- torch.nn.init.normal_(self.modality_a, std=.02)
703
- torch.nn.init.normal_(self.modality_v, std=.02)
704
-
705
- self.apply(self._init_weights)
706
-
707
- def _init_weights(self, m):
708
- if isinstance(m, nn.Linear):
709
- # we use xavier_uniform following official JAX ViT:
710
- torch.nn.init.xavier_uniform_(m.weight)
711
- if isinstance(m, nn.Linear) and m.bias is not None:
712
- nn.init.constant_(m.bias, 0)
713
- elif isinstance(m, nn.LayerNorm):
714
- nn.init.constant_(m.bias, 0)
715
- nn.init.constant_(m.weight, 1.0)
716
-
717
- def forward(self, a, v, mode):
718
- # multi-modal fine-tuning, our default method for fine-tuning
719
- if mode == 'multimodal':
720
- a = a.unsqueeze(1)
721
- a = a.transpose(2, 3)
722
- a = self.patch_embed_a(a)
723
- a = a + self.pos_embed_a
724
- a = a + self.modality_a
725
-
726
- v = self.patch_embed_v(v)
727
- v = v + self.pos_embed_v
728
- v = v + self.modality_v
729
-
730
- for blk in self.blocks_a:
731
- a = blk(a)
732
-
733
- for blk in self.blocks_v:
734
- v = blk(v)
735
-
736
- x = torch.cat((a, v), dim=1)
737
-
738
- for blk in self.blocks_u:
739
- x = blk(x)
740
- x = self.norm(x)
741
-
742
- x = x.mean(dim=1)
743
- x = self.mlp_head(x)
744
- return x
745
-
746
- # finetune with only audio (and inference with only audio when the model is finetuned with only audio)
747
- elif mode == 'audioonly':
748
- a = a.unsqueeze(1)
749
- a = a.transpose(2, 3)
750
- a = self.patch_embed_a(a)
751
- a = a + self.pos_embed_a
752
- a = a + self.modality_a
753
-
754
- for blk in self.blocks_a:
755
- a = blk(a)
756
-
757
- # note here uses the 'a' normalization, it is used in both training and inference, so it is fine
758
- for blk in self.blocks_u:
759
- a = blk(a, 'a')
760
- a = self.norm_a(a)
761
- x = a.mean(dim=1)
762
- x = self.mlp_head(x)
763
- return x
764
-
765
- # finetune with only image (and inference with only audio when the model is finetuned with only image)
766
- elif mode == 'videoonly':
767
- v = self.patch_embed_v(v)
768
- v = v + self.pos_embed_v
769
- v = v + self.modality_v
770
-
771
- for blk in self.blocks_v:
772
- v = blk(v)
773
-
774
- # note here uses the 'v' normalization, it is used in both training and inference, so it is fine
775
- for blk in self.blocks_u:
776
- v = blk(v, 'v')
777
- v = self.norm_v(v)
778
- x = v.mean(dim=1)
779
- x = self.mlp_head(x)
780
- return x
781
-
782
- # used in case that the model is finetuned with both modality, but in inference only audio is given
783
- elif mode == 'missingaudioonly':
784
- a = a.unsqueeze(1)
785
- a = a.transpose(2, 3)
786
- a = self.patch_embed_a(a)
787
- a = a + self.pos_embed_a
788
- a = a + self.modality_a
789
-
790
- for blk in self.blocks_a:
791
- a = blk(a)
792
-
793
- # two forward passes to the block_u, one with modality-specific normalization, another with unified normalization
794
- u = a
795
- for blk in self.blocks_u:
796
- u = blk(u) # note here use unified normalization
797
- u = self.norm(u)
798
- u = u.mean(dim=1)
799
-
800
- for blk in self.blocks_u:
801
- a = blk(a, 'a') # note here use modality-specific normalization
802
- a = self.norm_a(a)
803
- a = a.mean(dim=1)
804
-
805
- # average the output of the two forward passes
806
- x = (u + a) / 2
807
- x = self.mlp_head(x)
808
- return x
809
-
810
- # used in case that the model is fine-tuned with both modality, but in inference only image is given
811
- elif mode == 'missingvideoonly':
812
- v = self.patch_embed_v(v)
813
- v = v + self.pos_embed_v
814
- v = v + self.modality_v
815
-
816
- for blk in self.blocks_v:
817
- v = blk(v)
818
-
819
- # two forward passes to the block_u, one with modality-specific normalization, another with unified normalization
820
- u = v
821
- for blk in self.blocks_u:
822
- u = blk(u) # note here use unified normalization
823
- u = self.norm(u)
824
- u = u.mean(dim=1)
825
-
826
- for blk in self.blocks_u:
827
- v = blk(v, 'v') # note here use modality-specific normalization
828
- v = self.norm_v(v)
829
- v = v.mean(dim=1)
830
-
831
- # average the output of the two forward passes
832
- x = (u + v) / 2
833
- x = self.mlp_head(x)
834
- return x
835
-
836
- # for retrieval
837
- def forward_feat(self, a, v, mode='av'):
838
- # return both audio and visual
839
- if mode == 'av':
840
- a = a.unsqueeze(1)
841
- a = a.transpose(2, 3)
842
- a = self.patch_embed_a(a)
843
- a = a + self.pos_embed_a
844
- a = a + self.modality_a
845
-
846
- v = self.patch_embed_v(v)
847
- v = v + self.pos_embed_v
848
- v = v + self.modality_v
849
-
850
- for blk in self.blocks_a:
851
- a = blk(a)
852
-
853
- for blk in self.blocks_v:
854
- v = blk(v)
855
-
856
- for blk in self.blocks_u:
857
- a = blk(a, 'a')
858
- a = self.norm_a(a)
859
-
860
- for blk in self.blocks_u:
861
- v = blk(v, 'v')
862
-
863
- v = self.norm_v(v)
864
- return a, v
865
-
866
- # return only audio
867
- if mode == 'a':
868
- a = a.unsqueeze(1)
869
- a = a.transpose(2, 3)
870
- a = self.patch_embed_a(a)
871
- a = a + self.pos_embed_a
872
- a = a + self.modality_a
873
-
874
- for blk in self.blocks_a:
875
- a = blk(a)
876
-
877
- for blk in self.blocks_u:
878
- a = blk(a, 'a')
879
-
880
- a = self.norm_a(a)
881
- return a
882
-
883
-
884
- def _wav2fbank(filename):
885
- waveform, sr = torchaudio.load(filename)
886
- waveform = torchaudio.functional.resample(
887
- waveform, orig_freq=sr, new_freq=16000
888
- )
889
-
890
- waveform = waveform - waveform.mean()
891
892
- print(sr)
893
-
894
- fbank = torchaudio.compliance.kaldi.fbank(
895
- waveform,
896
- htk_compat=True,
897
- sample_frequency=16000,  # the waveform was resampled to 16 kHz above
898
- use_energy=False,
899
- window_type='hanning',
900
- num_mel_bins=128,
901
- dither=0.0,
902
- frame_shift=10)
903
-
904
- target_length = 1024
905
- n_frames = fbank.shape[0]
906
-
907
- p = target_length - n_frames
908
-
909
- # cut and pad
910
- if p > 0:
911
- m = torch.nn.ZeroPad2d((0, 0, 0, p))
912
- fbank = m(fbank)
913
- elif p < 0:
914
- fbank = fbank[0:target_length, :]
915
-
916
- return fbank
917
-
918
-
919
- def pca(image_feats_list, dim=3, fit_pca=None):
920
- from sklearn.decomposition import PCA
921
-
922
- device = image_feats_list[0].device
923
-
924
- def flatten(tensor, target_size=None):
925
- if target_size is not None and fit_pca is None:
926
- tensor = F.interpolate(tensor, (target_size, target_size), mode="bilinear")
927
- B, C, H, W = tensor.shape
928
- return tensor.permute(1, 0, 2, 3).reshape(C, B * H * W).permute(1, 0).detach().cpu()
929
-
930
- if len(image_feats_list) > 1 and fit_pca is None:
931
- target_size = image_feats_list[0].shape[2]
932
- else:
933
- target_size = None
934
-
935
- flattened_feats = []
936
- for feats in image_feats_list:
937
- flattened_feats.append(flatten(feats, target_size))
938
- x = torch.cat(flattened_feats, dim=0)
939
-
940
- if fit_pca is None:
941
- fit_pca = PCA(n_components=dim).fit(x)
942
-
943
- reduced_feats = []
944
- for feats in image_feats_list:
945
- x_red = torch.from_numpy(fit_pca.transform(flatten(feats)))
946
- x_red -= x_red.min(dim=0, keepdim=True).values
947
- x_red /= x_red.max(dim=0, keepdim=True).values
948
- B, C, H, W = feats.shape
949
- reduced_feats.append(x_red.reshape(B, H, W, dim).permute(0, 3, 1, 2).to(device))
950
-
951
- return reduced_feats, fit_pca
952
-
953
-
954
- class CAVMAEAudioFeaturizer(nn.Module):
955
-
956
- def __init__(self, output_path, model_name="base", model=None):
957
- super().__init__()
958
- if model is not None:
959
- self.model = model
960
- else:
961
- if model_name == "base":
962
- model_path = os.path.join(output_path, 'models/audio_model.21.pth')
963
- else:
964
- raise ValueError(f"Unknown model type {model_name}")
965
-
966
- audio_model = CAVMAE(
967
- audio_length=1024,
968
- modality_specific_depth=11,
969
- norm_pix_loss=True,
970
- tr_pos=False)
971
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
972
- mdl_weight = torch.load(model_path, map_location=device)
973
- audio_model = torch.nn.DataParallel(audio_model)
974
- audio_model.load_state_dict(mdl_weight, strict=True)
975
- self.model = audio_model.module.cuda()
976
-
977
- def forward(self, audio, include_cls):
978
- cls_token = None
979
- patch_tokens = self.model.forward_audio(audio.squeeze(1))
980
-
981
- if include_cls:
982
- return patch_tokens, cls_token
983
- else:
984
- return patch_tokens
985
-
986
-
987
- class CAVMAEImageFeaturizer(nn.Module):
988
-
989
- def __init__(self, output_path, model=None, model_name="base"):
990
- super().__init__()
991
- if model is not None:
992
- self.model: CAVMAE = model
993
- else:
994
- if model_name == "base":
995
- model_path = os.path.join(output_path, 'models/audio_model.21.pth')
996
- else:
997
- raise ValueError(f"Unknown model type {model_name}")
998
-
999
- audio_model = CAVMAE(
1000
- audio_length=1024,
1001
- modality_specific_depth=11,
1002
- norm_pix_loss=True,
1003
- tr_pos=False)
1004
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
1005
- mdl_weight = torch.load(model_path, map_location=device)
1006
- audio_model = torch.nn.DataParallel(audio_model)
1007
- audio_model.load_state_dict(mdl_weight, strict=True)
1008
- self.model: CAVMAE = audio_model.module.cuda()
1009
-
1010
- def forward(self, image, include_cls):
1011
- cls_token = None
1012
- patch_tokens = self.model.forward_video(image)
1013
-
1014
- if include_cls:
1015
- return patch_tokens, cls_token
1016
- else:
1017
- return patch_tokens
1018
-
1019
-
1020
- if __name__ == "__main__":
1021
- model_path = os.path.join("../../", 'models/audio_model.21.pth')
1022
- audio_model = CAVMAE(
1023
- audio_length=1024,
1024
- modality_specific_depth=11,
1025
- norm_pix_loss=True,
1026
- tr_pos=False)
1027
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
1028
- mdl_weight = torch.load(model_path, map_location=device)
1029
- audio_model = torch.nn.DataParallel(audio_model)
1030
- audio_model.load_state_dict(mdl_weight, strict=True)
1031
- model: CAVMAE = audio_model.module.cuda()
1032
-
1033
- image_paths = ["../../samples/dog_image.jpg", "../../samples/car_image.jpg", "../../samples/bird_image.jpg"]
1034
- audio_paths = ["../../samples/dog_audio.wav", "../../samples/car_audio.wav", "../../samples/bird_audio.wav"]
1035
-
1036
- images = []
1037
- audios = []
1038
-
1039
- for image_path in image_paths:
1040
- image = Image.open(image_path).convert("RGB")
1041
- preprocess = T.Compose([
1042
- T.Resize(224, interpolation=Image.BICUBIC),
1043
- T.CenterCrop(224),
1044
- T.ToTensor(),
1045
- T.Normalize(
1046
- mean=[0.4850, 0.4560, 0.4060],
1047
- std=[0.2290, 0.2240, 0.2250]
1048
- )])
1049
- images.append(preprocess(image).unsqueeze(0).cuda())
1050
-
1051
- for audio_path in audio_paths:
1052
- a = _wav2fbank(audio_path).cuda().unsqueeze(0)
1053
- a = (a + 5.081) / (4.4849)
1054
- audios.append(a)
1055
-
1056
- audio_feats, image_feats = model.forward_feat(
1057
- torch.cat(audios, dim=0), torch.cat(images, dim=0))
1058
-
1059
- audio_feats = F.normalize(audio_feats.mean(1), dim=1)
1060
- image_feats = F.normalize(image_feats.mean(1), dim=1)
1061
-
1062
- sims = torch.einsum("bc,dc->bd", image_feats, audio_feats)
1063
- print(sims)
1064
-
1065
- print("here")
1066
-
1067
- # a_feat = F.normalize(a_feat, dim=1)
1068
- # v_feat = F.normalize(v_feat, dim=1)
1069
-
1070
- # [red_v_feat, red_a_feat], fit_pca = pca([v_feat, a_feat])
1071
- #
1072
- # [red_v_feat], fit_pca = pca([v_feat])
1073
- # [red_a_feat], fit_pca = pca([a_feat])
1074
- #
1075
- # import matplotlib.pyplot as plt
1076
- #
1077
- # fig, ax = plt.subplots(1, 2, figsize=(2 * 5, 5))
1078
- # ax[0].imshow(red_v_feat[0].permute(1, 2, 0).cpu())
1079
- # ax[1].imshow(red_a_feat[0].permute(1, 2, 0).cpu())
1080
- # plt.tight_layout()
1081
- # plt.show()
1082
- # print("here")
 
DenseAV/denseav/featurizers/CLIP.py DELETED
@@ -1,50 +0,0 @@
1
- import clip
2
- import torch
3
- from torch import nn
4
-
5
-
6
- class CLIPFeaturizer(nn.Module):
7
-
8
- def __init__(self):
9
- super().__init__()
10
- self.model, self.preprocess = clip.load("ViT-B/16", device="cpu")
11
- self.model.eval().cuda()
12
- self.config = {}
13
-
14
- def get_cls_token(self, img):
15
- return self.model.encode_image(img).to(torch.float32)
16
-
17
- def forward(self, img, include_cls):
18
- features = self.model.get_visual_features(img, include_cls)
19
- new_features = []
20
- for i in range(2):
21
- t = features[i]
22
- if isinstance(t, torch.Tensor):
23
- new_features.append(t.to(torch.float32))
24
- else:
25
- new_features.append(t)
26
-
27
- return new_features
28
-
29
-
30
- if __name__ == "__main__":
31
- import torchvision.transforms as T
32
- from PIL import Image
33
- from shared import norm, crop_to_divisor
34
-
35
- device = "cuda" if torch.cuda.is_available() else "cpu"
36
-
37
- image = Image.open("../samples/lex1.jpg")
38
- load_size = 224 # * 3
39
- transform = T.Compose([
40
- T.Resize(load_size, Image.BILINEAR),
41
- # T.CenterCrop(load_size),
42
- T.ToTensor(),
43
- lambda x: crop_to_divisor(x, 16),
44
- norm])
45
-
46
- model = CLIPFeaturizer().cuda()
47
-
48
- results = model(transform(image).cuda().unsqueeze(0))
49
-
50
- print(clip.available_models())
 
DenseAV/denseav/featurizers/DAVENet.py DELETED
@@ -1,162 +0,0 @@
1
- # Author: David Harwath
2
- import torch
3
- import torch.nn as nn
4
- import torch.nn.functional
5
- import torch.nn.functional
6
- import torch.nn.functional as F
7
- import torch.utils.model_zoo as model_zoo
8
- import torchvision.models as imagemodels
9
-
10
-
11
- class Davenet(nn.Module):
12
- def __init__(self, embedding_dim=1024):
13
- super(Davenet, self).__init__()
14
- self.embedding_dim = embedding_dim
15
- self.batchnorm1 = nn.BatchNorm2d(1)
16
- self.conv1 = nn.Conv2d(1, 128, kernel_size=(40, 1), stride=(1, 1), padding=(0, 0))
17
- self.conv2 = nn.Conv2d(128, 256, kernel_size=(1, 11), stride=(1, 1), padding=(0, 5))
18
- self.conv3 = nn.Conv2d(256, 512, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
19
- self.conv4 = nn.Conv2d(512, 512, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
20
- self.conv5 = nn.Conv2d(512, embedding_dim, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
21
- self.pool = nn.MaxPool2d(kernel_size=(1, 3), stride=(1, 2), padding=(0, 1))
22
-
23
- def forward(self, x):
24
- if x.dim() == 3:
25
- x = x.unsqueeze(1)
26
- x = self.batchnorm1(x)
27
- x = F.relu(self.conv1(x))
28
- x = F.relu(self.conv2(x))
29
- x = self.pool(x)
30
- x = F.relu(self.conv3(x))
31
- x = self.pool(x)
32
- x = F.relu(self.conv4(x))
33
- x = self.pool(x)
34
- x = F.relu(self.conv5(x))
35
- x = self.pool(x)
36
- x = x.squeeze(2)
37
- return x
38
-
39
-
40
- class Resnet18(imagemodels.ResNet):
41
- def __init__(self, embedding_dim=1024, pretrained=False):
42
- super(Resnet18, self).__init__(imagemodels.resnet.BasicBlock, [2, 2, 2, 2])
43
- if pretrained:
44
- self.load_state_dict(model_zoo.load_url(imagemodels.resnet.model_urls['resnet18']))
45
- self.avgpool = None
46
- self.fc = None
47
- self.embedder = nn.Conv2d(512, embedding_dim, kernel_size=1, stride=1, padding=0)
48
- self.embedding_dim = embedding_dim
49
- self.pretrained = pretrained
50
-
51
- def forward(self, x):
52
- x = self.conv1(x)
53
- x = self.bn1(x)
54
- x = self.relu(x)
55
- x = self.maxpool(x)
56
- x = self.layer1(x)
57
- x = self.layer2(x)
58
- x = self.layer3(x)
59
- x = self.layer4(x)
60
- x = self.embedder(x)
61
- return x
62
-
63
-
64
- class Resnet34(imagemodels.ResNet):
65
- def __init__(self, embedding_dim=1024, pretrained=False):
66
- super(Resnet34, self).__init__(imagemodels.resnet.BasicBlock, [3, 4, 6, 3])
67
- if pretrained:
68
- self.load_state_dict(model_zoo.load_url(imagemodels.resnet.model_urls['resnet34']))
69
- self.avgpool = None
70
- self.fc = None
71
- self.embedder = nn.Conv2d(512, embedding_dim, kernel_size=1, stride=1, padding=0)
72
-
73
- def forward(self, x):
74
- x = self.conv1(x)
75
- x = self.bn1(x)
76
- x = self.relu(x)
77
- x = self.maxpool(x)
78
- x = self.layer1(x)
79
- x = self.layer2(x)
80
- x = self.layer3(x)
81
- x = self.layer4(x)
82
- x = self.embedder(x)
83
- return x
84
-
85
-
86
- class Resnet50(imagemodels.ResNet):
87
- def __init__(self, embedding_dim=1024, pretrained=False):
88
- super(Resnet50, self).__init__(imagemodels.resnet.Bottleneck, [3, 4, 6, 3])
89
- if pretrained:
90
- self.load_state_dict(model_zoo.load_url(imagemodels.resnet.model_urls['resnet50']))
91
- self.avgpool = None
92
- self.fc = None
93
- self.embedder = nn.Conv2d(2048, embedding_dim, kernel_size=1, stride=1, padding=0)
94
-
95
- def forward(self, x):
96
- x = self.conv1(x)
97
- x = self.bn1(x)
98
- x = self.relu(x)
99
- x = self.maxpool(x)
100
- x = self.layer1(x)
101
- x = self.layer2(x)
102
- x = self.layer3(x)
103
- x = self.layer4(x)
104
- x = self.embedder(x)
105
- return x
106
-
107
-
108
- class VGG16(nn.Module):
109
- def __init__(self, embedding_dim=1024, pretrained=False):
110
- super(VGG16, self).__init__()
111
- seed_model = imagemodels.__dict__['vgg16'](pretrained=pretrained).features
112
- seed_model = nn.Sequential(*list(seed_model.children())[:-1]) # remove final maxpool
113
- last_layer_index = len(list(seed_model.children()))
114
- seed_model.add_module(str(last_layer_index),
115
- nn.Conv2d(512, embedding_dim, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)))
116
- self.image_model = seed_model
117
-
118
- def forward(self, x):
119
- x = self.image_model(x)
120
- return x
121
-
122
-
123
- def prep(dict):
124
- return {k.replace("module.", ""): v for k, v in dict.items()}
125
-
126
-
127
- class DavenetAudioFeaturizer(nn.Module):
128
-
129
- def __init__(self):
130
- super().__init__()
131
- self.audio_model = Davenet()
132
- self.audio_model.load_state_dict(prep(torch.load("../models/davenet_pt_audio.pth")))
133
-
134
- def forward(self, audio, include_cls):
135
- patch_tokens = self.audio_model(audio).unsqueeze(2)
136
-
137
- if include_cls:
138
- return patch_tokens, None
139
- else:
140
- return patch_tokens
141
-
142
- def get_last_params(self):
143
- return []
144
-
145
-
146
- class DavenetImageFeaturizer(nn.Module):
147
-
148
- def __init__(self):
149
- super().__init__()
150
- self.image_model = VGG16()
151
- self.image_model.load_state_dict(prep(torch.load("../models/davenet_pt_image.pth")))
152
-
153
- def forward(self, image, include_cls):
154
- patch_tokens = self.image_model(image)
155
-
156
- if include_cls:
157
- return patch_tokens, None
158
- else:
159
- return patch_tokens
160
-
161
- def get_last_params(self):
162
- return []
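
To make the deleted audio branch's shape contract concrete, here is a minimal sketch, not part of the original repository, that runs the `Davenet` stack above on a dummy log-mel spectrogram. It assumes the class is in scope as defined in the deleted file and that its constructor exposes the `embedding_dim` argument used by `conv5`; the batch size, embedding width, and 1024-frame input length are arbitrary choices.

```python
import torch

# Hypothetical shape check for the Davenet audio trunk defined above.
# conv1 collapses the 40 mel bins to a single row, and the four (1, 3)/stride-2
# max pools shrink the time axis by 16x, so 1024 frames become 64 embeddings.
model = Davenet(embedding_dim=1024).eval()   # assumed constructor signature
spec = torch.randn(2, 40, 1024)              # (batch, mel_bins, frames); forward() unsqueezes to NCHW
with torch.no_grad():
    feats = model(spec)
print(feats.shape)                           # expected: torch.Size([2, 1024, 64])
```

The `DavenetAudioFeaturizer` wrapper above then re-expands this output with `unsqueeze(2)` into the (batch, channels, 1, time) layout that the other audio featurizers in this folder (for example the HuBERT wrapper below) also produce.
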
DenseAV/denseav/featurizers/DINO.py DELETED
@@ -1,451 +0,0 @@
1
- import math
2
- import warnings
3
- from functools import partial
4
-
5
- import timm
6
- import torch
7
- import torch.nn as nn
8
-
9
- eps = 1e-4
10
-
11
- def _no_grad_trunc_normal_(tensor, mean, std, a, b):
12
- # Cut & paste from PyTorch official master until it's in a few official releases - RW
13
- # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
14
- def norm_cdf(x):
15
- # Computes standard normal cumulative distribution function
16
- return (1. + math.erf(x / math.sqrt(2.))) / 2.
17
-
18
- if (mean < a - 2 * std) or (mean > b + 2 * std):
19
- warnings.warn("mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
20
- "The distribution of values may be incorrect.",
21
- stacklevel=2)
22
-
23
- with torch.no_grad():
24
- # Values are generated by using a truncated uniform distribution and
25
- # then using the inverse CDF for the normal distribution.
26
- # Get upper and lower cdf values
27
- l = norm_cdf((a - mean) / std)
28
- u = norm_cdf((b - mean) / std)
29
-
30
- # Uniformly fill tensor with values from [l, u], then translate to
31
- # [2l-1, 2u-1].
32
- tensor.uniform_(2 * l - 1, 2 * u - 1)
33
-
34
- # Use inverse cdf transform for normal distribution to get truncated
35
- # standard normal
36
- tensor.erfinv_()
37
-
38
- # Transform to proper mean, std
39
- tensor.mul_(std * math.sqrt(2.))
40
- tensor.add_(mean)
41
-
42
- # Clamp to ensure it's in the proper range
43
- tensor.clamp_(min=a, max=b)
44
- return tensor
45
-
46
-
47
- def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):
48
- # type: (Tensor, float, float, float, float) -> Tensor
49
- return _no_grad_trunc_normal_(tensor, mean, std, a, b)
50
-
51
-
52
-
53
- def drop_path(x, drop_prob: float = 0., training: bool = False):
54
- if drop_prob == 0. or not training:
55
- return x
56
- keep_prob = 1 - drop_prob
57
- shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
58
- random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
59
- random_tensor.floor_() # binarize
60
- output = x.div(keep_prob) * random_tensor
61
- return output
62
-
63
-
64
- class DropPath(nn.Module):
65
- """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
66
- """
67
-
68
- def __init__(self, drop_prob=None):
69
- super(DropPath, self).__init__()
70
- self.drop_prob = drop_prob
71
-
72
- def forward(self, x):
73
- return drop_path(x, self.drop_prob, self.training)
74
-
75
-
76
- class Mlp(nn.Module):
77
- def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
78
- super().__init__()
79
- out_features = out_features or in_features
80
- hidden_features = hidden_features or in_features
81
- self.fc1 = nn.Linear(in_features, hidden_features)
82
- self.act = act_layer()
83
- self.fc2 = nn.Linear(hidden_features, out_features)
84
- self.drop = nn.Dropout(drop)
85
-
86
- def forward(self, x):
87
- x = self.fc1(x)
88
- x = self.act(x)
89
- x = self.drop(x)
90
- x = self.fc2(x)
91
- x = self.drop(x)
92
- return x
93
-
94
-
95
- class Attention(nn.Module):
96
- def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
97
- super().__init__()
98
- self.num_heads = num_heads
99
- head_dim = dim // num_heads
100
- self.scale = qk_scale or head_dim ** -0.5
101
-
102
- self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
103
- self.attn_drop = nn.Dropout(attn_drop)
104
- self.proj = nn.Linear(dim, dim)
105
- self.proj_drop = nn.Dropout(proj_drop)
106
-
107
- def forward(self, x, return_qkv=False):
108
- B, N, C = x.shape
109
- qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
110
- q, k, v = qkv[0], qkv[1], qkv[2]
111
-
112
- attn = (q @ k.transpose(-2, -1)) * self.scale
113
- attn = attn.softmax(dim=-1)
114
- attn = self.attn_drop(attn)
115
-
116
- x = (attn @ v).transpose(1, 2).reshape(B, N, C)
117
- x = self.proj(x)
118
- x = self.proj_drop(x)
119
- return x, attn, qkv
120
-
121
-
122
- class Block(nn.Module):
123
- def __init__(self, dim,
124
- num_heads,
125
- mlp_ratio=4.,
126
- qkv_bias=False,
127
- qk_scale=None,
128
- drop=0.,
129
- attn_drop=0.,
130
- drop_path=0.,
131
- act_layer=nn.GELU,
132
- norm_layer=nn.LayerNorm):
133
- super().__init__()
134
- self.norm1 = norm_layer(dim)
135
- self.attn = Attention(
136
- dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
137
- self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
138
- self.norm2 = norm_layer(dim)
139
- mlp_hidden_dim = int(dim * mlp_ratio)
140
- self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
141
-
142
- def forward(self, x, return_attention=False, return_qkv=False):
143
- y, attn, qkv = self.attn(self.norm1(x))
144
- if return_attention:
145
- return attn
146
- x = x + self.drop_path(y)
147
- x = x + self.drop_path(self.mlp(self.norm2(x)))
148
- if return_qkv:
149
- return x, attn, qkv
150
- return x
151
-
152
-
153
- class PatchEmbed(nn.Module):
154
- """ Image to Patch Embedding
155
- """
156
-
157
- def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
158
- super().__init__()
159
- num_patches = (img_size // patch_size) * (img_size // patch_size)
160
- self.img_size = img_size
161
- self.patch_size = patch_size
162
- self.num_patches = num_patches
163
-
164
- self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
165
-
166
- def forward(self, x):
167
- B, C, H, W = x.shape
168
- x = self.proj(x).flatten(2).transpose(1, 2)
169
- return x
170
-
171
-
172
- class VisionTransformer(nn.Module):
173
- """ Vision Transformer """
174
-
175
- def __init__(self,
176
- img_size=[224],
177
- patch_size=16,
178
- in_chans=3,
179
- num_classes=0,
180
- embed_dim=768,
181
- depth=12,
182
- num_heads=12,
183
- mlp_ratio=4.,
184
- qkv_bias=False,
185
- qk_scale=None,
186
- drop_rate=0.,
187
- attn_drop_rate=0.,
188
- drop_path_rate=0.,
189
- norm_layer=nn.LayerNorm,
190
- **kwargs):
191
- super().__init__()
192
-
193
- self.num_features = self.embed_dim = embed_dim
194
-
195
- self.patch_embed = PatchEmbed(
196
- img_size=img_size[0], patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
197
- num_patches = self.patch_embed.num_patches
198
-
199
- self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
200
- self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
201
- self.pos_drop = nn.Dropout(p=drop_rate)
202
-
203
- dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
204
- self.blocks = nn.ModuleList([
205
- Block(
206
- dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
207
- drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer)
208
- for i in range(depth)])
209
- self.norm = norm_layer(embed_dim)
210
-
211
- # Classifier head
212
- self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
213
-
214
- trunc_normal_(self.pos_embed, std=.02)
215
- trunc_normal_(self.cls_token, std=.02)
216
- self.apply(self._init_weights)
217
-
218
- def _init_weights(self, m):
219
- if isinstance(m, nn.Linear):
220
- trunc_normal_(m.weight, std=.02)
221
- if isinstance(m, nn.Linear) and m.bias is not None:
222
- nn.init.constant_(m.bias, 0)
223
- elif isinstance(m, nn.LayerNorm):
224
- nn.init.constant_(m.bias, 0)
225
- nn.init.constant_(m.weight, 1.0)
226
-
227
- def interpolate_pos_encoding(self, x, w, h):
228
- npatch = x.shape[1] - 1
229
- N = self.pos_embed.shape[1] - 1
230
- if npatch == N and w == h:
231
- return self.pos_embed
232
- class_pos_embed = self.pos_embed[:, 0]
233
- patch_pos_embed = self.pos_embed[:, 1:]
234
- dim = x.shape[-1]
235
- w0 = w // self.patch_embed.patch_size
236
- h0 = h // self.patch_embed.patch_size
237
- # we add a small number to avoid floating point error in the interpolation
238
- # see discussion at https://github.com/facebookresearch/dino/issues/8
239
- w0, h0 = w0 + 0.1, h0 + 0.1
240
- patch_pos_embed = nn.functional.interpolate(
241
- patch_pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(0, 3, 1, 2),
242
- scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),
243
- mode='bicubic',
244
- )
245
- assert int(w0) == patch_pos_embed.shape[-2] and int(h0) == patch_pos_embed.shape[-1]
246
- patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).reshape(1, -1, dim)
247
- return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1)
248
-
249
- def prepare_tokens(self, x):
250
- B, nc, w, h = x.shape
251
- x = self.patch_embed(x) # patch linear embedding
252
-
253
- # add the [CLS] token to the embed patch tokens
254
- cls_tokens = self.cls_token.expand(B, -1, -1)
255
- x = torch.cat((cls_tokens, x), dim=1)
256
-
257
- # add positional encoding to each token
258
- x = x + self.interpolate_pos_encoding(x, w, h)
259
-
260
- return self.pos_drop(x)
261
-
262
- def forward(self, x):
263
- x = self.prepare_tokens(x)
264
- for blk in self.blocks:
265
- x = blk(x)
266
- x = self.norm(x)
267
- return x[:, 0]
268
-
269
- def forward_feats(self, x):
270
- x = self.prepare_tokens(x)
271
- for blk in self.blocks:
272
- x = blk(x)
273
- x = self.norm(x)
274
- return x
275
-
276
- def get_intermediate_feat(self, x, n=1, norm=True):
277
- x = self.prepare_tokens(x)
278
- # we return the output tokens from the `n` last blocks
279
- feat = []
280
- attns = []
281
- qkvs = []
282
- for i, blk in enumerate(self.blocks):
283
- x, attn, qkv = blk(x, return_qkv=True)
284
- if len(self.blocks) - i <= n:
285
- if norm:
286
- feat.append(self.norm(x))
287
- else:
288
- feat.append(x)
289
- qkvs.append(qkv)
290
- attns.append(attn)
291
- return feat, attns, qkvs
292
-
293
- def get_last_selfattention(self, x):
294
- x = self.prepare_tokens(x)
295
- for i, blk in enumerate(self.blocks):
296
- if i < len(self.blocks) - 1:
297
- x = blk(x)
298
- else:
299
- # return attention of the last block
300
- return blk(x, return_attention=True)
301
-
302
- def get_intermediate_layers(self, x, n=1):
303
- x = self.prepare_tokens(x)
304
- # we return the output tokens from the `n` last blocks
305
- output = []
306
- for i, blk in enumerate(self.blocks):
307
- x = blk(x)
308
- if len(self.blocks) - i <= n:
309
- output.append(self.norm(x))
310
- return output
311
-
312
-
313
- def vit_tiny(patch_size=16, **kwargs):
314
- model = VisionTransformer(
315
- patch_size=patch_size, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4,
316
- qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=eps), **kwargs)
317
- return model
318
-
319
-
320
- def vit_small(patch_size=16, **kwargs):
321
- model = VisionTransformer(
322
- patch_size=patch_size, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4,
323
- qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=eps), **kwargs)
324
- return model
325
-
326
-
327
- def vit_base(patch_size=16, **kwargs):
328
- model = VisionTransformer(
329
- patch_size=patch_size, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4,
330
- qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=eps), **kwargs)
331
- return model
332
-
333
-
334
- class DINOHead(nn.Module):
335
- def __init__(self, in_dim, out_dim, use_bn=False, norm_last_layer=True, nlayers=3, hidden_dim=2048,
336
- bottleneck_dim=256):
337
- super().__init__()
338
- nlayers = max(nlayers, 1)
339
- if nlayers == 1:
340
- self.mlp = nn.Linear(in_dim, bottleneck_dim)
341
- else:
342
- layers = [nn.Linear(in_dim, hidden_dim)]
343
- if use_bn:
344
- layers.append(nn.BatchNorm1d(hidden_dim))
345
- layers.append(nn.GELU())
346
- for _ in range(nlayers - 2):
347
- layers.append(nn.Linear(hidden_dim, hidden_dim))
348
- if use_bn:
349
- layers.append(nn.BatchNorm1d(hidden_dim))
350
- layers.append(nn.GELU())
351
- layers.append(nn.Linear(hidden_dim, bottleneck_dim))
352
- self.mlp = nn.Sequential(*layers)
353
- self.apply(self._init_weights)
354
- self.last_layer = nn.utils.weight_norm(nn.Linear(bottleneck_dim, out_dim, bias=False))
355
- self.last_layer.weight_g.data.fill_(1)
356
- if norm_last_layer:
357
- self.last_layer.weight_g.requires_grad = False
358
-
359
- def _init_weights(self, m):
360
- if isinstance(m, nn.Linear):
361
- trunc_normal_(m.weight, std=.02)
362
- if isinstance(m, nn.Linear) and m.bias is not None:
363
- nn.init.constant_(m.bias, 0)
364
-
365
- def forward(self, x):
366
- x = self.mlp(x)
367
- x = nn.functional.normalize(x, dim=-1, p=2)
368
- x = self.last_layer(x)
369
- return x
370
-
371
-
372
-
373
- class DINOFeaturizer(nn.Module):
374
-
375
- def __init__(self, arch, patch_size, feat_type):
376
- super().__init__()
377
- self.arch = arch
378
- self.patch_size = patch_size
379
- self.feat_type = feat_type
380
-
381
- self.config = {
382
- "arch": arch,
383
- "patch_size": patch_size,
384
- "feat_type": feat_type
385
- }
386
-
387
- self.model = vit_small(
388
- patch_size=patch_size,
389
- num_classes=0)
390
-
391
- if "3d-dino" in arch:
392
- state_dict = torch.load("../models/3d-dino-co3d.pth")["teacher"]
393
- state_dict = {k.replace("module.", "").replace("backbone.", ""): v for k, v in state_dict.items()}
394
- state_dict = {k: v for k, v in state_dict.items() if "head." not in k}
395
- elif "iarpa-dino" in arch:
396
- state_dict = torch.load("../models/dino_iarpa.pth")["teacher"]
397
- state_dict = {k.replace("module.", "").replace("backbone.", ""): v for k, v in state_dict.items()}
398
- state_dict = {k: v for k, v in state_dict.items() if "head." not in k}
399
- elif "chk-dino" in arch:
400
- state_dict = torch.load("../models/dino_deitsmall16_pretrain_full_checkpoint.pth")["teacher"]
401
- state_dict = {k.replace("module.", "").replace("backbone.", ""): v for k, v in state_dict.items()}
402
- state_dict = {k: v for k, v in state_dict.items() if "head." not in k}
403
- elif "ft_dino" in arch:
404
- arch = "_".join(arch.split("_")[:-1])
405
- state_dict = torch.load("../models/{}.pth".format(arch))["teacher"]
406
- state_dict = {k.replace("module.", "").replace("backbone.", ""): v for k, v in state_dict.items()}
407
- state_dict = {k: v for k, v in state_dict.items() if "head." not in k}
408
- elif "dino" in arch:
409
- state_dict = torch.hub.load('facebookresearch/dino:main', self.arch).state_dict()
410
- else: # model from timm -- load weights from timm to dino model (enables working on arbitrary size images).
411
- temp_model = timm.create_model(self.arch, pretrained=True)
412
- state_dict = temp_model.state_dict()
413
- del state_dict['head.weight']
414
- del state_dict['head.bias']
415
-
416
- self.model.load_state_dict(state_dict, strict=True)
417
-
418
- if arch == "vit_small":
419
- self.n_feats = 384
420
- else:
421
- self.n_feats = 768
422
-
423
- def get_cls_token(self, img):
424
- return self.model.forward(img)
425
-
426
- def forward(self, img, n=1, include_cls=False):
427
- assert (img.shape[2] % self.patch_size == 0)
428
- assert (img.shape[3] % self.patch_size == 0)
429
-
430
- feat, attn, qkv = self.model.get_intermediate_feat(img, n=n)
431
- feat, attn, qkv = feat[0], attn[0], qkv[0]
432
-
433
- feat_h = img.shape[2] // self.patch_size
434
- feat_w = img.shape[3] // self.patch_size
435
-
436
- if self.feat_type == "token":
437
- image_feat = feat[:, 1:, :].reshape(feat.shape[0], feat_h, feat_w, -1).permute(0, 3, 1, 2)
438
- cls_feat = feat[:, 0, :]
439
- elif self.feat_type == "key":
440
- x = qkv[1, :, :, 1:, :] # remove cls token
441
- desc = x.permute(0, 2, 3, 1).flatten(start_dim=-2, end_dim=-1)
442
- image_feat = desc.reshape(desc.shape[0], feat_h, feat_w, desc.shape[2]) \
443
- .permute(0, 3, 1, 2)
444
- cls_feat = None
445
- else:
446
- raise ValueError("Unknown feat type:{}".format(self.feat_type))
447
-
448
- if include_cls:
449
- return image_feat, cls_feat
450
-
451
- return image_feat
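
As a usage illustration that is not from the original repository, the following sketch exercises the `DINOFeaturizer` defined above with the public `dino_vits16` weights. It assumes the classes in this file are in scope and that `torch.hub` can download the checkpoint; the 224×224 input is an arbitrary choice, but height and width must be divisible by the patch size, as the assertions in `forward` require.

```python
import torch

# Hypothetical usage of the DINOFeaturizer above. "dino_vits16" takes the plain
# torch.hub branch of the weight-loading logic; feat_type="token" returns the
# per-patch tokens plus the CLS token, while feat_type="key" returns key features.
featurizer = DINOFeaturizer(arch="dino_vits16", patch_size=16, feat_type="token")
img = torch.randn(1, 3, 224, 224)            # height and width must be multiples of patch_size
with torch.no_grad():
    patch_feats, cls_feat = featurizer(img, include_cls=True)
print(patch_feats.shape)                     # expected: torch.Size([1, 384, 14, 14])
print(cls_feat.shape)                        # expected: torch.Size([1, 384])
```
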
DenseAV/denseav/featurizers/DINOv2.py DELETED
@@ -1,49 +0,0 @@
1
- import torch
2
- import torch.nn as nn
3
-
4
-
5
- class DINOv2Featurizer(nn.Module):
6
-
7
- def __init__(self):
8
- super().__init__()
9
- self.model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14').cuda()
10
- # self.model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
11
- self.model.eval()
12
- self.config = {}
13
-
14
- def get_cls_token(self, img):
15
- pass
16
-
17
- def forward(self, img, include_cls):
18
- feature_dict = self.model.forward_features(img)
19
- _, _, h, w = img.shape
20
- new_h, new_w = h // 14, w // 14
21
- b, _, c = feature_dict["x_norm_patchtokens"].shape
22
- spatial_tokens = feature_dict["x_norm_patchtokens"].permute(0, 2, 1).reshape(b, c, new_h, new_w)
23
-
24
- if include_cls:
25
- return spatial_tokens, feature_dict["x_norm_clstoken"]
26
- else:
27
- return spatial_tokens
28
-
29
-
30
- if __name__ == "__main__":
31
- import torchvision.transforms as T
32
- from PIL import Image
33
- from shared import norm, crop_to_divisor
34
-
35
- device = "cuda" if torch.cuda.is_available() else "cpu"
36
-
37
- image = Image.open("../../samples/dog_man_1_crop.jpg")
38
- load_size = 224 # * 3
39
- transform = T.Compose([
40
- T.Resize(load_size, Image.BILINEAR),
41
- T.CenterCrop(load_size),
42
- T.ToTensor(),
43
- norm])
44
-
45
- model = DINOv2Featurizer().cuda()
46
-
47
- results = model(transform(image).cuda().unsqueeze(0), include_cls=False)
48
-
49
- print(results.shape)
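
Complementing the `__main__` block above, here is a short, hypothetical shape check that was not part of the original file: `dinov2_vitb14` uses 14-pixel patches and a 768-dimensional embedding, so inputs need height and width divisible by 14, and a 224×224 image yields a 16×16 grid of patch features. A CUDA device and `torch.hub` access are assumed, since the constructor calls `.cuda()` and downloads the backbone.

```python
import torch

# Hypothetical shape check for the DINOv2Featurizer defined above.
model = DINOv2Featurizer()                   # downloads dinov2_vitb14 and moves it to the GPU
img = torch.randn(1, 3, 224, 224).cuda()     # 224 = 16 * 14, so the patch grid is 16 x 16
with torch.no_grad():
    spatial, cls_tok = model(img, include_cls=True)
print(spatial.shape, cls_tok.shape)          # expected: torch.Size([1, 768, 16, 16]) torch.Size([1, 768])
```
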
DenseAV/denseav/featurizers/Hubert.py DELETED
@@ -1,70 +0,0 @@
1
- import torch
2
- import torch.nn as nn
3
- from transformers import Wav2Vec2Processor, HubertModel, HubertConfig
4
- from transformers.pytorch_utils import Conv1D
5
-
6
- class HubertAudioTransform():
7
-
8
- def __init__(self):
9
- self.processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
10
-
11
- def __call__(self, audio):
12
- return self.processor(audio, return_tensors="pt", sampling_rate=16000).input_values.squeeze(0)
13
-
14
-
15
- def copy_conv(l):
16
- new_l = Conv1D()
17
-
18
-
19
- class Hubert(nn.Module):
20
- def __init__(self):
21
- super().__init__()
22
- model1 = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")
23
- config = model1.config
24
- del model1
25
- config.layer_norm_eps = 1e-4
26
- self.model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft", config=config)
27
- self.config = dict()
28
-
29
-
30
- def forward(self, audio, include_cls):
31
- outputs = self.model(audio)
32
- # outputs = deepspeed.checkpointing.checkpoint(self.model, audio)
33
-
34
- patch_tokens = outputs.last_hidden_state.permute(0, 2, 1).unsqueeze(2)
35
-
36
- # return patch_tokens
37
- if include_cls:
38
- return patch_tokens, None
39
- else:
40
- return patch_tokens
41
-
42
- def get_last_params(self):
43
- return self.model.encoder.layers[-1].parameters()
44
-
45
-
46
- if __name__ == "__main__":
47
- import librosa
48
- from shared import pca, remove_axes
49
- import matplotlib.pyplot as plt
50
- from pytorch_lightning import seed_everything
51
-
52
- audio, _ = librosa.load("../../samples/example.wav", sr=16000)
53
- audio = torch.from_numpy(audio).unsqueeze(0).to("cuda")
54
-
55
- model = Hubert().to("cuda")
56
- embeddings = model.forward(audio, include_cls=False)
57
-
58
- print(embeddings.shape)
59
- seed_everything(0)
60
-
61
- with torch.no_grad():
62
- [pca_feats], _ = pca([embeddings])
63
- pca_feats = torch.broadcast_to(
64
- pca_feats, (pca_feats.shape[0], pca_feats.shape[1], 25, pca_feats.shape[3]))
65
- fig, axes = plt.subplots(2, 1, figsize=(10, 7))
66
- axes[1].imshow(pca_feats.cpu().squeeze(0).permute(1, 2, 0))
67
- remove_axes(axes)
68
- plt.tight_layout()
69
- plt.show()
70
- print("here")
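
For reference, a minimal sketch, not in the original file, of what the `Hubert` wrapper above returns: `hubert-large-ls960-ft` produces 1024-dimensional hidden states at roughly a 20 ms frame rate, and the wrapper permutes and unsqueezes them into the same (batch, channels, 1, time) layout as the other audio featurizers. Downloading the checkpoint from the Hugging Face Hub is assumed, and the two-second input length is an arbitrary choice.

```python
import torch

# Hypothetical shape check for the Hubert wrapper defined above.
model = Hubert().eval()
audio = torch.randn(1, 32000)                # 2 s of 16 kHz waveform, as produced by HubertAudioTransform
with torch.no_grad():
    feats = model(audio, include_cls=False)
print(feats.shape)                           # expected: roughly torch.Size([1, 1024, 1, 99])
```
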
DenseAV/denseav/featurizers/ImageBind.py DELETED
@@ -1,2033 +0,0 @@
1
- import gzip
2
- import html
3
- import io
4
- import logging
5
- import math
6
- import os
7
- from functools import lru_cache
8
- from functools import partial
9
- from types import SimpleNamespace
10
- from typing import Callable, List
11
- from typing import Optional
12
-
13
- import einops
14
- import ftfy
15
- import numpy as np
16
- import regex as re
17
- import torch
18
- import torch.nn as nn
19
- import torch.nn.functional as F
20
- import torch.utils.checkpoint as checkpoint
21
- import torchaudio
22
- import torchvision.transforms as T
23
- from PIL import Image
24
- from timm.models.layers import DropPath, trunc_normal_
25
- from torchvision import transforms
26
- import matplotlib.pyplot as plt
27
- from iopath.common.file_io import g_pathmgr
28
-
29
-
30
- class Attention(nn.Module):
31
- def __init__(
32
- self,
33
- dim,
34
- num_heads=8,
35
- qkv_bias=False,
36
- qk_scale=None,
37
- attn_drop=0.0,
38
- proj_drop=0.0,
39
- ):
40
- super().__init__()
41
- self.num_heads = num_heads
42
- head_dim = dim // num_heads
43
- # NOTE scale factor was wrong in my original version,
44
- # can set manually to be compat with prev weights
45
- self.scale = qk_scale or head_dim ** -0.5
46
-
47
- self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
48
- self.attn_drop = nn.Dropout(attn_drop)
49
- self.proj = nn.Linear(dim, dim)
50
- self.proj_drop = nn.Dropout(proj_drop)
51
-
52
- def forward(self, x):
53
- B, N, C = x.shape
54
- qkv = (
55
- self.qkv(x)
56
- .reshape(B, N, 3, self.num_heads, C // self.num_heads)
57
- .permute(2, 0, 3, 1, 4)
58
- )
59
- q, k, v = (
60
- qkv[0],
61
- qkv[1],
62
- qkv[2],
63
- ) # make torchscript happy (cannot use tensor as tuple)
64
-
65
- attn = (q @ k.transpose(-2, -1)) * self.scale
66
- attn = attn.softmax(dim=-1)
67
- attn = self.attn_drop(attn)
68
-
69
- x = (attn @ v).transpose(1, 2).reshape(B, N, C)
70
- x = self.proj(x)
71
- x = self.proj_drop(x)
72
- return x
73
-
74
-
75
- class Mlp(nn.Module):
76
- def __init__(
77
- self,
78
- in_features,
79
- hidden_features=None,
80
- out_features=None,
81
- act_layer=nn.GELU,
82
- drop=0.0,
83
- ):
84
- super().__init__()
85
- out_features = out_features or in_features
86
- hidden_features = hidden_features or in_features
87
- self.fc1 = nn.Linear(in_features, hidden_features)
88
- self.act = act_layer()
89
- self.fc2 = nn.Linear(hidden_features, out_features)
90
- self.drop = nn.Dropout(drop)
91
-
92
- def forward(self, x):
93
- x = self.fc1(x)
94
- x = self.act(x)
95
- x = self.drop(x)
96
- x = self.fc2(x)
97
- x = self.drop(x)
98
- return x
99
-
100
-
101
- class MultiheadAttention(nn.MultiheadAttention):
102
- def forward(self, x: torch.Tensor, attn_mask: torch.Tensor):
103
- return super().forward(x, x, x, need_weights=False, attn_mask=attn_mask)[0]
104
-
105
-
106
- class ViTAttention(Attention):
107
- def forward(self, x: torch.Tensor, attn_mask: torch.Tensor):
108
- assert attn_mask is None
109
- return super().forward(x)
110
-
111
-
112
- class BlockWithMasking(nn.Module):
113
- def __init__(
114
- self,
115
- dim: int,
116
- attn_target: Callable,
117
- mlp_ratio: int = 4,
118
- act_layer: Callable = nn.GELU,
119
- norm_layer: Callable = nn.LayerNorm,
120
- ffn_dropout_rate: float = 0.0,
121
- drop_path: float = 0.0,
122
- layer_scale_type: str = None,
123
- layer_scale_init_value: float = 1e-4,
124
- ):
125
- super().__init__()
126
-
127
- assert not isinstance(
128
- attn_target, nn.Module
129
- ), "attn_target should be a Callable. Otherwise attn_target is shared across blocks!"
130
- self.attn = attn_target()
131
- if drop_path > 0.0:
132
- self.drop_path = DropPath(drop_path)
133
- else:
134
- self.drop_path = nn.Identity()
135
- self.norm_1 = norm_layer(dim)
136
- mlp_hidden_dim = int(mlp_ratio * dim)
137
- self.mlp = Mlp(
138
- in_features=dim,
139
- hidden_features=mlp_hidden_dim,
140
- act_layer=act_layer,
141
- drop=ffn_dropout_rate,
142
- )
143
- self.norm_2 = norm_layer(dim)
144
- self.layer_scale_type = layer_scale_type
145
- if self.layer_scale_type is not None:
146
- assert self.layer_scale_type in [
147
- "per_channel",
148
- "scalar",
149
- ], f"Found Layer scale type {self.layer_scale_type}"
150
- if self.layer_scale_type == "per_channel":
151
- # one gamma value per channel
152
- gamma_shape = [1, 1, dim]
153
- elif self.layer_scale_type == "scalar":
154
- # single gamma value for all channels
155
- gamma_shape = [1, 1, 1]
156
- # two gammas: for each part of the fwd in the encoder
157
- self.layer_scale_gamma1 = nn.Parameter(
158
- torch.ones(size=gamma_shape) * layer_scale_init_value,
159
- requires_grad=True,
160
- )
161
- self.layer_scale_gamma2 = nn.Parameter(
162
- torch.ones(size=gamma_shape) * layer_scale_init_value,
163
- requires_grad=True,
164
- )
165
-
166
- def forward(self, x: torch.Tensor, attn_mask: torch.Tensor):
167
- if self.layer_scale_type is None:
168
- x = x + self.drop_path(self.attn(self.norm_1(x), attn_mask))
169
- x = x + self.drop_path(self.mlp(self.norm_2(x)))
170
- else:
171
- x = (
172
- x
173
- + self.drop_path(self.attn(self.norm_1(x), attn_mask))
174
- * self.layer_scale_gamma1
175
- )
176
- x = x + self.drop_path(self.mlp(self.norm_2(x))) * self.layer_scale_gamma2
177
- return x
178
-
179
-
180
- _LAYER_NORM = partial(nn.LayerNorm, eps=1e-6)
181
-
182
-
183
- class SimpleTransformer(nn.Module):
184
- def __init__(
185
- self,
186
- attn_target: Callable,
187
- embed_dim: int,
188
- num_blocks: int,
189
- block: Callable = BlockWithMasking,
190
- pre_transformer_layer: Callable = None,
191
- post_transformer_layer: Callable = None,
192
- drop_path_rate: float = 0.0,
193
- drop_path_type: str = "progressive",
194
- norm_layer: Callable = _LAYER_NORM,
195
- mlp_ratio: int = 4,
196
- ffn_dropout_rate: float = 0.0,
197
- layer_scale_type: str = None, # from cait; possible values are None, "per_channel", "scalar"
198
- layer_scale_init_value: float = 1e-4, # from cait; float
199
- weight_init_style: str = "jax", # possible values jax or pytorch
200
- ):
201
- """
202
- Simple Transformer with the following features
203
- 1. Supports masked attention
204
- 2. Supports DropPath
205
- 3. Supports LayerScale
206
- 4. Supports Dropout in Attention and FFN
207
- 5. Makes few assumptions about the input except that it is a Tensor
208
- """
209
- super().__init__()
210
- self.pre_transformer_layer = pre_transformer_layer
211
- if drop_path_type == "progressive":
212
- dpr = [x.item() for x in torch.linspace(0, drop_path_rate, num_blocks)]
213
- elif drop_path_type == "uniform":
214
- dpr = [drop_path_rate for i in range(num_blocks)]
215
- else:
216
- raise ValueError(f"Unknown drop_path_type: {drop_path_type}")
217
-
218
- self.blocks = nn.Sequential(
219
- *[
220
- block(
221
- dim=embed_dim,
222
- attn_target=attn_target,
223
- mlp_ratio=mlp_ratio,
224
- ffn_dropout_rate=ffn_dropout_rate,
225
- drop_path=dpr[i],
226
- norm_layer=norm_layer,
227
- layer_scale_type=layer_scale_type,
228
- layer_scale_init_value=layer_scale_init_value,
229
- )
230
- for i in range(num_blocks)
231
- ]
232
- )
233
- self.post_transformer_layer = post_transformer_layer
234
- self.weight_init_style = weight_init_style
235
- self.apply(self._init_weights)
236
-
237
- def _init_weights(self, m):
238
- if isinstance(m, nn.Linear):
239
- if self.weight_init_style == "jax":
240
- # Based on MAE and official Jax ViT implementation
241
- torch.nn.init.xavier_uniform_(m.weight)
242
- elif self.weight_init_style == "pytorch":
243
- # PyTorch ViT uses trunc_normal_
244
- trunc_normal_(m.weight, std=0.02)
245
-
246
- if m.bias is not None:
247
- nn.init.constant_(m.bias, 0)
248
- elif isinstance(m, (nn.LayerNorm)):
249
- nn.init.constant_(m.bias, 0)
250
- nn.init.constant_(m.weight, 1.0)
251
-
252
- def forward(
253
- self,
254
- tokens: torch.Tensor,
255
- attn_mask: torch.Tensor = None,
256
- use_checkpoint: bool = False,
257
- checkpoint_every_n: int = 1,
258
- checkpoint_blk_ids: List[int] = None,
259
- ):
260
- """
261
- Inputs
262
- - tokens: data of shape N x L x D (or L x N x D depending on the attention implementation)
263
- - attn: mask of shape L x L
264
-
265
- Output
266
- - x: data of shape N x L x D (or L x N x D depending on the attention implementation)
267
- """
268
- if self.pre_transformer_layer:
269
- tokens = self.pre_transformer_layer(tokens)
270
- if use_checkpoint and checkpoint_blk_ids is None:
271
- checkpoint_blk_ids = [
272
- blk_id
273
- for blk_id in range(len(self.blocks))
274
- if blk_id % checkpoint_every_n == 0
275
- ]
276
- if checkpoint_blk_ids:
277
- checkpoint_blk_ids = set(checkpoint_blk_ids)
278
- for blk_id, blk in enumerate(self.blocks):
279
- if use_checkpoint and blk_id in checkpoint_blk_ids:
280
- tokens = checkpoint.checkpoint(
281
- blk, tokens, attn_mask, use_reentrant=False
282
- )
283
- else:
284
- tokens = blk(tokens, attn_mask=attn_mask)
285
- if self.post_transformer_layer:
286
- tokens = self.post_transformer_layer(tokens)
287
- return tokens
288
-
289
-
290
- def get_sinusoid_encoding_table(n_position, d_hid):
291
- """Sinusoid position encoding table"""
292
-
293
- # TODO: make it with torch instead of numpy
294
- def get_position_angle_vec(position):
295
- return [
296
- position / np.power(10000, 2 * (hid_j // 2) / d_hid)
297
- for hid_j in range(d_hid)
298
- ]
299
-
300
- sinusoid_table = np.array(
301
- [get_position_angle_vec(pos_i) for pos_i in range(n_position)]
302
- )
303
- sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
304
- sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
305
-
306
- return torch.FloatTensor(sinusoid_table).unsqueeze(0)
307
-
308
-
309
- def interpolate_pos_encoding_2d(target_spatial_size, pos_embed):
310
- N = pos_embed.shape[1]
311
- if N == target_spatial_size:
312
- return pos_embed
313
- dim = pos_embed.shape[-1]
314
- # nn.functional.interpolate doesn't work with bfloat16 so we cast to float32
315
- pos_embed, updated = cast_if_src_dtype(pos_embed, torch.bfloat16, torch.float32)
316
- pos_embed = nn.functional.interpolate(
317
- pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(
318
- 0, 3, 1, 2
319
- ),
320
- scale_factor=math.sqrt(target_spatial_size / N),
321
- mode="bicubic",
322
- )
323
- if updated:
324
- pos_embed, _ = cast_if_src_dtype(pos_embed, torch.float32, torch.bfloat16)
325
- pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
326
- return pos_embed
327
-
328
-
329
- def interpolate_pos_encoding(
330
- npatch_per_img,
331
- pos_embed,
332
- patches_layout,
333
- input_shape=None,
334
- first_patch_idx=1,
335
- ):
336
- assert first_patch_idx == 0 or first_patch_idx == 1, "there is 1 CLS token or none"
337
- N = pos_embed.shape[1] - first_patch_idx # since it's 1 if cls_token exists
338
- if npatch_per_img == N:
339
- return pos_embed
340
-
341
- # assert (
342
- # patches_layout[-1] == patches_layout[-2]
343
- # ), "Interpolation of pos embed not supported for non-square layouts"
344
-
345
- class_emb = pos_embed[:, :first_patch_idx]
346
- pos_embed = pos_embed[:, first_patch_idx:]
347
-
348
- if input_shape is None or patches_layout[0] == 1:
349
- # simple 2D pos embedding, no temporal component
350
- pos_embed = interpolate_pos_encoding_2d(npatch_per_img, pos_embed)
351
- elif patches_layout[0] > 1:
352
- # pos embed has a temporal component
353
- assert len(input_shape) == 4, "temporal interpolation not supported"
354
- # we only support 2D interpolation in this case
355
- num_frames = patches_layout[0]
356
- num_spatial_tokens = patches_layout[1] * patches_layout[2]
357
- pos_embed = pos_embed.view(1, num_frames, num_spatial_tokens, -1)
358
- # interpolate embedding for zeroth frame
359
- pos_embed = interpolate_pos_encoding_2d(
360
- npatch_per_img, pos_embed[0, 0, ...].unsqueeze(0)
361
- )
362
- else:
363
- raise ValueError("This type of interpolation isn't implemented")
364
-
365
- return torch.cat((class_emb, pos_embed), dim=1)
366
-
367
-
368
- def _get_pos_embedding(
369
- npatch_per_img,
370
- pos_embed,
371
- patches_layout,
372
- input_shape,
373
- first_patch_idx=1,
374
- ):
375
- pos_embed = interpolate_pos_encoding(
376
- npatch_per_img,
377
- pos_embed,
378
- patches_layout,
379
- input_shape=input_shape,
380
- first_patch_idx=first_patch_idx,
381
- )
382
- return pos_embed
383
-
384
-
385
- class VerboseNNModule(nn.Module):
386
- """
387
- Wrapper around nn.Module that prints registered buffers and parameter names.
388
- """
389
-
390
- @staticmethod
391
- def get_readable_tensor_repr(name: str, tensor: torch.Tensor) -> str:
392
- st = (
393
- "("
394
- + name
395
- + "): "
396
- + "tensor("
397
- + str(tuple(tensor[1].shape))
398
- + ", requires_grad="
399
- + str(tensor[1].requires_grad)
400
- + ")\n"
401
- )
402
- return st
403
-
404
- def extra_repr(self) -> str:
405
- named_modules = set()
406
- for p in self.named_modules():
407
- named_modules.update([p[0]])
408
- named_modules = list(named_modules)
409
-
410
- string_repr = ""
411
- for p in self.named_parameters():
412
- name = p[0].split(".")[0]
413
- if name not in named_modules:
414
- string_repr += self.get_readable_tensor_repr(name, p)
415
-
416
- for p in self.named_buffers():
417
- name = p[0].split(".")[0]
418
- string_repr += self.get_readable_tensor_repr(name, p)
419
-
420
- return string_repr
421
-
422
-
423
- class PatchEmbedGeneric(nn.Module):
424
- """
425
- PatchEmbed from Hydra
426
- """
427
-
428
- def __init__(self, proj_stem, norm_layer: Optional[nn.Module] = None):
429
- super().__init__()
430
-
431
- if len(proj_stem) > 1:
432
- self.proj = nn.Sequential(*proj_stem)
433
- else:
434
- # Special case to be able to load pre-trained models that were
435
- # trained with a standard stem
436
- self.proj = proj_stem[0]
437
- self.norm_layer = norm_layer
438
-
439
- def get_patch_layout(self, img_size):
440
- with torch.no_grad():
441
- dummy_img = torch.zeros(
442
- [
443
- 1,
444
- ]
445
- + img_size
446
- )
447
- dummy_out = self.proj(dummy_img)
448
- embed_dim = dummy_out.shape[1]
449
- patches_layout = tuple(dummy_out.shape[2:])
450
- num_patches = np.prod(patches_layout)
451
- return patches_layout, num_patches, embed_dim
452
-
453
- def forward(self, x):
454
- x = self.proj(x)
455
- # B C (T) H W -> B (T)HW C
456
- x = x.flatten(2).transpose(1, 2)
457
- if self.norm_layer is not None:
458
- x = self.norm_layer(x)
459
- return x
460
-
461
-
462
- class SpatioTemporalPosEmbeddingHelper(VerboseNNModule):
463
- def __init__(
464
- self,
465
- patches_layout: List,
466
- num_patches: int,
467
- num_cls_tokens: int,
468
- embed_dim: int,
469
- learnable: bool,
470
- ) -> None:
471
- super().__init__()
472
- self.num_cls_tokens = num_cls_tokens
473
- self.patches_layout = patches_layout
474
- self.num_patches = num_patches
475
- self.num_tokens = num_cls_tokens + num_patches
476
- self.learnable = learnable
477
- if self.learnable:
478
- self.pos_embed = nn.Parameter(torch.zeros(1, self.num_tokens, embed_dim))
479
- trunc_normal_(self.pos_embed, std=0.02)
480
- else:
481
- self.register_buffer(
482
- "pos_embed", get_sinusoid_encoding_table(self.num_tokens, embed_dim)
483
- )
484
-
485
- def get_pos_embedding(self, vision_input, all_vision_tokens):
486
- input_shape = vision_input.shape
487
- pos_embed = _get_pos_embedding(
488
- all_vision_tokens.size(1) - self.num_cls_tokens,
489
- pos_embed=self.pos_embed,
490
- patches_layout=self.patches_layout,
491
- input_shape=input_shape,
492
- first_patch_idx=self.num_cls_tokens,
493
- )
494
- return pos_embed
495
-
496
-
497
- class RGBDTPreprocessor(VerboseNNModule):
498
- def __init__(
499
- self,
500
- rgbt_stem: PatchEmbedGeneric,
501
- depth_stem: PatchEmbedGeneric,
502
- img_size: List = (3, 224, 224),
503
- num_cls_tokens: int = 1,
504
- pos_embed_fn: Callable = None,
505
- use_type_embed: bool = False,
506
- init_param_style: str = "openclip",
507
- ) -> None:
508
- super().__init__()
509
- stem = rgbt_stem if rgbt_stem is not None else depth_stem
510
- (
511
- self.patches_layout,
512
- self.num_patches,
513
- self.embed_dim,
514
- ) = stem.get_patch_layout(img_size)
515
- self.rgbt_stem = rgbt_stem
516
- self.depth_stem = depth_stem
517
- self.use_pos_embed = pos_embed_fn is not None
518
- self.use_type_embed = use_type_embed
519
- self.num_cls_tokens = num_cls_tokens
520
-
521
- if self.use_pos_embed:
522
- self.pos_embedding_helper = pos_embed_fn(
523
- patches_layout=self.patches_layout,
524
- num_cls_tokens=num_cls_tokens,
525
- num_patches=self.num_patches,
526
- embed_dim=self.embed_dim,
527
- )
528
- if self.num_cls_tokens > 0:
529
- self.cls_token = nn.Parameter(
530
- torch.zeros(1, self.num_cls_tokens, self.embed_dim)
531
- )
532
- if self.use_type_embed:
533
- self.type_embed = nn.Parameter(torch.zeros(1, 1, self.embed_dim))
534
-
535
- self.init_parameters(init_param_style)
536
-
537
- @torch.no_grad()
538
- def init_parameters(self, init_param_style):
539
- if init_param_style == "openclip":
540
- # OpenCLIP style initialization
541
- scale = self.embed_dim ** -0.5
542
- if self.use_pos_embed:
543
- nn.init.normal_(self.pos_embedding_helper.pos_embed)
544
- self.pos_embedding_helper.pos_embed *= scale
545
-
546
- if self.num_cls_tokens > 0:
547
- nn.init.normal_(self.cls_token)
548
- self.cls_token *= scale
549
- elif init_param_style == "vit":
550
- self.cls_token.data.fill_(0)
551
- else:
552
- raise ValueError(f"Unknown init {init_param_style}")
553
-
554
- if self.use_type_embed:
555
- nn.init.normal_(self.type_embed)
556
-
557
- def get_pos_emb_2(self, input, stem):
558
- patches = stem.proj(input)
559
- target_size = patches.shape[-2:]
560
- original_size = list(self.pos_embedding_helper.patches_layout)[-2:]
561
-
562
- orig_ce = self.pos_embedding_helper.pos_embed[:, 0, :]
563
- orig_pe = ((self.pos_embedding_helper.pos_embed[:, 1:, :]
564
- .reshape(1, *original_size, self.embed_dim))
565
- .permute(0, 3, 1, 2))
566
-
567
- new_pe = F.interpolate(orig_pe, size=target_size, mode="bicubic")
568
-
569
- new_full_pe = torch.cat([orig_ce.unsqueeze(1), new_pe.permute(0, 2, 3, 1).reshape(1, -1, self.embed_dim)],
570
- dim=1)
571
-
572
- return new_full_pe
573
-
574
- def tokenize_input_and_cls_pos(self, input, stem, mask):
575
- # tokens is of shape B x L x D
576
- tokens = stem(input)
577
- assert tokens.ndim == 3
578
- assert tokens.shape[2] == self.embed_dim
579
- B = tokens.shape[0]
580
- if self.num_cls_tokens > 0:
581
- class_tokens = self.cls_token.expand(
582
- B, -1, -1
583
- ) # stole class_tokens impl from Phil Wang, thanks
584
- tokens = torch.cat((class_tokens, tokens), dim=1)
585
- if self.use_pos_embed:
586
- pos_embed = self.pos_embedding_helper.get_pos_embedding(input, tokens)
587
- # pos_embed = self.get_pos_emb_2(input, stem)
588
- tokens = tokens + pos_embed
589
- if self.use_type_embed:
590
- tokens = tokens + self.type_embed.expand(B, -1, -1)
591
- return tokens
592
-
593
- def forward(self, vision=None, depth=None, patch_mask=None):
594
- if patch_mask is not None:
595
- raise NotImplementedError()
596
-
597
- if vision is not None:
598
- vision_tokens = self.tokenize_input_and_cls_pos(
599
- vision, self.rgbt_stem, patch_mask
600
- )
601
-
602
- if depth is not None:
603
- depth_tokens = self.tokenize_input_and_cls_pos(
604
- depth, self.depth_stem, patch_mask
605
- )
606
-
607
- # aggregate tokens
608
- if vision is not None and depth is not None:
609
- final_tokens = vision_tokens + depth_tokens
610
- else:
611
- final_tokens = vision_tokens if vision is not None else depth_tokens
612
- return_dict = {
613
- "trunk": {
614
- "tokens": final_tokens,
615
- },
616
- "head": {},
617
- }
618
- return return_dict
619
-
620
-
621
- class AudioPreprocessor(RGBDTPreprocessor):
622
- def __init__(self, audio_stem: PatchEmbedGeneric, **kwargs) -> None:
623
- super().__init__(rgbt_stem=audio_stem, depth_stem=None, **kwargs)
624
-
625
- def forward(self, audio=None):
626
- return super().forward(vision=audio)
627
-
628
-
629
- class ThermalPreprocessor(RGBDTPreprocessor):
630
- def __init__(self, thermal_stem: PatchEmbedGeneric, **kwargs) -> None:
631
- super().__init__(rgbt_stem=thermal_stem, depth_stem=None, **kwargs)
632
-
633
- def forward(self, thermal=None):
634
- return super().forward(vision=thermal)
635
-
636
-
637
- def build_causal_attention_mask(context_length):
638
- # lazily create causal attention mask, with full attention between the vision tokens
639
- # pytorch uses additive attention mask; fill with -inf
640
- mask = torch.empty(context_length, context_length, requires_grad=False)
641
- mask.fill_(float("-inf"))
642
- mask.triu_(1) # zero out the lower diagonal
643
- return mask
644
-
645
-
646
- class TextPreprocessor(VerboseNNModule):
647
- def __init__(
648
- self,
649
- vocab_size: int,
650
- context_length: int,
651
- embed_dim: int,
652
- causal_masking: bool,
653
- supply_seq_len_to_head: bool = True,
654
- num_cls_tokens: int = 0,
655
- init_param_style: str = "openclip",
656
- ) -> None:
657
- super().__init__()
658
- self.vocab_size = vocab_size
659
- self.context_length = context_length
660
- self.token_embedding = nn.Embedding(vocab_size, embed_dim)
661
- self.pos_embed = nn.Parameter(
662
- torch.empty(1, self.context_length + num_cls_tokens, embed_dim)
663
- )
664
- self.causal_masking = causal_masking
665
- if self.causal_masking:
666
- mask = build_causal_attention_mask(self.context_length)
667
- # register the mask as a buffer, so it can be moved to the right device
668
- self.register_buffer("mask", mask)
669
-
670
- self.supply_seq_len_to_head = supply_seq_len_to_head
671
- self.num_cls_tokens = num_cls_tokens
672
- self.embed_dim = embed_dim
673
- if num_cls_tokens > 0:
674
- assert self.causal_masking is False, "Masking + CLS token isn't implemented"
675
- self.cls_token = nn.Parameter(
676
- torch.zeros(1, self.num_cls_tokens, embed_dim)
677
- )
678
-
679
- self.init_parameters(init_param_style)
680
-
681
- @torch.no_grad()
682
- def init_parameters(self, init_param_style="openclip"):
683
- # OpenCLIP style initialization
684
- nn.init.normal_(self.token_embedding.weight, std=0.02)
685
- nn.init.normal_(self.pos_embed, std=0.01)
686
-
687
- if init_param_style == "openclip":
688
- # OpenCLIP style initialization
689
- scale = self.embed_dim ** -0.5
690
- if self.num_cls_tokens > 0:
691
- nn.init.normal_(self.cls_token)
692
- self.cls_token *= scale
693
- elif init_param_style == "vit":
694
- self.cls_token.data.fill_(0)
695
- else:
696
- raise ValueError(f"Unknown init {init_param_style}")
697
-
698
- def forward(self, text):
699
- # text tokens are of shape B x L x D
700
- text_tokens = self.token_embedding(text)
701
- # concat CLS tokens if any
702
- if self.num_cls_tokens > 0:
703
- B = text_tokens.shape[0]
704
- class_tokens = self.cls_token.expand(
705
- B, -1, -1
706
- ) # stole class_tokens impl from Phil Wang, thanks
707
- text_tokens = torch.cat((class_tokens, text_tokens), dim=1)
708
- text_tokens = text_tokens + self.pos_embed
709
- return_dict = {
710
- "trunk": {
711
- "tokens": text_tokens,
712
- },
713
- "head": {},
714
- }
715
- # Compute sequence length after adding CLS tokens
716
- if self.supply_seq_len_to_head:
717
- text_lengths = text.argmax(dim=-1)
718
- return_dict["head"] = {
719
- "seq_len": text_lengths,
720
- }
721
- if self.causal_masking:
722
- return_dict["trunk"].update({"attn_mask": self.mask})
723
- return return_dict
724
-
725
-
726
- class Im2Video(nn.Module):
727
- """Convert an image into a trivial video."""
728
-
729
- def __init__(self, time_dim=2):
730
- super().__init__()
731
- self.time_dim = time_dim
732
-
733
- def forward(self, x):
734
- if x.ndim == 4:
735
- # B, C, H, W -> B, C, T, H, W
736
- return x.unsqueeze(self.time_dim)
737
- elif x.ndim == 5:
738
- return x
739
- else:
740
- raise ValueError(f"Dimension incorrect {x.shape}")
741
-
742
-
743
- class PadIm2Video(Im2Video):
744
- def __init__(self, ntimes, pad_type, time_dim=2):
745
- super().__init__(time_dim=time_dim)
746
- assert ntimes > 0
747
- assert pad_type in ["zero", "repeat"]
748
- self.ntimes = ntimes
749
- self.pad_type = pad_type
750
-
751
- def forward(self, x):
752
- x = super().forward(x)
753
- if x.shape[self.time_dim] == 1:
754
- if self.pad_type == "repeat":
755
- new_shape = [1] * len(x.shape)
756
- new_shape[self.time_dim] = self.ntimes
757
- x = x.repeat(new_shape)
758
- elif self.pad_type == "zero":
759
- padarg = [0, 0] * len(x.shape)
760
- padarg[2 * self.time_dim + 1] = self.ntimes - x.shape[self.time_dim]
761
- x = nn.functional.pad(x, padarg)
762
- return x
763
-
764
-
765
- # Modified from github.com/openai/CLIP
766
- @lru_cache()
767
- def bytes_to_unicode():
768
- """
769
- Returns list of utf-8 byte and a corresponding list of unicode strings.
770
- The reversible bpe codes work on unicode strings.
771
- This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
772
- When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
773
- This is a signficant percentage of your normal, say, 32K bpe vocab.
774
- To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
775
- And avoids mapping to whitespace/control characters the bpe code barfs on.
776
- """
777
- bs = (
778
- list(range(ord("!"), ord("~") + 1))
779
- + list(range(ord("¡"), ord("¬") + 1))
780
- + list(range(ord("®"), ord("ÿ") + 1))
781
- )
782
- cs = bs[:]
783
- n = 0
784
- for b in range(2 ** 8):
785
- if b not in bs:
786
- bs.append(b)
787
- cs.append(2 ** 8 + n)
788
- n += 1
789
- cs = [chr(n) for n in cs]
790
- return dict(zip(bs, cs))
791
-
792
-
793
- def get_pairs(word):
794
- """Return set of symbol pairs in a word.
795
- Word is represented as tuple of symbols (symbols being variable-length strings).
796
- """
797
- pairs = set()
798
- prev_char = word[0]
799
- for char in word[1:]:
800
- pairs.add((prev_char, char))
801
- prev_char = char
802
- return pairs
803
-
804
-
805
- def basic_clean(text):
806
- text = ftfy.fix_text(text)
807
- text = html.unescape(html.unescape(text))
808
- return text.strip()
809
-
810
-
811
- def whitespace_clean(text):
812
- text = re.sub(r"\s+", " ", text)
813
- text = text.strip()
814
- return text
815
-
816
-
817
- class SimpleTokenizer(object):
818
- def __init__(self, bpe_path: str, context_length=77):
819
- self.byte_encoder = bytes_to_unicode()
820
- self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
821
-
822
- with g_pathmgr.open(bpe_path, "rb") as fh:
823
- bpe_bytes = io.BytesIO(fh.read())
824
- merges = gzip.open(bpe_bytes).read().decode("utf-8").split("\n")
825
- merges = merges[1: 49152 - 256 - 2 + 1]
826
- merges = [tuple(merge.split()) for merge in merges]
827
- vocab = list(bytes_to_unicode().values())
828
- vocab = vocab + [v + "</w>" for v in vocab]
829
- for merge in merges:
830
- vocab.append("".join(merge))
831
- vocab.extend(["<|startoftext|>", "<|endoftext|>"])
832
- self.encoder = dict(zip(vocab, range(len(vocab))))
833
- self.decoder = {v: k for k, v in self.encoder.items()}
834
- self.bpe_ranks = dict(zip(merges, range(len(merges))))
835
- self.cache = {
836
- "<|startoftext|>": "<|startoftext|>",
837
- "<|endoftext|>": "<|endoftext|>",
838
- }
839
- self.pat = re.compile(
840
- r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
841
- re.IGNORECASE,
842
- )
843
- self.context_length = context_length
844
-
845
- def bpe(self, token):
846
- if token in self.cache:
847
- return self.cache[token]
848
- word = tuple(token[:-1]) + (token[-1] + "</w>",)
849
- pairs = get_pairs(word)
850
-
851
- if not pairs:
852
- return token + "</w>"
853
-
854
- while True:
855
- bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
856
- if bigram not in self.bpe_ranks:
857
- break
858
- first, second = bigram
859
- new_word = []
860
- i = 0
861
- while i < len(word):
862
- try:
863
- j = word.index(first, i)
864
- new_word.extend(word[i:j])
865
- i = j
866
- except:
867
- new_word.extend(word[i:])
868
- break
869
-
870
- if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
871
- new_word.append(first + second)
872
- i += 2
873
- else:
874
- new_word.append(word[i])
875
- i += 1
876
- new_word = tuple(new_word)
877
- word = new_word
878
- if len(word) == 1:
879
- break
880
- else:
881
- pairs = get_pairs(word)
882
- word = " ".join(word)
883
- self.cache[token] = word
884
- return word
885
-
886
- def encode(self, text):
887
- bpe_tokens = []
888
- text = whitespace_clean(basic_clean(text)).lower()
889
- for token in re.findall(self.pat, text):
890
- token = "".join(self.byte_encoder[b] for b in token.encode("utf-8"))
891
- bpe_tokens.extend(
892
- self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
893
- )
894
- return bpe_tokens
895
-
896
- def decode(self, tokens):
897
- text = "".join([self.decoder[token] for token in tokens])
898
- text = (
899
- bytearray([self.byte_decoder[c] for c in text])
900
- .decode("utf-8", errors="replace")
901
- .replace("</w>", " ")
902
- )
903
- return text
904
-
905
- def __call__(self, texts, context_length=None):
906
- if not context_length:
907
- context_length = self.context_length
908
-
909
- if isinstance(texts, str):
910
- texts = [texts]
911
-
912
- sot_token = self.encoder["<|startoftext|>"]
913
- eot_token = self.encoder["<|endoftext|>"]
914
- all_tokens = [[sot_token] + self.encode(text) + [eot_token] for text in texts]
915
- result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)
916
-
917
- for i, tokens in enumerate(all_tokens):
918
- tokens = tokens[:context_length]
919
- result[i, : len(tokens)] = torch.tensor(tokens)
920
-
921
- if len(result) == 1:
922
- return result[0]
923
- return result
924
-
925
-
926
- class Normalize(nn.Module):
927
- def __init__(self, dim: int) -> None:
928
- super().__init__()
929
- self.dim = dim
930
-
931
- def forward(self, x):
932
- return torch.nn.functional.normalize(x, dim=self.dim, p=2)
933
-
934
-
935
- class LearnableLogitScaling(nn.Module):
936
- def __init__(
937
- self,
938
- logit_scale_init: float = 1 / 0.07,
939
- learnable: bool = True,
940
- max_logit_scale: float = 100,
941
- ) -> None:
942
- super().__init__()
943
- self.max_logit_scale = max_logit_scale
944
- self.logit_scale_init = logit_scale_init
945
- self.learnable = learnable
946
- log_logit_scale = torch.ones([]) * np.log(self.logit_scale_init)
947
- if learnable:
948
- self.log_logit_scale = nn.Parameter(log_logit_scale)
949
- else:
950
- self.register_buffer("log_logit_scale", log_logit_scale)
951
-
952
- def forward(self, x):
953
- return torch.clip(self.log_logit_scale.exp(), max=self.max_logit_scale) * x
954
-
955
- def extra_repr(self):
956
- st = f"logit_scale_init={self.logit_scale_init},learnable={self.learnable}, max_logit_scale={self.max_logit_scale}"
957
- return st
958
-
959
-
960
- class EinOpsRearrange(nn.Module):
961
- def __init__(self, rearrange_expr: str, **kwargs) -> None:
962
- super().__init__()
963
- self.rearrange_expr = rearrange_expr
964
- self.kwargs = kwargs
965
-
966
- def forward(self, x):
967
- assert isinstance(x, torch.Tensor)
968
- return einops.rearrange(x, self.rearrange_expr, **self.kwargs)
969
-
970
-
971
- class IMUPreprocessor(VerboseNNModule):
972
- def __init__(
973
- self,
974
- kernel_size: int,
975
- imu_stem: PatchEmbedGeneric,
976
- embed_dim: int,
977
- img_size: List = (6, 2000),
978
- num_cls_tokens: int = 1,
979
- pos_embed_fn: Callable = None,
980
- init_param_style: str = "openclip",
981
- ) -> None:
982
- super().__init__()
983
- stem = imu_stem
984
- self.imu_stem = imu_stem
985
- self.embed_dim = embed_dim
986
- self.use_pos_embed = pos_embed_fn is not None
987
- self.num_cls_tokens = num_cls_tokens
988
- self.kernel_size = kernel_size
989
- self.pos_embed = nn.Parameter(
990
- torch.empty(1, (img_size[1] // kernel_size) + num_cls_tokens, embed_dim)
991
- )
992
-
993
- if self.num_cls_tokens > 0:
994
- self.cls_token = nn.Parameter(
995
- torch.zeros(1, self.num_cls_tokens, self.embed_dim)
996
- )
997
-
998
- self.init_parameters(init_param_style)
999
-
1000
- @torch.no_grad()
1001
- def init_parameters(self, init_param_style):
1002
- nn.init.normal_(self.pos_embed, std=0.01)
1003
-
1004
- if init_param_style == "openclip":
1005
- # OpenCLIP style initialization
1006
- scale = self.embed_dim ** -0.5
1007
-
1008
- if self.num_cls_tokens > 0:
1009
- nn.init.normal_(self.cls_token)
1010
- self.cls_token *= scale
1011
- elif init_param_style == "vit":
1012
- self.cls_token.data.fill_(0)
1013
- else:
1014
- raise ValueError(f"Unknown init {init_param_style}")
1015
-
1016
- def tokenize_input_and_cls_pos(self, input, stem):
1017
- # tokens is of shape B x L x D
1018
- tokens = stem.norm_layer(stem.proj(input))
1019
- assert tokens.ndim == 3
1020
- assert tokens.shape[2] == self.embed_dim
1021
- B = tokens.shape[0]
1022
- if self.num_cls_tokens > 0:
1023
- class_tokens = self.cls_token.expand(
1024
- B, -1, -1
1025
- ) # stole class_tokens impl from Phil Wang, thanks
1026
- tokens = torch.cat((class_tokens, tokens), dim=1)
1027
- if self.use_pos_embed:
1028
- tokens = tokens + self.pos_embed
1029
- return tokens
1030
-
1031
- def forward(self, imu):
1032
- # Patchify
1033
- imu = imu.unfold(
1034
- -1,
1035
- self.kernel_size,
1036
- self.kernel_size,
1037
- ).permute(0, 2, 1, 3)
1038
- imu = imu.reshape(imu.size(0), imu.size(1), -1)
1039
-
1040
- imu_tokens = self.tokenize_input_and_cls_pos(
1041
- imu,
1042
- self.imu_stem,
1043
- )
1044
-
1045
- return_dict = {
1046
- "trunk": {
1047
- "tokens": imu_tokens,
1048
- },
1049
- "head": {},
1050
- }
1051
- return return_dict
1052
-
1053
-
1054
- def cast_if_src_dtype(
1055
- tensor: torch.Tensor, src_dtype: torch.dtype, tgt_dtype: torch.dtype
1056
- ):
1057
- updated = False
1058
- if tensor.dtype == src_dtype:
1059
- tensor = tensor.to(dtype=tgt_dtype)
1060
- updated = True
1061
- return tensor, updated
1062
-
1063
-
1064
- class QuickGELU(nn.Module):
1065
- # From https://github.com/openai/CLIP/blob/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1/clip/model.py#L166
1066
- def forward(self, x: torch.Tensor):
1067
- return x * torch.sigmoid(1.702 * x)
1068
-
1069
-
1070
- class SelectElement(nn.Module):
1071
- def __init__(self, index) -> None:
1072
- super().__init__()
1073
- self.index = index
1074
-
1075
- def forward(self, x):
1076
- assert x.ndim >= 3
1077
- return x[:, self.index, ...]
1078
-
1079
-
1080
- class ReshapeSpatial(nn.Module):
1081
- def __init__(self, shape) -> None:
1082
- super().__init__()
1083
- self.h, self.w = shape
1084
-
1085
- def forward(self, x):
1086
- assert x.ndim >= 3
1087
- return x[:, 1:, ...].reshape(x.shape[0], self.h, self.w, -1), x[:, 0, :]
1088
-
1089
-
1090
- class ReshapeAudio(nn.Module):
1091
- def __init__(self, shape) -> None:
1092
- super().__init__()
1093
- self.h, self.w = shape
1094
-
1095
- def forward(self, x):
1096
- assert x.ndim == 3
1097
- return x[:, 1:, :].reshape(-1, 5, self.h, self.w, x.shape[-1]), x[:, 0, :]
1098
-
1099
-
1100
- class ApplyTwice(nn.Module):
1101
- def __init__(self, module) -> None:
1102
- super().__init__()
1103
- self.module = module
1104
-
1105
- def forward(self, pair):
1106
- return self.module(pair[0]), self.module(pair[1])
1107
-
1108
-
1109
- class SelectEOSAndProject(nn.Module):
1110
- """
1111
- Text Pooling used in OpenCLIP
1112
- """
1113
-
1114
- def __init__(self, proj: nn.Module) -> None:
1115
- super().__init__()
1116
- self.proj = proj
1117
-
1118
- def forward(self, x, seq_len):
1119
- assert x.ndim == 3
1120
- # x is of shape B x L x D
1121
- # take features from the eot embedding (eot_token is the highest number in each sequence)
1122
- x = x[torch.arange(x.shape[0]), seq_len]
1123
- x = self.proj(x)
1124
- return x
1125
-
1126
-
1127
- ModalityType = SimpleNamespace(
1128
- VISION="vision",
1129
- TEXT="text",
1130
- AUDIO="audio",
1131
- THERMAL="thermal",
1132
- DEPTH="depth",
1133
- IMU="imu",
1134
- )
1135
-
1136
-
1137
- class ImageBindModel(nn.Module):
1138
- def __init__(
1139
- self,
1140
- video_frames=2,
1141
- kernel_size=(2, 14, 14),
1142
- audio_kernel_size=16,
1143
- audio_stride=10,
1144
- out_embed_dim=768,
1145
- vision_embed_dim=1024,
1146
- vision_num_blocks=24,
1147
- vision_num_heads=16,
1148
- audio_embed_dim=768,
1149
- audio_num_blocks=12,
1150
- audio_num_heads=12,
1151
- audio_num_mel_bins=128,
1152
- audio_target_len=204,
1153
- audio_drop_path=0.1,
1154
- text_embed_dim=768,
1155
- text_num_blocks=12,
1156
- text_num_heads=12,
1157
- depth_embed_dim=384,
1158
- depth_kernel_size=16,
1159
- depth_num_blocks=12,
1160
- depth_num_heads=8,
1161
- depth_drop_path=0.0,
1162
- thermal_embed_dim=768,
1163
- thermal_kernel_size=16,
1164
- thermal_num_blocks=12,
1165
- thermal_num_heads=12,
1166
- thermal_drop_path=0.0,
1167
- imu_embed_dim=512,
1168
- imu_kernel_size=8,
1169
- imu_num_blocks=6,
1170
- imu_num_heads=8,
1171
- imu_drop_path=0.7,
1172
- ):
1173
- super().__init__()
1174
-
1175
- self.modality_preprocessors = self._create_modality_preprocessors(
1176
- video_frames,
1177
- vision_embed_dim,
1178
- kernel_size,
1179
- text_embed_dim,
1180
- audio_embed_dim,
1181
- audio_kernel_size,
1182
- audio_stride,
1183
- audio_num_mel_bins,
1184
- audio_target_len,
1185
- depth_embed_dim,
1186
- depth_kernel_size,
1187
- thermal_embed_dim,
1188
- thermal_kernel_size,
1189
- imu_embed_dim,
1190
- )
1191
-
1192
- self.modality_trunks = self._create_modality_trunks(
1193
- vision_embed_dim,
1194
- vision_num_blocks,
1195
- vision_num_heads,
1196
- text_embed_dim,
1197
- text_num_blocks,
1198
- text_num_heads,
1199
- audio_embed_dim,
1200
- audio_num_blocks,
1201
- audio_num_heads,
1202
- audio_drop_path,
1203
- depth_embed_dim,
1204
- depth_num_blocks,
1205
- depth_num_heads,
1206
- depth_drop_path,
1207
- thermal_embed_dim,
1208
- thermal_num_blocks,
1209
- thermal_num_heads,
1210
- thermal_drop_path,
1211
- imu_embed_dim,
1212
- imu_num_blocks,
1213
- imu_num_heads,
1214
- imu_drop_path,
1215
- )
1216
-
1217
- self.modality_heads = self._create_modality_heads(
1218
- out_embed_dim,
1219
- vision_embed_dim,
1220
- text_embed_dim,
1221
- audio_embed_dim,
1222
- depth_embed_dim,
1223
- thermal_embed_dim,
1224
- imu_embed_dim,
1225
- )
1226
-
1227
- self.modality_postprocessors = self._create_modality_postprocessors(
1228
- out_embed_dim
1229
- )
1230
-
1231
- def _create_modality_preprocessors(
1232
- self,
1233
- video_frames=2,
1234
- vision_embed_dim=1024,
1235
- kernel_size=(2, 14, 14),
1236
- text_embed_dim=768,
1237
- audio_embed_dim=768,
1238
- audio_kernel_size=16,
1239
- audio_stride=10,
1240
- audio_num_mel_bins=128,
1241
- audio_target_len=204,
1242
- depth_embed_dim=768,
1243
- depth_kernel_size=16,
1244
- thermal_embed_dim=768,
1245
- thermal_kernel_size=16,
1246
- imu_embed_dim=512,
1247
- ):
1248
- rgbt_stem = PatchEmbedGeneric(
1249
- proj_stem=[
1250
- PadIm2Video(pad_type="repeat", ntimes=2),
1251
- nn.Conv3d(
1252
- in_channels=3,
1253
- kernel_size=kernel_size,
1254
- out_channels=vision_embed_dim,
1255
- stride=kernel_size,
1256
- bias=False,
1257
- ),
1258
- ]
1259
- )
1260
- rgbt_preprocessor = RGBDTPreprocessor(
1261
- img_size=[3, video_frames, 224, 224],
1262
- num_cls_tokens=1,
1263
- pos_embed_fn=partial(SpatioTemporalPosEmbeddingHelper, learnable=True),
1264
- rgbt_stem=rgbt_stem,
1265
- depth_stem=None,
1266
- )
1267
-
1268
- text_preprocessor = TextPreprocessor(
1269
- context_length=77,
1270
- vocab_size=49408,
1271
- embed_dim=text_embed_dim,
1272
- causal_masking=True,
1273
- )
1274
-
1275
- audio_stem = PatchEmbedGeneric(
1276
- proj_stem=[
1277
- nn.Conv2d(
1278
- in_channels=1,
1279
- kernel_size=audio_kernel_size,
1280
- stride=audio_stride,
1281
- out_channels=audio_embed_dim,
1282
- bias=False,
1283
- ),
1284
- ],
1285
- norm_layer=nn.LayerNorm(normalized_shape=audio_embed_dim),
1286
- )
1287
- audio_preprocessor = AudioPreprocessor(
1288
- img_size=[1, audio_num_mel_bins, audio_target_len],
1289
- num_cls_tokens=1,
1290
- pos_embed_fn=partial(SpatioTemporalPosEmbeddingHelper, learnable=True),
1291
- audio_stem=audio_stem,
1292
- )
1293
-
1294
- # depth_stem = PatchEmbedGeneric(
1295
- # [
1296
- # nn.Conv2d(
1297
- # kernel_size=depth_kernel_size,
1298
- # in_channels=1,
1299
- # out_channels=depth_embed_dim,
1300
- # stride=depth_kernel_size,
1301
- # bias=False,
1302
- # ),
1303
- # ],
1304
- # norm_layer=nn.LayerNorm(normalized_shape=depth_embed_dim),
1305
- # )
1306
- #
1307
- # depth_preprocessor = RGBDTPreprocessor(
1308
- # img_size=[1, 224, 224],
1309
- # num_cls_tokens=1,
1310
- # pos_embed_fn=partial(SpatioTemporalPosEmbeddingHelper, learnable=True),
1311
- # rgbt_stem=None,
1312
- # depth_stem=depth_stem,
1313
- # )
1314
- #
1315
- # thermal_stem = PatchEmbedGeneric(
1316
- # [
1317
- # nn.Conv2d(
1318
- # kernel_size=thermal_kernel_size,
1319
- # in_channels=1,
1320
- # out_channels=thermal_embed_dim,
1321
- # stride=thermal_kernel_size,
1322
- # bias=False,
1323
- # ),
1324
- # ],
1325
- # norm_layer=nn.LayerNorm(normalized_shape=thermal_embed_dim),
1326
- # )
1327
- # thermal_preprocessor = ThermalPreprocessor(
1328
- # img_size=[1, 224, 224],
1329
- # num_cls_tokens=1,
1330
- # pos_embed_fn=partial(SpatioTemporalPosEmbeddingHelper, learnable=True),
1331
- # thermal_stem=thermal_stem,
1332
- # )
1333
- #
1334
- # imu_stem = PatchEmbedGeneric(
1335
- # [
1336
- # nn.Linear(
1337
- # in_features=48,
1338
- # out_features=imu_embed_dim,
1339
- # bias=False,
1340
- # ),
1341
- # ],
1342
- # norm_layer=nn.LayerNorm(normalized_shape=imu_embed_dim),
1343
- # )
1344
- #
1345
- # imu_preprocessor = IMUPreprocessor(
1346
- # img_size=[6, 2000],
1347
- # num_cls_tokens=1,
1348
- # kernel_size=8,
1349
- # embed_dim=imu_embed_dim,
1350
- # pos_embed_fn=partial(SpatioTemporalPosEmbeddingHelper, learnable=True),
1351
- # imu_stem=imu_stem,
1352
- # )
1353
-
1354
- modality_preprocessors = {
1355
- ModalityType.VISION: rgbt_preprocessor,
1356
- ModalityType.TEXT: text_preprocessor,
1357
- ModalityType.AUDIO: audio_preprocessor,
1358
- # ModalityType.DEPTH: depth_preprocessor,
1359
- # ModalityType.THERMAL: thermal_preprocessor,
1360
- # ModalityType.IMU: imu_preprocessor,
1361
- }
1362
-
1363
- return nn.ModuleDict(modality_preprocessors)
1364
-
1365
- def _create_modality_trunks(
1366
- self,
1367
- vision_embed_dim=1024,
1368
- vision_num_blocks=24,
1369
- vision_num_heads=16,
1370
- text_embed_dim=768,
1371
- text_num_blocks=12,
1372
- text_num_heads=12,
1373
- audio_embed_dim=768,
1374
- audio_num_blocks=12,
1375
- audio_num_heads=12,
1376
- audio_drop_path=0.0,
1377
- depth_embed_dim=768,
1378
- depth_num_blocks=12,
1379
- depth_num_heads=12,
1380
- depth_drop_path=0.0,
1381
- thermal_embed_dim=768,
1382
- thermal_num_blocks=12,
1383
- thermal_num_heads=12,
1384
- thermal_drop_path=0.0,
1385
- imu_embed_dim=512,
1386
- imu_num_blocks=6,
1387
- imu_num_heads=8,
1388
- imu_drop_path=0.7,
1389
- ):
1390
- def instantiate_trunk(
1391
- embed_dim, num_blocks, num_heads, pre_transformer_ln, add_bias_kv, drop_path
1392
- ):
1393
- return SimpleTransformer(
1394
- embed_dim=embed_dim,
1395
- num_blocks=num_blocks,
1396
- ffn_dropout_rate=0.0,
1397
- drop_path_rate=drop_path,
1398
- attn_target=partial(
1399
- MultiheadAttention,
1400
- embed_dim=embed_dim,
1401
- num_heads=num_heads,
1402
- bias=True,
1403
- add_bias_kv=add_bias_kv,
1404
- ),
1405
- pre_transformer_layer=nn.Sequential(
1406
- nn.LayerNorm(embed_dim, eps=1e-6)
1407
- if pre_transformer_ln
1408
- else nn.Identity(),
1409
- EinOpsRearrange("b l d -> l b d"),
1410
- ),
1411
- post_transformer_layer=EinOpsRearrange("l b d -> b l d"),
1412
- )
1413
-
1414
- modality_trunks = {}
1415
- modality_trunks[ModalityType.VISION] = instantiate_trunk(
1416
- vision_embed_dim,
1417
- vision_num_blocks,
1418
- vision_num_heads,
1419
- pre_transformer_ln=True,
1420
- add_bias_kv=False,
1421
- drop_path=0.0,
1422
- )
1423
- modality_trunks[ModalityType.TEXT] = instantiate_trunk(
1424
- text_embed_dim,
1425
- text_num_blocks,
1426
- text_num_heads,
1427
- pre_transformer_ln=False,
1428
- add_bias_kv=False,
1429
- drop_path=0.0,
1430
- )
1431
- modality_trunks[ModalityType.AUDIO] = instantiate_trunk(
1432
- audio_embed_dim,
1433
- audio_num_blocks,
1434
- audio_num_heads,
1435
- pre_transformer_ln=False,
1436
- add_bias_kv=True,
1437
- drop_path=audio_drop_path,
1438
- )
1439
- # modality_trunks[ModalityType.DEPTH] = instantiate_trunk(
1440
- # depth_embed_dim,
1441
- # depth_num_blocks,
1442
- # depth_num_heads,
1443
- # pre_transformer_ln=False,
1444
- # add_bias_kv=True,
1445
- # drop_path=depth_drop_path,
1446
- # )
1447
- # modality_trunks[ModalityType.THERMAL] = instantiate_trunk(
1448
- # thermal_embed_dim,
1449
- # thermal_num_blocks,
1450
- # thermal_num_heads,
1451
- # pre_transformer_ln=False,
1452
- # add_bias_kv=True,
1453
- # drop_path=thermal_drop_path,
1454
- # )
1455
- # modality_trunks[ModalityType.IMU] = instantiate_trunk(
1456
- # imu_embed_dim,
1457
- # imu_num_blocks,
1458
- # imu_num_heads,
1459
- # pre_transformer_ln=False,
1460
- # add_bias_kv=True,
1461
- # drop_path=imu_drop_path,
1462
- # )
1463
-
1464
- return nn.ModuleDict(modality_trunks)
1465
-
1466
- def _create_modality_heads(
1467
- self,
1468
- out_embed_dim,
1469
- vision_embed_dim,
1470
- text_embed_dim,
1471
- audio_embed_dim,
1472
- depth_embed_dim,
1473
- thermal_embed_dim,
1474
- imu_embed_dim,
1475
- ):
1476
- modality_heads = {}
1477
-
1478
- modality_heads[ModalityType.VISION] = nn.Sequential(
1479
- nn.LayerNorm(normalized_shape=vision_embed_dim, eps=1e-6),
1480
- SelectElement(index=0),
1481
- nn.Linear(vision_embed_dim, out_embed_dim, bias=False),
1482
- )
1483
-
1484
- modality_heads[ModalityType.TEXT] = SelectEOSAndProject(
1485
- proj=nn.Sequential(
1486
- nn.LayerNorm(normalized_shape=text_embed_dim, eps=1e-6),
1487
- nn.Linear(text_embed_dim, out_embed_dim, bias=False),
1488
- )
1489
- )
1490
-
1491
- modality_heads[ModalityType.AUDIO] = nn.Sequential(
1492
- nn.LayerNorm(normalized_shape=audio_embed_dim, eps=1e-6),
1493
- SelectElement(index=0),
1494
- nn.Linear(audio_embed_dim, out_embed_dim, bias=False),
1495
- )
1496
-
1497
- # modality_heads[ModalityType.DEPTH] = nn.Sequential(
1498
- # nn.LayerNorm(normalized_shape=depth_embed_dim, eps=1e-6),
1499
- # SelectElement(index=0),
1500
- # nn.Linear(depth_embed_dim, out_embed_dim, bias=False),
1501
- # )
1502
- #
1503
- # modality_heads[ModalityType.THERMAL] = nn.Sequential(
1504
- # nn.LayerNorm(normalized_shape=thermal_embed_dim, eps=1e-6),
1505
- # SelectElement(index=0),
1506
- # nn.Linear(thermal_embed_dim, out_embed_dim, bias=False),
1507
- # )
1508
- #
1509
- # modality_heads[ModalityType.IMU] = nn.Sequential(
1510
- # nn.LayerNorm(normalized_shape=imu_embed_dim, eps=1e-6),
1511
- # SelectElement(index=0),
1512
- # nn.Dropout(p=0.5),
1513
- # nn.Linear(imu_embed_dim, out_embed_dim, bias=False),
1514
- # )
1515
-
1516
- return nn.ModuleDict(modality_heads)
1517
-
1518
- def _create_modality_postprocessors(self, out_embed_dim):
1519
- modality_postprocessors = {}
1520
-
1521
- modality_postprocessors[ModalityType.VISION] = Normalize(dim=-1)
1522
- modality_postprocessors[ModalityType.TEXT] = nn.Sequential(
1523
- Normalize(dim=-1), LearnableLogitScaling(learnable=True)
1524
- )
1525
- modality_postprocessors[ModalityType.AUDIO] = nn.Sequential(
1526
- Normalize(dim=-1),
1527
- LearnableLogitScaling(logit_scale_init=20.0, learnable=False),
1528
- )
1529
- # modality_postprocessors[ModalityType.DEPTH] = nn.Sequential(
1530
- # Normalize(dim=-1),
1531
- # LearnableLogitScaling(logit_scale_init=5.0, learnable=False),
1532
- # )
1533
- # modality_postprocessors[ModalityType.THERMAL] = nn.Sequential(
1534
- # Normalize(dim=-1),
1535
- # LearnableLogitScaling(logit_scale_init=10.0, learnable=False),
1536
- # )
1537
- # modality_postprocessors[ModalityType.IMU] = nn.Sequential(
1538
- # Normalize(dim=-1),
1539
- # LearnableLogitScaling(logit_scale_init=5.0, learnable=False),
1540
- # )
1541
-
1542
- return nn.ModuleDict(modality_postprocessors)
1543
-
1544
- def forward(self, inputs):
1545
- outputs = {}
1546
- for modality_key, modality_value in inputs.items():
1547
- reduce_list = (
1548
- modality_value.ndim >= 5
1549
- ) # Audio and Video inputs consist of multiple clips
1550
- if reduce_list:
1551
- B, S = modality_value.shape[:2]
1552
- modality_value = modality_value.reshape(
1553
- B * S, *modality_value.shape[2:]
1554
- )
1555
-
1556
- if modality_value is not None:
1557
- modality_value = self.modality_preprocessors[modality_key](
1558
- **{modality_key: modality_value}
1559
- )
1560
- trunk_inputs = modality_value["trunk"]
1561
- head_inputs = modality_value["head"]
1562
- modality_value = self.modality_trunks[modality_key](**trunk_inputs)
1563
- modality_value = self.modality_heads[modality_key](
1564
- modality_value, **head_inputs
1565
- )
1566
- modality_value = self.modality_postprocessors[modality_key](
1567
- modality_value
1568
- )
1569
-
1570
- if reduce_list:
1571
- modality_value = modality_value.reshape(B, S, -1)
1572
- modality_value = modality_value.mean(dim=1)
1573
-
1574
- outputs[modality_key] = modality_value
1575
-
1576
- return outputs
1577
-
1578
- def reconfigure_head(self, k, v):
1579
- if k == ModalityType.AUDIO:
1580
- return torch.nn.Sequential(v[0], v[2])
1581
- elif k == ModalityType.VISION:
1582
- return torch.nn.Sequential(v[0], v[2])
1583
- else:
1584
- return v
1585
-
1586
- def forward_features(self, inputs):
1587
- outputs = {}
1588
-
1589
- reconfigured_heads = {k: self.reconfigure_head(k, v) for k, v in self.modality_heads.items()}
1590
-
1591
- for modality_key, modality_value in inputs.items():
1592
- reduce_list = (
1593
- modality_value.ndim >= 5
1594
- ) # Audio and Video inputs consist of multiple clips
1595
- if reduce_list:
1596
- B, S = modality_value.shape[:2]
1597
- modality_value = modality_value.reshape(
1598
- B * S, *modality_value.shape[2:]
1599
- )
1600
-
1601
- if modality_value is not None:
1602
- modality_value = self.modality_preprocessors[modality_key](
1603
- **{modality_key: modality_value}
1604
- )
1605
- trunk_inputs = modality_value["trunk"]
1606
- head_inputs = modality_value["head"]
1607
- modality_value = self.modality_trunks[modality_key](**trunk_inputs)
1608
- modality_value = reconfigured_heads[modality_key](
1609
- modality_value, **head_inputs
1610
- )
1611
- modality_value = self.modality_postprocessors[modality_key](
1612
- modality_value
1613
- )
1614
- if modality_key == ModalityType.AUDIO:
1615
- modality_value = ReshapeAudio((12, 19))(modality_value)
1616
- elif modality_key == ModalityType.VISION:
1617
- modality_value = ReshapeSpatial((16, 16))(modality_value)
1618
-
1619
- outputs[modality_key] = modality_value
1620
-
1621
- return outputs
1622
-
1623
-
1624
- def imagebind_huge(output_path, pretrained=False):
1625
- model = ImageBindModel(
1626
- vision_embed_dim=1280,
1627
- vision_num_blocks=32,
1628
- vision_num_heads=16,
1629
- text_embed_dim=1024,
1630
- text_num_blocks=24,
1631
- text_num_heads=16,
1632
- out_embed_dim=1024,
1633
- audio_drop_path=0.1,
1634
- imu_drop_path=0.7,
1635
- )
1636
-
1637
- if pretrained:
1638
- path = os.path.join(output_path, 'models/imagebind_huge.pth')
1639
-
1640
- if not os.path.exists(path):
1641
- print(f"Downloading imagebind weights to {path} ...")
1642
- os.makedirs(os.path.dirname(path), exist_ok=True)
1643
- torch.hub.download_url_to_file(
1644
- "https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth",
1645
- path,
1646
- progress=True,
1647
- )
1648
-
1649
- model.load_state_dict(torch.load(path), strict=False)
1650
-
1651
- return model
1652
-
1653
-
1654
- DEFAULT_AUDIO_FRAME_SHIFT_MS = 10 # in milliseconds
1655
-
1656
-
1657
- def waveform2melspec(waveform, sample_rate, num_mel_bins, target_length):
1658
- # Based on https://github.com/YuanGongND/ast/blob/d7d8b4b8e06cdaeb6c843cdb38794c1c7692234c/src/dataloader.py#L102
1659
- waveform -= waveform.mean()
1660
- fbank = torchaudio.compliance.kaldi.fbank(
1661
- waveform,
1662
- htk_compat=True,
1663
- sample_frequency=sample_rate,
1664
- use_energy=False,
1665
- window_type="hanning",
1666
- num_mel_bins=num_mel_bins,
1667
- dither=0.0,
1668
- frame_length=25,
1669
- frame_shift=DEFAULT_AUDIO_FRAME_SHIFT_MS,
1670
- )
1671
- # Convert to [mel_bins, num_frames] shape
1672
- fbank = fbank.transpose(0, 1)
1673
- # Pad to target_length
1674
- n_frames = fbank.size(1)
1675
- p = target_length - n_frames
1676
- # if p is too large (say >20%), flash a warning
1677
- if abs(p) / n_frames > 0.2:
1678
- logging.warning(
1679
- "Large gap between audio n_frames(%d) and "
1680
- "target_length (%d). Is the audio_target_length "
1681
- "setting correct?",
1682
- n_frames,
1683
- target_length,
1684
- )
1685
- # cut and pad
1686
- if p > 0:
1687
- fbank = torch.nn.functional.pad(fbank, (0, p), mode="constant", value=0)
1688
- elif p < 0:
1689
- fbank = fbank[:, 0:target_length]
1690
- # Convert to [1, mel_bins, num_frames] shape, essentially like a 1
1691
- # channel image
1692
- fbank = fbank.unsqueeze(0)
1693
- return fbank
1694
-
1695
-
1696
- def get_clip_timepoints(clip_sampler, duration):
1697
- # Read out all clips in this video
1698
- all_clips_timepoints = []
1699
- is_last_clip = False
1700
- end = 0.0
1701
- while not is_last_clip:
1702
- start, end, _, _, is_last_clip = clip_sampler(end, duration, annotation=None)
1703
- all_clips_timepoints.append((start, end))
1704
- return all_clips_timepoints
1705
-
1706
-
1707
- def load_and_transform_vision_data(image_paths, device):
1708
- if image_paths is None:
1709
- return None
1710
-
1711
- image_outputs = []
1712
- for image_path in image_paths:
1713
- data_transform = transforms.Compose(
1714
- [
1715
- transforms.Resize(
1716
- 224, interpolation=transforms.InterpolationMode.BICUBIC
1717
- ),
1718
- transforms.CenterCrop(224),
1719
- transforms.ToTensor(),
1720
- transforms.Normalize(
1721
- mean=(0.48145466, 0.4578275, 0.40821073),
1722
- std=(0.26862954, 0.26130258, 0.27577711),
1723
- ),
1724
- ]
1725
- )
1726
- with open(image_path, "rb") as fopen:
1727
- image = Image.open(fopen).convert("RGB")
1728
-
1729
- image = data_transform(image).to(device)
1730
- image_outputs.append(image)
1731
- return torch.stack(image_outputs, dim=0)
1732
-
1733
-
1734
- def load_and_transform_audio_data(
1735
- audio_paths,
1736
- device,
1737
- num_mel_bins=128,
1738
- target_length=204,
1739
- sample_rate=16000,
1740
- clip_duration=2,
1741
- clips_per_video=3,
1742
- mean=-4.268,
1743
- std=9.138,
1744
- ):
1745
- from pytorchvideo.data.clip_sampling import ConstantClipsPerVideoSampler
1746
-
1747
- if audio_paths is None:
1748
- return None
1749
-
1750
- audio_outputs = []
1751
- clip_sampler = ConstantClipsPerVideoSampler(
1752
- clip_duration=clip_duration, clips_per_video=clips_per_video
1753
- )
1754
-
1755
- for audio_path in audio_paths:
1756
- waveform, sr = torchaudio.load(audio_path)
1757
- if sample_rate != sr:
1758
- waveform = torchaudio.functional.resample(
1759
- waveform, orig_freq=sr, new_freq=sample_rate
1760
- )
1761
- all_clips_timepoints = get_clip_timepoints(
1762
- clip_sampler, waveform.size(1) / sample_rate
1763
- )
1764
- all_clips = []
1765
- for clip_timepoints in all_clips_timepoints:
1766
- waveform_clip = waveform[
1767
- :,
1768
- int(clip_timepoints[0] * sample_rate): int(
1769
- clip_timepoints[1] * sample_rate
1770
- ),
1771
- ]
1772
- waveform_melspec = waveform2melspec(
1773
- waveform_clip, sample_rate, num_mel_bins, target_length
1774
- )
1775
- all_clips.append(waveform_melspec)
1776
-
1777
- normalize = transforms.Normalize(mean=mean, std=std)
1778
- all_clips = [normalize(ac).to(device) for ac in all_clips]
1779
-
1780
- all_clips = torch.stack(all_clips, dim=0)
1781
- audio_outputs.append(all_clips)
1782
-
1783
- return torch.stack(audio_outputs, dim=0)
1784
-
1785
-
1786
- class UnNormalize(object):
1787
- def __init__(self, mean, std):
1788
- self.mean = mean
1789
- self.std = std
1790
-
1791
- def __call__(self, image):
1792
- image2 = torch.clone(image)
1793
- for t, m, s in zip(image2, self.mean, self.std):
1794
- t.mul_(s).add_(m)
1795
- return image2
1796
-
1797
-
1798
- norm = T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
1799
- unnorm = UnNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
1800
-
1801
-
1802
- class TorchPCA(object):
1803
-
1804
- def __init__(self, n_components):
1805
- self.n_components = n_components
1806
-
1807
- def fit(self, X):
1808
- self.mean_ = X.mean(dim=0)
1809
- unbiased = X - self.mean_.unsqueeze(0)
1810
- U, S, V = torch.pca_lowrank(unbiased, q=self.n_components, center=False, niter=4)
1811
- self.components_ = V.T
1812
- self.singular_values_ = S
1813
- return self
1814
-
1815
- def transform(self, X):
1816
- t0 = X - self.mean_.unsqueeze(0)
1817
- projected = t0 @ self.components_.T
1818
- return projected
1819
-
1820
-
1821
- def pca(image_feats_list, dim=3, fit_pca=None):
1822
- # from sklearn.decomposition import PCA
1823
-
1824
- device = image_feats_list[0].device
1825
-
1826
- def flatten(tensor, target_size=None):
1827
- if target_size is not None and fit_pca is None:
1828
- tensor = F.interpolate(tensor, (target_size, target_size), mode="bilinear")
1829
- B, C, H, W = tensor.shape
1830
- return tensor.permute(1, 0, 2, 3).reshape(C, B * H * W).permute(1, 0).detach().cpu()
1831
-
1832
- if len(image_feats_list) > 1 and fit_pca is None:
1833
- target_size = image_feats_list[0].shape[2]
1834
- else:
1835
- target_size = None
1836
-
1837
- flattened_feats = []
1838
- for feats in image_feats_list:
1839
- flattened_feats.append(flatten(feats, target_size))
1840
- x = torch.cat(flattened_feats, dim=0)
1841
-
1842
- if fit_pca is None:
1843
- # fit_pca = PCA(n_components=dim, svd_solver='arpack').fit(np.nan_to_num(x.detach().numpy()))
1844
- fit_pca = TorchPCA(n_components=dim).fit(x)
1845
-
1846
- reduced_feats = []
1847
- for feats in image_feats_list:
1848
- # x_red = torch.from_numpy(fit_pca.transform(flatten(feats)))
1849
- x_red = fit_pca.transform(flatten(feats))
1850
- x_red -= x_red.min(dim=0, keepdim=True).values
1851
- x_red /= x_red.max(dim=0, keepdim=True).values
1852
- B, C, H, W = feats.shape
1853
- reduced_feats.append(x_red.reshape(B, H, W, dim).permute(0, 3, 1, 2).to(device))
1854
-
1855
- return reduced_feats, fit_pca
1856
-
1857
-
1858
- def my_load_audio(audio_file):
1859
- loaded_waveform, obs_sr = torchaudio.load(audio_file)
1860
- loaded_waveform = loaded_waveform[0]
1861
-
1862
- neg_waveform, neg_obs_sr = None, None
1863
- from data.AVDatasets import prep_waveform
1864
-
1865
- (waveform,
1866
- spectrogram,
1867
- audio_length,
1868
- total_length,
1869
- original_length,
1870
- mask,
1871
- pos_mask) = prep_waveform(
1872
- loaded_waveform,
1873
- obs_sr,
1874
- 10,
1875
- 128,
1876
- -4.268,
1877
- 9.138,
1878
- 16000,
1879
- True,
1880
- False,
1881
- False,
1882
- neg_waveform,
1883
- neg_obs_sr,
1884
- False,
1885
- )
1886
-
1887
- patch_size = 204
1888
- n_tiles = spectrogram.shape[1] // patch_size
1889
- assert n_tiles == 5
1890
-
1891
- patches = []
1892
- for i in range(n_tiles):
1893
- patches.append(spectrogram[:, i * patch_size:(i + 1) * patch_size, :])
1894
-
1895
- patches = torch.cat(patches, dim=0).permute(0, 2, 1).unsqueeze(1)
1896
- return patches
1897
-
1898
-
1899
- class ImageBindImageFeaturizer(nn.Module):
1900
-
1901
- def __init__(self, output_path, model=None):
1902
- super().__init__()
1903
- if model is not None:
1904
- self.model = model
1905
- else:
1906
- self.model = imagebind_huge(output_path, pretrained=True).cuda()
1907
-
1908
- def forward(self, image, include_cls):
1909
- inputs = {
1910
- ModalityType.VISION: image,
1911
- }
1912
-
1913
- patch_tokens, cls_tokens = self.model.forward_features(inputs)[ModalityType.VISION]
1914
- patch_tokens = patch_tokens.permute(0, 3, 1, 2)
1915
-
1916
- if include_cls:
1917
- return patch_tokens, cls_tokens
1918
- else:
1919
- return patch_tokens
1920
-
1921
-
1922
- class ImageBindAudioFeaturizer(nn.Module):
1923
-
1924
- def __init__(self, output_path, model=None):
1925
- super().__init__()
1926
- if model is not None:
1927
- self.model = model
1928
- else:
1929
- self.model = imagebind_huge(output_path, pretrained=True).cuda()
1930
-
1931
- def forward(self, spec, include_cls):
1932
-
1933
- patch_size = 204
1934
- n_tiles = spec.shape[2] // patch_size
1935
- assert n_tiles == 5
1936
-
1937
- patches = []
1938
- for i in range(n_tiles):
1939
- patches.append(spec[:, :, i * patch_size:(i + 1) * patch_size, :])
1940
-
1941
- patches = torch.cat(patches, dim=1).permute(0, 1, 3, 2).unsqueeze(2)
1942
-
1943
- inputs = {
1944
- ModalityType.AUDIO: patches,
1945
- }
1946
-
1947
- patch_tokens, cls_token = self.model.forward_features(inputs)[ModalityType.AUDIO]
1948
-
1949
- patch_tokens = patch_tokens.permute(0, 4, 2, 1, 3)
1950
- b, c, h, p, w = patch_tokens.shape
1951
- patch_tokens = patch_tokens.reshape(b, c, h, w * p)
1952
-
1953
- cls_token = cls_token.reshape(b, p, -1).mean(1)
1954
-
1955
- if include_cls:
1956
- return patch_tokens, cls_token
1957
- else:
1958
- return patch_tokens
1959
-
1960
-
1961
- if __name__ == "__main__":
1962
- image_paths = ["../../samples/dog_image.jpg", "../../samples/car_image.jpg", "../../samples/bird_image.jpg"]
1963
- audio_paths = ["../../samples/dog_audio.wav", "../../samples/car_audio.wav", "../../samples/bird_audio.wav"]
1964
-
1965
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
1966
-
1967
- # Instantiate model
1968
- model = imagebind_huge("../../", pretrained=True)
1969
- model.eval()
1970
- model.to(device)
1971
-
1972
- audio_inputs = torch.cat([my_load_audio(af).unsqueeze(0) for af in audio_paths], dim=0).cuda()
1973
- # Load data
1974
- inputs = {
1975
- ModalityType.VISION: load_and_transform_vision_data(image_paths, device),
1976
- # ModalityType.AUDIO: load_and_transform_audio_data(audio_paths, device, clip_duration=2, clips_per_video=5),
1977
- ModalityType.AUDIO: audio_inputs,
1978
-
1979
- }
1980
-
1981
- with torch.no_grad():
1982
- embeddings = model.forward_features(inputs)
1983
- cls_tokens = model.forward(inputs)
1984
-
1985
- audio_cls_token = embeddings["audio"][1].reshape(3, 5, -1).mean(1)
1986
-
1987
- sims1 = torch.einsum(
1988
- "bc,dc->bd",
1989
- embeddings["vision"][1],
1990
- audio_cls_token)
1991
-
1992
- print(torch.softmax(sims1, dim=1).cpu().numpy())
1993
- #
1994
- # sims2 = torch.einsum(
1995
- # "bc,dc->bd",
1996
- # embeddings["vision"].mean(1).mean(1),
1997
- # embeddings["audio"].mean(1).mean(1).mean(1)
1998
- # )
1999
- #
2000
- # print(torch.softmax(sims2, dim=1).cpu().numpy())
2001
- #
2002
- #
2003
- # img_num = 0
2004
- # img_feats = F.normalize(embeddings["vision"].permute(0, 3, 1, 2), dim=1)
2005
- # [red_img_feats], fit_pca = pca([img_feats])
2006
- #
2007
- # fig, axes = plt.subplots(2, 2, figsize=(4 * 2, 4 * 2))
2008
- # axes[0][0].imshow(unnorm(inputs["vision"][0].unsqueeze(0))[0].permute(1, 2, 0).detach().cpu())
2009
- # axes[0][1].imshow(unnorm(inputs["vision"][1].unsqueeze(0))[0].permute(1, 2, 0).detach().cpu())
2010
- # axes[1][0].imshow(red_img_feats[0].permute(1, 2, 0).detach().cpu())
2011
- # axes[1][1].imshow(red_img_feats[1].permute(1, 2, 0).detach().cpu())
2012
- # plt.tight_layout()
2013
- # plt.show()
2014
- #
2015
- audio_embs = F.normalize(embeddings["audio"][0], dim=-1)
2016
- b, n, h, w, c = audio_embs.shape
2017
-
2018
- audio_embs = audio_embs.permute(0, 4, 2, 1, 3).reshape(b, c, h, w * n)
2019
-
2020
- b, n, c, h, w = inputs["audio"].shape
2021
- audio_inputs = inputs["audio"].permute(0, 2, 3, 1, 4).reshape(b, c, h, w * n)
2022
-
2023
- print("here")
2024
-
2025
- for img_num in range(3):
2026
- [red_audio], fit_pca = pca([audio_embs[img_num].unsqueeze(0)])
2027
- fig, axes = plt.subplots(2, 1, figsize=(10 * 1, 4 * 2))
2028
- axes[0].imshow(audio_inputs[img_num, 0].detach().cpu())
2029
- axes[1].imshow(red_audio[0].permute(1, 2, 0).detach().cpu())
2030
- plt.tight_layout()
2031
- plt.show()
2032
-
2033
- print("here")
 
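Below is a minimal usage sketch for the two featurizer wrappers defined above. It is an illustration, not part of the original file: it assumes the definitions above (or `denseav.featurizers.ImageBind`) are importable, a CUDA machine with enough memory for the `imagebind_huge` backbone, that the pretrained weights may be downloaded into a local `./checkpoints` folder, and that `samples/dog_image.jpg` exists. The audio tensor is a random stand-in with the (B, 1, 1020, 128) shape the audio featurizer expects, i.e. five 204-frame tiles of a 128-bin log-mel spectrogram.

    import torch

    # Share one backbone between both wrappers so the large checkpoint loads only once.
    backbone = imagebind_huge("./checkpoints", pretrained=True).cuda().eval()
    image_featurizer = ImageBindImageFeaturizer("./checkpoints", model=backbone)
    audio_featurizer = ImageBindAudioFeaturizer("./checkpoints", model=backbone)

    with torch.no_grad():
        # Dense visual features: a (1, 1024, 16, 16) patch grid plus a (1, 1024) CLS token.
        images = load_and_transform_vision_data(["samples/dog_image.jpg"], "cuda")
        img_patches, img_cls = image_featurizer(images, include_cls=True)

        # Dense audio features from a stand-in spectrogram: (1, 1024, 12, 95) plus a (1, 1024) CLS token.
        spec = torch.randn(1, 1, 1020, 128, device="cuda")
        aud_patches, aud_cls = audio_featurizer(spec, include_cls=True)
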
DenseAV/denseav/featurizers/__init__.py DELETED
File without changes
DenseAV/denseav/plotting.py DELETED
@@ -1,244 +0,0 @@
1
- import os
2
- from collections import defaultdict
3
-
4
- import matplotlib.colors as mcolors
5
- import matplotlib.pyplot as plt
6
- import numpy as np
7
- import scipy.io.wavfile as wavfile
8
- import torch
9
- import torch.nn.functional as F
10
- import torchvision
11
- from moviepy.editor import VideoFileClip, AudioFileClip
12
- from base64 import b64encode
13
- from denseav.shared import pca
14
-
15
-
16
- def write_video_with_audio(video_frames, audio_array, video_fps, audio_fps, output_path):
17
- """
18
- Writes video frames and audio to a specified path.
19
-
20
- Parameters:
21
- - video_frames: torch.Tensor of shape (num_frames, height, width, channels)
22
- - audio_array: torch.Tensor of shape (num_channels, num_samples)
23
- - video_fps: int, frames per second of the video
24
- - audio_fps: int, sample rate of the audio
25
- - output_path: str, path to save the final video with audio
26
- """
27
- os.makedirs(os.path.dirname(output_path), exist_ok=True)
28
-
29
- temp_video_path = output_path.replace('.mp4', '_temp.mp4')
30
- temp_audio_path = output_path.replace('.mp4', '_temp_audio.wav')
31
- video_options = {
32
- 'crf': '23',
33
- 'preset': 'slow',
34
- 'bit_rate': '1000k'}
35
-
36
- if audio_array is not None:
37
- torchvision.io.write_video(
38
- filename=temp_video_path,
39
- video_array=video_frames,
40
- fps=video_fps,
41
- options=video_options
42
- )
43
-
44
- wavfile.write(temp_audio_path, audio_fps, audio_array.cpu().to(torch.float64).permute(1, 0).numpy())
45
- video_clip = VideoFileClip(temp_video_path)
46
- audio_clip = AudioFileClip(temp_audio_path)
47
- final_clip = video_clip.set_audio(audio_clip)
48
- final_clip.write_videofile(output_path, codec='libx264', verbose=False)
49
- os.remove(temp_video_path)
50
- os.remove(temp_audio_path)
51
- else:
52
- torchvision.io.write_video(
53
- filename=output_path,
54
- video_array=video_frames,
55
- fps=video_fps,
56
- options=video_options
57
- )
58
-
59
-
60
- def alpha_blend_layers(layers):
61
- blended_image = layers[0]
62
- for layer in layers[1:]:
63
- rgb1, alpha1 = blended_image[:, :3, :, :], blended_image[:, 3:4, :, :]
64
- rgb2, alpha2 = layer[:, :3, :, :], layer[:, 3:4, :, :]
65
- alpha_out = alpha2 + alpha1 * (1 - alpha2)
66
- rgb_out = (rgb2 * alpha2 + rgb1 * alpha1 * (1 - alpha2)) / alpha_out.clamp(min=1e-7)
67
- blended_image = torch.cat([rgb_out, alpha_out], dim=1)
68
- return (blended_image[:, :3] * 255).clamp(0, 255).to(torch.uint8).permute(0, 2, 3, 1)
69
-
70
-
71
- def _prep_sims_for_plotting(sim_by_head, frames):
72
- with torch.no_grad():
73
- results = defaultdict(list)
74
- n_frames, _, vh, vw = frames.shape
75
-
76
- sims = sim_by_head.max(dim=1).values
77
-
78
- n_audio_feats = sims.shape[-1]
79
- for frame_num in range(n_frames):
80
- selected_audio_feat = int((frame_num / n_frames) * n_audio_feats)
81
-
82
- selected_sim = F.interpolate(
83
- sims[frame_num, :, :, selected_audio_feat].unsqueeze(0).unsqueeze(0),
84
- size=(vh, vw),
85
- mode="bicubic")
86
-
87
- results["sims_all"].append(selected_sim)
88
-
89
- for head in range(sim_by_head.shape[1]):
90
- selected_sim = F.interpolate(
91
- sim_by_head[frame_num, head, :, :, selected_audio_feat].unsqueeze(0).unsqueeze(0),
92
- size=(vh, vw),
93
- mode="bicubic")
94
- results[f"sims_{head + 1}"].append(selected_sim)
95
-
96
- results = {k: torch.cat(v, dim=0) for k, v in results.items()}
97
- return results
98
-
99
-
100
- def get_plasma_with_alpha():
101
- plasma = plt.cm.plasma(np.linspace(0, 1, 256))
102
- alphas = np.linspace(0, 1, 256)
103
- plasma_with_alpha = np.zeros((256, 4))
104
- plasma_with_alpha[:, 0:3] = plasma[:, 0:3]
105
- plasma_with_alpha[:, 3] = alphas
106
- return mcolors.ListedColormap(plasma_with_alpha)
107
-
108
-
109
- def get_inferno_with_alpha_2(alpha=0.5, k=30):
110
- k_fraction = k / 100.0
111
- custom_cmap = np.zeros((256, 4))
112
- threshold_index = int(k_fraction * 256)
113
- custom_cmap[:threshold_index, :3] = 0 # RGB values for black
114
- custom_cmap[:threshold_index, 3] = alpha # Alpha value
115
- remaining_inferno = plt.cm.inferno(np.linspace(0, 1, 256 - threshold_index))
116
- custom_cmap[threshold_index:, :3] = remaining_inferno[:, :3]
117
- custom_cmap[threshold_index:, 3] = alpha # Alpha value
118
- return mcolors.ListedColormap(custom_cmap)
119
-
120
-
121
- def get_inferno_with_alpha():
122
- plasma = plt.cm.inferno(np.linspace(0, 1, 256))
123
- alphas = np.linspace(0, 1, 256)
124
- plasma_with_alpha = np.zeros((256, 4))
125
- plasma_with_alpha[:, 0:3] = plasma[:, 0:3]
126
- plasma_with_alpha[:, 3] = alphas
127
- return mcolors.ListedColormap(plasma_with_alpha)
128
-
129
-
130
- red_cmap = mcolors.LinearSegmentedColormap('RedMap', segmentdata={
131
- 'red': [(0.0, 1.0, 1.0), (1.0, 1.0, 1.0)],
132
- 'green': [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
133
- 'blue': [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
134
- 'alpha': [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
135
- })
136
-
137
- blue_cmap = mcolors.LinearSegmentedColormap('BlueMap', segmentdata={
138
- 'red': [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
139
- 'green': [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
140
- 'blue': [(0.0, 1.0, 1.0), (1.0, 1.0, 1.0)],
141
- 'alpha': [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
142
- })
143
-
144
-
145
- def plot_attention_video(sims_by_head, frames, audio, video_fps, audio_fps, output_filename):
146
- prepped_sims = _prep_sims_for_plotting(sims_by_head, frames)
147
- n_frames, _, vh, vw = frames.shape
148
- sims_all = prepped_sims["sims_all"].clamp_min(0)
149
- sims_all -= sims_all.min()
150
- sims_all = sims_all / sims_all.max()
151
- cmap = get_inferno_with_alpha()
152
- layer1 = torch.cat([frames, torch.ones(n_frames, 1, vh, vw)], axis=1)
153
- layer2 = torch.tensor(cmap(sims_all.squeeze().detach().cpu())).permute(0, 3, 1, 2)
154
- write_video_with_audio(
155
- alpha_blend_layers([layer1, layer2]),
156
- audio,
157
- video_fps,
158
- audio_fps,
159
- output_filename)
160
-
161
-
162
- def plot_2head_attention_video(sims_by_head, frames, audio, video_fps, audio_fps, output_filename):
163
- prepped_sims = _prep_sims_for_plotting(sims_by_head, frames)
164
- sims_1 = prepped_sims["sims_1"]
165
- sims_2 = prepped_sims["sims_2"]
166
-
167
- n_frames, _, vh, vw = frames.shape
168
-
169
- mask = sims_1 > sims_2
170
- sims_1 *= mask
171
- sims_2 *= (~mask)
172
-
173
- sims_1 = sims_1.clamp_min(0)
174
- sims_1 -= sims_1.min()
175
- sims_1 = sims_1 / sims_1.max()
176
-
177
- sims_2 = sims_2.clamp_min(0)
178
- sims_2 -= sims_2.min()
179
- sims_2 = sims_2 / sims_2.max()
180
-
181
- layer1 = torch.cat([frames, torch.ones(n_frames, 1, vh, vw)], axis=1)
182
- layer2_head1 = torch.tensor(red_cmap(sims_1.squeeze().detach().cpu())).permute(0, 3, 1, 2)
183
- layer2_head2 = torch.tensor(blue_cmap(sims_2.squeeze().detach().cpu())).permute(0, 3, 1, 2)
184
-
185
- write_video_with_audio(
186
- alpha_blend_layers([layer1, layer2_head1, layer2_head2]),
187
- audio,
188
- video_fps,
189
- audio_fps,
190
- output_filename)
191
-
192
-
193
- def plot_feature_video(image_feats,
194
- audio_feats,
195
- frames,
196
- audio,
197
- video_fps,
198
- audio_fps,
199
- video_filename,
200
- audio_filename):
201
- with torch.no_grad():
202
- image_feats_ = image_feats.cpu()
203
- audio_feats_ = audio_feats.cpu()
204
- [red_img_feats, red_audio_feats], _ = pca([
205
- image_feats_,
206
- audio_feats_, # .tile(image_feats_.shape[0], 1, 1, 1)
207
- ])
208
- _, _, vh, vw = frames.shape
209
- red_img_feats = F.interpolate(red_img_feats, size=(vh, vw), mode="bicubic")
210
- red_audio_feats = red_audio_feats[0].unsqueeze(0)
211
- red_audio_feats = F.interpolate(red_audio_feats, size=(50, red_img_feats.shape[0]), mode="bicubic")
212
-
213
- write_video_with_audio(
214
- (red_img_feats.permute(0, 2, 3, 1) * 255).clamp(0, 255).to(torch.uint8),
215
- audio,
216
- video_fps,
217
- audio_fps,
218
- video_filename)
219
-
220
- red_audio_feats_expanded = red_audio_feats.tile(red_img_feats.shape[0], 1, 1, 1)
221
- red_audio_feats_expanded = F.interpolate(red_audio_feats_expanded, scale_factor=6, mode="bicubic")
222
- for i in range(red_img_feats.shape[0]):
223
- center_index = i * 6
224
- min_index = max(center_index - 2, 0)
225
- max_index = min(center_index + 2, red_audio_feats_expanded.shape[-1])
226
- red_audio_feats_expanded[i, :, :, min_index:max_index] = 1
227
-
228
- write_video_with_audio(
229
- (red_audio_feats_expanded.permute(0, 2, 3, 1) * 255).clamp(0, 255).to(torch.uint8),
230
- audio,
231
- video_fps,
232
- audio_fps,
233
- audio_filename)
234
-
235
-
236
- def display_video_in_notebook(path):
237
- from IPython.display import HTML, display
238
- mp4 = open(path, 'rb').read()
239
- data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
240
- display(HTML("""
241
- <video width=400 controls>
242
- <source src="%s" type="video/mp4">
243
- </video>
244
- """ % data_url))
 
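The plotting helpers above can be exercised end-to-end with synthetic tensors, which is a convenient way to check that the video-writing path works before plugging in real model outputs. The sketch below is illustrative only and assumes moviepy (1.x, as imported above via `moviepy.editor`) and ffmpeg are installed and that a `results/` directory may be created; the tensor shapes follow the conventions used by `_prep_sims_for_plotting`: frames as (T, 3, H, W) floats in [0, 1], similarity volumes as (T, heads, h, w, audio_frames), and audio as (channels, samples).

    import torch

    n_frames, vh, vw = 30, 224, 224
    frames = torch.rand(n_frames, 3, vh, vw)                    # RGB video frames in [0, 1]
    sims_by_head = torch.rand(n_frames, 2, 14, 14, n_frames)    # (frames, heads, h, w, audio feats)
    audio = torch.zeros(1, 16000 * 3)                           # 3 s of mono silence at 16 kHz

    # Writes an mp4 with the max-over-heads attention overlaid on each frame.
    plot_attention_video(
        sims_by_head, frames, audio,
        video_fps=10, audio_fps=16000,
        output_filename="results/attention_demo.mp4")
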
DenseAV/denseav/saved_models.py DELETED
@@ -1,262 +0,0 @@
1
- import os
2
- import re
3
- from os.path import join
4
-
5
- import torch
6
-
7
-
8
-
9
- def get_latest(name, checkpoint_dir, extra_args=None):
10
- if extra_args is None:
11
- extra_args = dict()
12
- files = os.listdir(join(checkpoint_dir, name))
13
- steps = torch.tensor([int(f.split("step=")[-1].split(".")[0]) for f in files])
14
- selected = files[steps.argmax()]
15
- return dict(
16
- chkpt_name=os.path.join(name, selected),
17
- extra_args=extra_args)
18
-
19
-
20
- DS_PARAM_REGEX = r'_forward_module\.(.+)'
21
-
22
-
23
- def convert_deepspeed_checkpoint(deepspeed_ckpt_path: str, pl_ckpt_path: str = None):
24
- '''
25
- Creates a PyTorch Lightning checkpoint from the DeepSpeed checkpoint directory, while patching
26
- in parameters which are improperly loaded by the DeepSpeed conversion utility.
27
- deepspeed_ckpt_path: Path to the DeepSpeed checkpoint folder.
28
- pl_ckpt_path: Path to the reconstructed PyTorch Lightning checkpoint. If not specified, will be
29
- placed in the same directory as the DeepSpeed checkpoint directory with the same name but
30
- a .pt extension.
31
- Returns: path to the converted checkpoint.
32
- '''
33
- from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict
34
-
35
-
36
- if not (deepspeed_ckpt_path.endswith('.ckpt') and os.path.isdir(deepspeed_ckpt_path)):
37
- raise ValueError(
38
- 'args.ckpt_dir should point to the checkpoint directory'
39
- ' output by DeepSpeed (e.g. "last.ckpt" or "epoch=4-step=39150.ckpt").'
40
- )
41
-
42
- # Convert state dict to PyTorch format
43
- if not pl_ckpt_path:
44
- pl_ckpt_path = f'{deepspeed_ckpt_path[:-4]}pt' # .ckpt --> .pt
45
-
46
- if not os.path.exists(pl_ckpt_path):
47
- convert_zero_checkpoint_to_fp32_state_dict(deepspeed_ckpt_path, pl_ckpt_path)
48
-
49
- # Patch in missing parameters that failed to be converted by DeepSpeed utility
50
- pl_ckpt = _merge_deepspeed_weights(deepspeed_ckpt_path, pl_ckpt_path)
51
- torch.save(pl_ckpt, pl_ckpt_path)
52
-
53
- return pl_ckpt_path
54
-
55
-
56
- def get_optim_files(checkpoint_dir):
57
- files = sorted([f for f in os.listdir(checkpoint_dir) if "optim" in f])
58
- return [join(checkpoint_dir, f) for f in files]
59
-
60
-
61
- def get_model_state_file(checkpoint_dir, zero_stage):
62
- f = [f for f in os.listdir(checkpoint_dir) if "model_states" in f][0]
63
- return join(checkpoint_dir, f)
64
-
65
-
66
- def _merge_deepspeed_weights(deepspeed_ckpt_path: str, fp32_ckpt_path: str):
67
- '''
68
- Merges tensors with keys in the DeepSpeed checkpoint but not in the fp32_checkpoint
69
- into the fp32 state dict.
70
- deepspeed_ckpt_path: Path to the DeepSpeed checkpoint folder.
71
- fp32_ckpt_path: Path to the reconstructed
72
- '''
73
- from pytorch_lightning.utilities.deepspeed import ds_checkpoint_dir
74
-
75
-
76
- # This first part is based on pytorch_lightning.utilities.deepspeed.convert_zero_checkpoint_to_fp32_state_dict
77
- checkpoint_dir = ds_checkpoint_dir(deepspeed_ckpt_path)
78
- optim_files = get_optim_files(checkpoint_dir)
79
- optim_state = torch.load(optim_files[0], map_location='cpu')
80
- zero_stage = optim_state["optimizer_state_dict"]["zero_stage"]
81
- deepspeed_model_file = get_model_state_file(checkpoint_dir, zero_stage)
82
-
83
- # Start adding all parameters from DeepSpeed ckpt to generated PyTorch Lightning ckpt
84
- ds_ckpt = torch.load(deepspeed_model_file, map_location='cpu')
85
- ds_sd = ds_ckpt['module']
86
-
87
- fp32_ckpt = torch.load(fp32_ckpt_path, map_location='cpu')
88
- fp32_sd = fp32_ckpt['state_dict']
89
-
90
- for k, v in ds_sd.items():
91
- try:
92
- match = re.match(DS_PARAM_REGEX, k)
93
- param_name = match.group(1)
94
- except:
95
- print(f'Failed to extract parameter from DeepSpeed key {k}')
96
- continue
97
-
98
- v = v.to(torch.float32)
99
- if param_name not in fp32_sd:
100
- print(f'Adding parameter {param_name} from DeepSpeed state_dict to fp32_sd')
101
- fp32_sd[param_name] = v
102
- else:
103
- assert torch.allclose(v, fp32_sd[param_name].to(torch.float32), atol=1e-2)
104
-
105
- return fp32_ckpt
106
-
107
-
108
- def get_version_and_step(f, i):
109
- step = f.split("step=")[-1].split(".")[0]
110
- if "-v" in step:
111
- [step, version] = step.split("-v")
112
- else:
113
- step, version = step, 0
114
-
115
- return int(version), int(step), i
116
-
117
-
118
- def get_latest_ds(name, extra_args=None):
119
- if extra_args is None:
120
- extra_args = dict()
121
- files = os.listdir(f"../checkpoints/{name}")
122
- latest = sorted([get_version_and_step(f, i) for i, f in enumerate(files)], reverse=True)[0]
123
- selected = files[latest[-1]]
124
- # print(f"Selecting file: {selected}")
125
- ds_chkpt = join(name, selected)
126
- reg_chkpt = join(name + "_fp32", selected)
127
- reg_chkpt_path = join("../checkpoints", reg_chkpt)
128
- if not os.path.exists(reg_chkpt_path):
129
- os.makedirs(os.path.dirname(reg_chkpt_path), exist_ok=True)
130
- print(f"Checkpoint {reg_chkpt} does not exist, converting from deepspeed")
131
- convert_deepspeed_checkpoint(join("../checkpoints", ds_chkpt), reg_chkpt_path)
132
- return dict(
133
- chkpt_name=reg_chkpt,
134
- extra_args=extra_args)
135
-
136
-
137
- def get_all_models_in_dir(name, checkpoint_dir, extra_args=None):
138
- ret = {}
139
- for model_dir in os.listdir(join(checkpoint_dir, name)):
140
- full_name = f"{name}/{model_dir}/train"
141
- # print(f'"{full_name}",')
142
- ret[full_name] = get_latest(full_name, checkpoint_dir, extra_args)
143
- return ret
144
-
145
-
146
- def saved_model_dict(checkpoint_dir):
147
- model_info = {
148
-
149
- **get_all_models_in_dir(
150
- "9-5-23-mixed",
151
- checkpoint_dir,
152
- extra_args=dict(
153
- mixup_weight=0.0,
154
- sim_use_cls=False,
155
- audio_pool_width=1,
156
- memory_buffer_size=0,
157
- loss_leak=0.0)
158
- ),
159
-
160
- **get_all_models_in_dir(
161
- "1-23-24-rebuttal-heads",
162
- checkpoint_dir,
163
- extra_args=dict(
164
- loss_leak=0.0)
165
- ),
166
-
167
- **get_all_models_in_dir(
168
- "11-8-23",
169
- checkpoint_dir,
170
- extra_args=dict(loss_leak=0.0)),
171
-
172
- **get_all_models_in_dir(
173
- "10-30-23-3",
174
- checkpoint_dir,
175
- extra_args=dict(loss_leak=0.0)),
176
-
177
- "davenet": dict(
178
- chkpt_name=None,
179
- extra_args=dict(
180
- audio_blur=1,
181
- image_model_type="davenet",
182
- image_aligner_type=None,
183
- audio_model_type="davenet",
184
- audio_aligner_type=None,
185
- audio_input="davenet_spec",
186
- use_cached_embs=False,
187
- dropout=False,
188
- sim_agg_heads=1,
189
- nonneg_sim=False,
190
- audio_lora=False,
191
- image_lora=False,
192
- norm_vectors=False,
193
- ),
194
- data_args=dict(
195
- use_cached_embs=False,
196
- use_davenet_spec=True,
197
- override_target_length=20,
198
- audio_model_type="davenet",
199
- ),
200
- ),
201
-
202
- "cavmae": dict(
203
- chkpt_name=None,
204
- extra_args=dict(
205
- audio_blur=1,
206
- image_model_type="cavmae",
207
- image_aligner_type=None,
208
- audio_model_type="cavmae",
209
- audio_aligner_type=None,
210
- audio_input="spec",
211
- use_cached_embs=False,
212
- sim_agg_heads=1,
213
- dropout=False,
214
- nonneg_sim=False,
215
- audio_lora=False,
216
- image_lora=False,
217
- norm_vectors=False,
218
- learn_audio_cls=False,
219
- sim_agg_type="cavmae",
220
- ),
221
- data_args=dict(
222
- use_cached_embs=False,
223
- use_davenet_spec=True,
224
- audio_model_type="cavmae",
225
- override_target_length=10,
226
- ),
227
- ),
228
-
229
- "imagebind": dict(
230
- chkpt_name=None,
231
- extra_args=dict(
232
- audio_blur=1,
233
- image_model_type="imagebind",
234
- image_aligner_type=None,
235
- audio_model_type="imagebind",
236
- audio_aligner_type=None,
237
- audio_input="spec",
238
- use_cached_embs=False,
239
- sim_agg_heads=1,
240
- dropout=False,
241
- nonneg_sim=False,
242
- audio_lora=False,
243
- image_lora=False,
244
- norm_vectors=False,
245
- learn_audio_cls=False,
246
- sim_agg_type="imagebind",
247
- ),
248
- data_args=dict(
249
- use_cached_embs=False,
250
- use_davenet_spec=True,
251
- audio_model_type="imagebind",
252
- override_target_length=10,
253
- ),
254
- ),
255
-
256
- }
257
-
258
- model_info["denseav_language"] = model_info["10-30-23-3/places_base/train"]
259
- model_info["denseav_sound"] = model_info["11-8-23/hubert_1h_asf_cls_full_image_train_small_lr/train"]
260
- model_info["denseav_2head"] = model_info["1-23-24-rebuttal-heads/mixed-2h/train"]
261
-
262
- return model_info
 
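A small illustrative sketch of how this registry might be consumed; it is not part of the original file. It assumes a local `checkpoints/` directory laid out with the training-run folders referenced in `saved_model_dict` (entries with `chkpt_name=None`, such as `davenet`, `cavmae`, and `imagebind`, need no checkpoint folder, but `get_latest` will raise if a listed run directory is missing).

    checkpoint_dir = "checkpoints"                   # hypothetical local path
    model_info = saved_model_dict(checkpoint_dir)    # name -> dict(chkpt_name=..., extra_args=..., [data_args=...])

    entry = model_info["denseav_sound"]
    print(entry["chkpt_name"])    # relative path of the newest "step=..." checkpoint for that run
    print(entry["extra_args"])    # hyperparameter overrides to apply when loading the model
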
DenseAV/denseav/shared.py DELETED
@@ -1,739 +0,0 @@
1
- import random
2
- from collections import defaultdict, deque
3
- from typing import Any
4
-
5
- import math
6
- import matplotlib.pyplot as plt
7
- import numpy as np
8
- import torch
9
- import torch.distributed as dist
10
- import torch.nn.functional as F
11
- import torchaudio
12
- import torchvision.transforms as T
13
- from PIL import Image
14
- from torch.utils.data import Dataset
15
- from torchaudio.functional import resample
16
-
17
-
18
- class UnNormalize(object):
19
- def __init__(self, mean, std):
20
- self.mean = mean
21
- self.std = std
22
-
23
- def __call__(self, image):
24
- image2 = torch.clone(image)
25
- for t, m, s in zip(image2, self.mean, self.std):
26
- t.mul_(s).add_(m)
27
- return image2
28
-
29
-
30
- class SliceDataset(Dataset):
31
-
32
- def __init__(self, ds, start, end):
33
- self.ds = ds
34
- self.start = start
35
- self.end = end
36
-
37
- def __len__(self):
38
- return self.end - self.start
39
-
40
- def __getitem__(self, item):
41
- return self.ds[item + self.start]
42
-
43
-
44
- class SubsetDataset(Dataset):
45
-
46
- def __init__(self, ds, subset):
47
- self.ds = ds
48
- self.subset = subset
49
-
50
- def __len__(self):
51
- return len(self.subset)
52
-
53
- def __getitem__(self, item):
54
- return self.ds[self.subset[item]]
55
-
56
-
57
- norm = T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
58
- unnorm = UnNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
59
-
60
-
61
- def crop_to_divisor(x, patch_size):
62
- if len(x.shape) == 3:
63
- C, H, W = x.shape
64
- return x[:, :(patch_size * (H // patch_size)), :(patch_size * (W // patch_size))]
65
- elif len(x.shape) == 4:
66
- B, C, H, W = x.shape
67
- return x[:, :, :(patch_size * (H // patch_size)), :(patch_size * (W // patch_size))]
68
- else:
69
- raise ValueError("x should have 3 or 4 dimensions")
70
-
71
-
72
- def _remove_axes(ax):
73
- ax.xaxis.set_major_formatter(plt.NullFormatter())
74
- ax.yaxis.set_major_formatter(plt.NullFormatter())
75
- ax.set_xticks([])
76
- ax.set_yticks([])
77
-
78
-
79
- def remove_axes(axes):
80
- if len(axes.shape) == 2:
81
- for ax1 in axes:
82
- for ax in ax1:
83
- _remove_axes(ax)
84
- else:
85
- for ax in axes:
86
- _remove_axes(ax)
87
-
88
-
89
- def get_image_featurizer(name, token_type="key", **kwargs):
90
- name = name.lower()
91
-
92
- if name == "vit":
93
- from denseav.featurizers.DINO import DINOFeaturizer
94
- patch_size = 16
95
- model = DINOFeaturizer("vit_small_patch16_224", patch_size, token_type)
96
- dim = 384
97
- elif name == "dino16":
98
- from denseav.featurizers.DINO import DINOFeaturizer
99
- patch_size = 16
100
- model = DINOFeaturizer("dino_vits16", patch_size, token_type)
101
- dim = 384
102
- elif name == "dino8":
103
- from denseav.featurizers.DINO import DINOFeaturizer
104
- patch_size = 8
105
- model = DINOFeaturizer("dino_vits8", patch_size, token_type)
106
- dim = 384
107
- elif name == "clip":
108
- from denseav.featurizers.CLIP import CLIPFeaturizer
109
- patch_size = 16
110
- model = CLIPFeaturizer()
111
- dim = 512
112
- elif name == "cavmae":
113
- from denseav.featurizers.CAVMAE import CAVMAEImageFeaturizer
114
- model = CAVMAEImageFeaturizer(kwargs["output_root"], model=kwargs.get("model"))
115
- dim = 768
116
- patch_size = 16
117
- elif name == "fnac":
118
- from denseav.featurizers.FNACAVL import FNACImageFeaturizer
119
- model = FNACImageFeaturizer(kwargs["output_root"], model=kwargs.get("model"))
120
- dim = 512
121
- patch_size = 16
122
- elif name == "imagebind":
123
- from denseav.featurizers.ImageBind import ImageBindImageFeaturizer
124
- model = ImageBindImageFeaturizer(kwargs["output_root"], model=kwargs.get("model"))
125
- dim = 1024
126
- patch_size = 16
127
- elif name == "resnet50":
128
- from torchvision import models
129
- model = models.resnet50(pretrained=True)
130
- model = torch.nn.Sequential(*list(model.children())[:-2])
131
- patch_size = 1
132
- dim = 2048
133
- elif name == "davenet":
134
- from denseav.featurizers.DAVENet import DavenetImageFeaturizer
135
- model = DavenetImageFeaturizer()
136
- patch_size = 1
137
- dim = 1024
138
- elif name == "dinov2":
139
- from denseav.featurizers.DINOv2 import DINOv2Featurizer
140
- model = DINOv2Featurizer()
141
- patch_size = 14
142
- dim = 768
143
- else:
144
- raise ValueError("unknown model: {}".format(name))
145
- return model, patch_size, dim
146
-
147
-
148
- def get_audio_featurizer(name, **kwargs):
149
- if name == "davenet":
150
- from denseav.featurizers.DAVENet import DavenetAudioFeaturizer
151
- model = DavenetAudioFeaturizer()
152
- dim = 1024
153
- elif name == "dino8":
154
- model, _, dim = get_image_featurizer("dino8")
155
- elif name == "hubert":
156
- from denseav.featurizers.Hubert import Hubert
157
- model = Hubert()
158
- dim = 1024
159
- elif name == "cavmae":
160
- from denseav.featurizers.CAVMAE import CAVMAEAudioFeaturizer
161
- model = CAVMAEAudioFeaturizer(kwargs["output_root"], model=kwargs.get("model"))
162
- dim = 768
163
- elif name == "imagebind":
164
- from denseav.featurizers.ImageBind import ImageBindAudioFeaturizer
165
- model = ImageBindAudioFeaturizer(kwargs["output_root"], model=kwargs.get("model"))
166
- dim = 1024
167
- elif name == "audiomae":
168
- from denseav.featurizers.AudioMAE import AudioMAE
169
- model = AudioMAE(kwargs["output_root"], False)
170
- dim = 768
171
- elif name == "audiomae-finetuned":
172
- from denseav.featurizers.AudioMAE import AudioMAE
173
- model = AudioMAE(kwargs["output_root"], True)
174
- dim = 768
175
- else:
176
- raise ValueError("Unknown audio model type")
177
-
178
- return model, dim
179
-
180
-
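
A minimal usage sketch for these factory functions (assuming the denseav package is importable and the backbone weights can be downloaded; "dino16" and "hubert" are just example choices):

import torch
from denseav.shared import get_image_featurizer, get_audio_featurizer

# "dino16" -> DINO ViT-S/16 backbone: patch size 16, 384-d patch features.
image_model, patch_size, image_dim = get_image_featurizer("dino16", token_type="key")
audio_model, audio_dim = get_audio_featurizer("hubert")

with torch.no_grad():
    # The featurizers are called with include_cls=True elsewhere in this repo
    # and return (dense_features, cls_token).
    image_feats, image_cls = image_model(torch.randn(1, 3, 224, 224), include_cls=True)

print(patch_size, image_dim, audio_dim)  # 16 384 1024
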
181
- def load_img(image_path, transform):
182
- return transform(Image.open(image_path)).unsqueeze(0)
183
-
184
-
185
- def pytorch_to_pil(tensor):
186
- return Image.fromarray((unnorm(tensor).permute(0, 2, 3, 1).cpu() * 255)
187
- .clamp(0, 255).to(torch.uint8).detach().numpy()[0])
188
-
189
-
190
- def _get_random_window(waveform, mask, min_size, max_size):
191
- effective_size = mask.sum().to(torch.int64)
192
- if effective_size <= min_size:
193
- return waveform, mask
194
- else:
195
- window_size = min(torch.randint(low=min_size, high=min(effective_size, max_size), size=()), waveform.shape[0])
196
- if window_size == waveform.shape[0]:
197
- window_start = 0
198
- else:
199
- window_start = torch.randint(low=0, high=effective_size - window_size, size=())
200
-
201
- new_waveform = torch.zeros_like(waveform)
202
- new_mask = torch.zeros_like(mask)
203
- new_waveform[window_start:window_start + window_size] = waveform[window_start:window_start + window_size]
204
- new_mask[window_start:window_start + window_size] = mask[window_start:window_start + window_size]
205
- return new_waveform, new_mask
206
-
207
-
208
- def _splice_clips(clip1, clip2, loc, easing_size):
209
- assert loc >= 0 and loc < len(clip1), "Invalid location"
210
- assert easing_size > 0 and easing_size <= len(clip2), "Invalid easing size"
211
-
212
- try:
213
- assert loc + clip2.shape[0] < clip1.shape[0]
214
- except Exception as e:
215
- print(loc, clip2.shape[0], clip1.shape[0])
216
- raise e
217
-
218
- # Split clip1 into three parts: before splice, easing region, after splice
219
- before_splice = clip1[:loc]
220
- after_splice = clip1[loc + clip2.shape[0]:]
221
-
222
- # Compute the fading weights for the easing region
223
- # fade_in_weights = torch.cos(torch.linspace(1, 0, easing_size, device=clip1.device))
224
- fade_in_weights = 0.5 * (1 + torch.cos(math.pi * torch.linspace(0, 1, easing_size)))
225
- fade_out_weights = 1 - fade_in_weights
226
-
227
- clip1_ease = torch.cat([
228
- fade_in_weights,
229
- torch.zeros(clip2.shape[0] - easing_size * 2),
230
- fade_out_weights,
231
- ])
232
-
233
- mask = torch.cat([torch.ones(loc), clip1_ease, torch.ones(clip1.shape[0] - (loc + clip2.shape[0]))])
234
-
235
- # Apply fading weights to clip1 and clip2 within the easing region
236
- splice = clip1_ease * clip1[loc:loc + clip2.shape[0]] + (1 - clip1_ease) * clip2
237
-
238
- # Concatenate all parts back together
239
- spliced_clip = torch.cat((before_splice, splice, after_splice))
240
-
241
- return spliced_clip, mask
242
-
243
-
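
An illustrative sketch of the cross-fade splicing above with two synthetic tones (the helper is private to this module, so the snippet assumes it is run in this file's scope; the durations are arbitrary):

import math
import torch

sr = 16000
clip1 = torch.sin(2 * math.pi * 440 * torch.arange(2 * sr) / sr)   # 2 s base clip
clip2 = torch.sin(2 * math.pi * 880 * torch.arange(sr // 2) / sr)  # 0.5 s clip to splice in

# Cross-fade clip2 into clip1 starting at 0.5 s with a 0.1 s cosine easing.
spliced, mask = _splice_clips(clip1, clip2, loc=sr // 2, easing_size=sr // 10)
assert spliced.shape == clip1.shape and mask.shape == clip1.shape
# mask stays 1 outside the splice and dips to 0 where clip2 fully replaces clip1.
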
244
- def _generate_random_subset(waveform, low, high):
245
- length = len(waveform)
246
-
247
- # If waveform is smaller than low or has zero length, return unmodified
248
- if length < low or length == 0:
249
- return waveform
250
-
251
- # Generate random start index within valid range
252
- start = random.randint(0, length - low)
253
-
254
- # Generate random subset size within valid range
255
- subset_size = random.randint(low, min(high, length - start))
256
-
257
- # Extract the random subset from the waveform
258
- subset = waveform[start: start + subset_size]
259
-
260
- return subset
261
-
262
-
263
- def level_audio(waveform):
264
- waveform -= waveform.mean()
265
- waveform /= waveform.abs().max().clamp_min(.0001)
266
- return waveform
267
-
268
-
269
- def prep_waveform(waveform,
270
- obs_sr,
271
- target_length,
272
- spec_mel_bins,
273
- spec_mean,
274
- spec_std,
275
- sample_rate,
276
- return_spec,
277
- random_clip,
278
- extra_audio_masking,
279
- neg_waveform,
280
- neg_obs_sr,
281
- audio_level,
282
- audio_aug,
283
- ):
284
- if obs_sr != sample_rate:
285
- waveform = resample(waveform, obs_sr, sample_rate)
286
- if audio_level:
287
- waveform = level_audio(waveform)
288
-
289
- if neg_obs_sr is not None and neg_obs_sr != sample_rate:
290
- neg_waveform = resample(neg_waveform, neg_obs_sr, sample_rate)
291
- if audio_level:
292
- neg_waveform = level_audio(neg_waveform)
293
-
294
- if neg_obs_sr is not None: # and random.random() > .5:
295
- neg_waveform_clip = _generate_random_subset(neg_waveform, sample_rate, sample_rate * 4)
296
- if waveform.shape[0] - neg_waveform_clip.shape[0] > 0:
297
- start = random.randint(0, waveform.shape[0] - neg_waveform_clip.shape[0] - 1)
298
- easing = max(int(neg_waveform_clip.shape[0] * 1 / 4), sample_rate // 2)
299
- easing = min(int(neg_waveform_clip.shape[0] * 1 / 2), easing)
300
- waveform, pos_mask = _splice_clips(waveform, neg_waveform_clip, start, easing_size=easing)
301
- else:
302
- waveform, pos_mask = waveform, torch.ones_like(waveform)
303
- else:
304
- waveform, pos_mask = waveform, torch.ones_like(waveform)
305
-
306
- mask = torch.ones_like(waveform)
307
- original_length = waveform.shape[0]
308
-
309
- if target_length == 10:
310
- target_samples = 164200 # Result is 1024 after spec
311
- else:
312
- target_samples = int(target_length * sample_rate)
313
-
314
- padding = target_samples - original_length
315
-
316
- if padding > 0:
317
- p = torch.nn.ZeroPad2d((0, padding))
318
- waveform = p(waveform)
319
- mask = p(mask)
320
- pos_mask = p(pos_mask)
321
- else:
322
- if random_clip:
323
- start = torch.randint(0, waveform.shape[0] - target_samples, size=())
324
- else:
325
- start = 0
326
- end = start + target_samples
327
- waveform = waveform[start:end]
328
- mask = mask[start:end]
329
- pos_mask = pos_mask[start:end]
330
-
331
- audio_length = min(original_length, target_samples)
332
- total_length = target_samples
333
-
334
- if extra_audio_masking:
335
- min_size = sample_rate // 2
336
- max_size = total_length
337
- if original_length > min_size and random.random() > .5:
338
- waveform, mask = _get_random_window(waveform, mask, min_size, max_size)
339
-
340
- if audio_aug:
341
- import torchaudio_augmentations as AA
342
- from torchvision.transforms import RandomApply, Compose
343
-
344
- transform = Compose([
345
- RandomApply([AA.PolarityInversion()], p=0.5),
346
- RandomApply([AA.Noise(min_snr=0.001, max_snr=0.005)], p=0.2),
347
- RandomApply([AA.Gain()], p=0.2),
348
- RandomApply([AA.HighLowPass(sample_rate=sample_rate)], p=0.2),
349
- RandomApply([AA.PitchShift(n_samples=waveform.shape[-1], sample_rate=sample_rate)], p=0.2),
350
- RandomApply([AA.Reverb(sample_rate=sample_rate)], p=0.2)
351
- ])
352
- waveform = transform(waveform.unsqueeze(0)).squeeze(0)
353
-
354
- if return_spec:
355
- spectrogram = torchaudio.compliance.kaldi.fbank(
356
- waveform.unsqueeze(0) - waveform.mean(),
357
- htk_compat=True,
358
- sample_frequency=sample_rate,
359
- use_energy=False,
360
- window_type='hanning',
361
- num_mel_bins=spec_mel_bins,
362
- dither=0.0,
363
- frame_shift=10)
364
-
365
- spectrogram = ((spectrogram - spec_mean) / spec_std).unsqueeze(0)
366
- else:
367
- spectrogram = None
368
-
369
- if mask.mean() < .04:
370
- print(f"Bad entry: {mask.mean()}")
371
-
372
- return waveform, spectrogram, audio_length, total_length, original_length, mask, pos_mask
373
-
374
-
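
A rough sketch of calling prep_waveform on a short clip (the spectrogram normalization stats and flags below are illustrative placeholders, not the project's actual config values):

import torch

waveform = torch.randn(3 * 16000)  # ~3 s of fake 16 kHz audio

outputs = prep_waveform(
    waveform,
    obs_sr=16000,            # sample rate of the raw file
    target_length=10,        # pad/crop to a 10 s window
    spec_mel_bins=128,
    spec_mean=-4.0,          # placeholder normalization stats
    spec_std=4.0,
    sample_rate=16000,
    return_spec=True,
    random_clip=False,
    extra_audio_masking=False,
    neg_waveform=None,
    neg_obs_sr=None,
    audio_level=False,
    audio_aug=False,
)
waveform10, spec, audio_len, total_len, orig_len, mask, pos_mask = outputs
# With target_length == 10 the waveform is padded to 164200 samples,
# which yields a 1024-frame mel spectrogram.
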
375
- class ToTargetTensor(object):
376
- def __call__(self, target):
377
- return torch.as_tensor(np.array(target), dtype=torch.int64).unsqueeze(0)
378
-
379
-
380
- def show_heatmap(ax,
381
- image,
382
- heatmap,
383
- cmap="bwr",
384
- color=False,
385
- center=False,
386
- show_negative=False,
387
- cax=None,
388
- vmax=None,
389
- vmin=None):
390
- frame = []
391
-
392
- if color:
393
- frame.append(ax.imshow(image))
394
- else:
395
- bw = np.dot(np.array(image)[..., :3] / 255, [0.2989, 0.5870, 0.1140])
396
- bw = np.ones_like(image) * np.expand_dims(bw, -1)
397
- frame.append(ax.imshow(bw))
398
-
399
- if center:
400
- heatmap -= heatmap.mean()
401
-
402
- if not show_negative:
403
- heatmap = heatmap.clamp_min(0)
404
-
405
- heatmap = F.interpolate(heatmap.unsqueeze(0).unsqueeze(0), (image.shape[0], image.shape[1])) \
406
- .squeeze(0).squeeze(0)
407
-
408
- if vmax is None:
409
- vmax = np.abs(heatmap).max()
410
- if vmin is None:
411
- vmin = -vmax
412
-
413
- hm = ax.imshow(heatmap, alpha=.5, cmap=cmap, vmax=vmax, vmin=vmin)
414
- if cax is not None:
415
- plt.colorbar(hm, cax=cax, orientation='vertical')
416
-
417
- frame.extend([hm])
418
- return frame
419
-
420
-
421
- class TorchPCA(object):
422
-
423
- def __init__(self, n_components):
424
- self.n_components = n_components
425
-
426
- def fit(self, X):
427
- self.mean_ = X.mean(dim=0)
428
- unbiased = X - self.mean_.unsqueeze(0)
429
- U, S, V = torch.pca_lowrank(unbiased, q=self.n_components, center=False, niter=4)
430
- self.components_ = V.T
431
- self.singular_values_ = S
432
- return self
433
-
434
- def transform(self, X):
435
- t0 = X - self.mean_.unsqueeze(0)
436
- projected = t0 @ self.components_.T
437
- return projected
438
-
439
-
440
- def pca(image_feats_list, dim=3, fit_pca=None):
441
- device = image_feats_list[0].device
442
-
443
- def flatten(tensor, target_size=None):
444
- if target_size is not None and fit_pca is None:
445
- tensor = F.interpolate(tensor, (target_size, target_size), mode="bilinear")
446
- B, C, H, W = tensor.shape
447
- return tensor.permute(1, 0, 2, 3).reshape(C, B * H * W).permute(1, 0).detach().cpu()
448
-
449
- if len(image_feats_list) > 1 and fit_pca is None:
450
- target_size = image_feats_list[0].shape[2]
451
- else:
452
- target_size = None
453
-
454
- flattened_feats = []
455
- for feats in image_feats_list:
456
- flattened_feats.append(flatten(feats, target_size))
457
- x = torch.cat(flattened_feats, dim=0)
458
-
459
- if fit_pca is None:
460
- # fit_pca = PCA(n_components=dim, svd_solver='arpack').fit(np.nan_to_num(x.detach().numpy()))
461
- fit_pca = TorchPCA(n_components=dim).fit(x)
462
-
463
- reduced_feats = []
464
- for feats in image_feats_list:
465
- # x_red = torch.from_numpy(fit_pca.transform(flatten(feats)))
466
- x_red = fit_pca.transform(flatten(feats))
467
- x_red -= x_red.min(dim=0, keepdim=True).values
468
- x_red /= x_red.max(dim=0, keepdim=True).values
469
- B, C, H, W = feats.shape
470
- reduced_feats.append(x_red.reshape(B, H, W, dim).permute(0, 3, 1, 2).to(device))
471
-
472
- return reduced_feats, fit_pca
473
-
474
-
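
A small sketch of using pca (and TorchPCA underneath) to reduce dense feature maps to 3 channels for visualization; random tensors stand in for real model features:

import torch

feats = torch.randn(2, 384, 14, 14)       # B, C, H, W feature map
[rgb_feats], fit = pca([feats], dim=3)    # fit a 3-component PCA and project
print(rgb_feats.shape)                    # torch.Size([2, 3, 14, 14]), values scaled to [0, 1]

# Reuse the same fit to project other features into the same basis.
more_feats = torch.randn(2, 384, 14, 14)
[rgb_more], _ = pca([more_feats], dim=3, fit_pca=fit)
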
475
- def merge_col(fig, axes, col):
476
- gs = axes[0, col].get_gridspec()
477
- for ax in axes[:, col]:
478
- ax.remove()
479
- return fig.add_subplot(gs[:, col])
480
-
481
-
482
- def visualize_av_features(
483
- audio,
484
- video,
485
- feat_a,
486
- feat_v,
487
- att_a,
488
- n_frames,
489
- norm_before_pca=True,
490
- axes=None,
491
- fig=None,
492
- modify_fig=True,
493
- video_time=0,
494
- fit_pca=None
495
- ):
496
- assert (len(audio.shape) == 3) # C, F, T
497
- assert (len(video.shape) == 4) # T, C, H, W
498
- assert (len(feat_a.shape) == 2) # C, T
499
- assert (len(feat_v.shape) == 4) # T, C, H, W
500
- assert (len(att_a.shape) == 2) # F, T
501
-
502
- ac, af, at = audio.shape
503
- fac, fat = feat_a.shape
504
-
505
- if modify_fig:
506
- if axes is None:
507
- fig, axes = plt.subplots(3, 3, figsize=(5 * 3, 5))
508
- fig.tight_layout()
509
-
510
- bigax1 = merge_col(fig, axes, 0)
511
- bigax2 = merge_col(fig, axes, 1)
512
- _remove_axes(bigax1)
513
- _remove_axes(bigax2)
514
- remove_axes(axes[:, 2])
515
- else:
516
- bigax1 = fig.axes[-2]
517
- bigax2 = fig.axes[-1]
518
-
519
- frame_v = unnorm(video).permute(0, 2, 3, 1).detach().cpu()
520
- frame_v -= frame_v.min()
521
- frame_v /= frame_v.max()
522
-
523
- frame_a = audio.detach().cpu()
524
- frame_a -= frame_a.min()
525
- frame_a /= frame_a.max()
526
-
527
- if norm_before_pca:
528
- [red_feat_v], fit_pca = pca([F.normalize(feat_v, dim=1)], fit_pca=fit_pca)
529
- [red_feat_a], _ = pca([F.normalize(feat_a.unsqueeze(0).unsqueeze(-1), dim=1)], fit_pca=fit_pca)
530
- else:
531
- [red_feat_v], fit_pca = pca([feat_v], fit_pca=fit_pca)
532
- [red_feat_a], _ = pca([feat_a.unsqueeze(0).unsqueeze(-1)], fit_pca=fit_pca)
533
-
534
- red_feat_v = red_feat_v.permute(0, 2, 3, 1).detach().cpu()
535
- red_feat_a = red_feat_a.permute(0, 2, 3, 1)[0].detach().cpu()
536
-
537
- if red_feat_a.shape[0] == 1:
538
- new_height = int((frame_a.shape[0] / frame_a.shape[1]) * red_feat_a.shape[1])
539
- red_feat_a = torch.broadcast_to(
540
- red_feat_a, (new_height, red_feat_a.shape[1], red_feat_a.shape[2]))
541
- plt_att_a = torch.broadcast_to(att_a, (new_height, att_a.shape[1]))
542
- else:
543
- plt_att_a = att_a
544
-
545
- frac_signal = n_frames / fat
546
- n_at = int(at * frac_signal)
547
-
548
- return [bigax1.imshow(frame_v[video_time]),
549
- bigax2.imshow(red_feat_v[video_time]),
550
- axes[0, 2].imshow(frame_a[:, :n_at]),
551
- axes[0, 2].set_title("Spectrogram"),
552
- axes[1, 2].imshow(red_feat_a[:, :n_frames]),
553
- axes[1, 2].set_title("Audio Features"),
554
- axes[2, 2].imshow(plt_att_a[:, :n_frames], vmin=0),
555
- axes[2, 2].set_title("Audio Attention")], fig, fit_pca
556
-
557
-
558
- def create_label_tensor(labels, starts, ends, max_time, n_steps):
559
- assert isinstance(starts, torch.Tensor)
560
- assert isinstance(ends, torch.Tensor)
561
-
562
- ends[ends < 0] = max_time
563
- fps = n_steps / max_time
564
- times = (torch.arange(0, n_steps, device=labels.device, dtype=torch.float32) + .5) / fps
565
- after_start = starts.unsqueeze(1) <= times.unsqueeze(0)
566
- before_end = ends.unsqueeze(1) >= times.unsqueeze(0)
567
- # Find when you are inside of a word
568
- in_word = (after_start * before_end)
569
- # Find which word you are inside of
570
- word_to_use = in_word.to(torch.float32).argmax(0)
571
- # Get the label for that word, or mask out the label if in no word
572
- final_labels = labels[word_to_use] * in_word.any(0).reshape(-1, 1, 1)
573
- return final_labels
574
-
575
-
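
A worked example of create_label_tensor: two "words" with known timings are expanded into a per-timestep label tensor (the label shapes here are illustrative):

import torch

# Two words, each carrying a one-hot class map of shape (3, 1).
labels = torch.tensor([[[1.], [0.], [0.]],
                       [[0.], [1.], [0.]]])
starts = torch.tensor([0.0, 1.0])   # word 0 spans 0-1 s, word 1 starts at 1 s
ends = torch.tensor([1.0, -1.0])    # -1 means "until the end of the clip"

out = create_label_tensor(labels, starts, ends, max_time=2.0, n_steps=4)
print(out.shape)  # torch.Size([4, 3, 1]); steps 0-1 carry word 0, steps 2-3 carry word 1
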
576
- def generate_subset(n, batch, seed=0):
577
- np.random.seed(seed)
578
- return np.random.permutation(n)[:batch]
579
-
580
-
581
- def channel_blur(t, window=5, std_dev=1):
582
- tb, tc, th, tw = t.shape
583
- x = torch.linspace(-2, 2, window, device=t.device, dtype=torch.float32)
584
- k = torch.exp((-x ** 2 / (2 * std_dev ** 2)))
585
- k = k / k.sum()
586
- pad = window // 2
587
- t_pad = F.pad(t, [0, 0, 0, 0, pad, pad], mode="replicate")
588
- tpb, tpc, tph, tpw = t_pad.shape
589
- flattened_t = t_pad.permute(0, 2, 3, 1).reshape(tpb * tph * tpw, 1, -1)
590
- return F.conv1d(flattened_t, k.reshape(1, 1, window)).reshape(tpb, tph, tpw, tc).permute(0, 3, 1, 2)
591
-
592
-
593
- def time_blur(t, window=5, std_dev=1):
594
- tb, tc, tt = t.shape
595
- with torch.no_grad():
596
- x = torch.linspace(-2, 2, window, device=t.device, dtype=torch.float32)
597
- k = torch.exp((-x ** 2 / (2 * std_dev ** 2)))
598
- k = k / k.sum()
599
- k = k.reshape(1, 1, window).detach()
600
- pad = window // 2
601
- t_pad = F.pad(t, [pad, pad], mode="replicate")
602
- return F.conv1d(t_pad.reshape(tb * tc, 1, -1), k).reshape(tb, tc, tt)
603
-
604
-
605
- def create_model_from_cfg(clazz, cfg, extra_args):
606
- import inspect
607
- expected_args = inspect.getfullargspec(clazz.__init__).args[1:]
608
- new_args = {k: v for k, v in {**cfg, **extra_args}.items() if k in expected_args}
609
- return clazz(**new_args)
610
-
611
-
612
- def load_trained_model(chkpt_dir, extra_args, strict=True):
613
- from denseav.train import LitAVAligner
614
- model = LitAVAligner.load_from_checkpoint(chkpt_dir, **extra_args, strict=strict).cuda()
615
- return model
616
-
617
-
618
- def flatten(l):
619
- return [item for sublist in l for item in sublist]
620
-
621
-
622
- def flatten_preds(preds):
623
- results = {}
624
- for k in preds[0].keys():
625
- if k == "caption_labels":
626
- continue
627
- if isinstance(preds[0][k], torch.Tensor):
628
- results[k] = torch.cat([p[k] for p in preds], dim=0)
629
- if "caption" in preds[0]:
630
- results["caption"] = flatten([p["caption"] for p in preds])
631
-
632
- if "metadata" in preds[0]:
633
- results["frame_files"] = flatten([list(p["metadata"]["frame_files"][0]) for p in preds])
634
- results["audio_file"] = flatten([list(p["metadata"]["audio_file"]) for p in preds])
635
- results["id"] = flatten([list(p["metadata"]["id"]) for p in preds])
636
- results["index"] = torch.tensor(flatten([list(p["metadata"]["index"]) for p in preds]))
637
-
638
- return results
639
-
640
-
641
- def batch(iterable, n=1):
642
- l = len(iterable)
643
- for ndx in range(0, l, n):
644
- yield iterable[ndx:min(ndx + n, l)]
645
-
646
-
647
- class GatherLayer(torch.autograd.Function):
648
- """Gather tensors from all process, supporting backward propagation."""
649
-
650
- @staticmethod
651
- def jvp(ctx: Any, *grad_inputs: Any) -> Any:
652
- pass
653
-
654
- @staticmethod
655
- def forward(ctx, inputs):
656
- ctx.save_for_backward(inputs)
657
- output = [torch.zeros_like(inputs) for _ in range(dist.get_world_size())]
658
- dist.all_gather(output, inputs)
659
- return tuple(output)
660
-
661
- @staticmethod
662
- def backward(ctx, *grads):
663
- (inputs,) = ctx.saved_tensors
664
- grad_out = torch.zeros_like(inputs)
665
- grad_out[:] = grads[dist.get_rank()]
666
- return grad_out
667
-
668
-
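
A minimal sketch of how GatherLayer is used: in a distributed training step it lets every rank see all ranks' embeddings while gradients still flow back to the local shard, mirroring how train.py gathers predictions across GPUs (it only does anything when torch.distributed is initialized):

import torch
import torch.distributed as dist

def gather_with_grad(local_feats):
    # local_feats: (B, D) embeddings computed on this rank.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        # GatherLayer.apply returns one tensor per rank; concatenate into a global batch.
        return torch.cat(GatherLayer.apply(local_feats.contiguous()), dim=0)
    return local_feats  # single-process fallback
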
669
- class RollingAvg:
670
-
671
- def __init__(self, length, nonzero=False):
672
- self.length = length
673
- self.nonzero = nonzero
674
- self.metrics = defaultdict(lambda: deque(maxlen=self.length))
675
-
676
- def add(self, name, metric):
677
- if self.nonzero and metric == 0:
678
- return
679
- if isinstance(metric, torch.Tensor):
680
- metric = metric.detach()
681
-
682
- self.metrics[name].append(metric)
683
-
684
- def get(self, name):
685
- with torch.no_grad():
686
- return torch.tensor(list(self.metrics[name])).mean()
687
-
688
- def get_all(self):
689
- return {k: self.get(k) for k in self.metrics.keys()}
690
-
691
- def add_all(self, values):
692
- for k, v in values.items():
693
- self.add(k, v)
694
-
695
- def logall(self, log_func):
696
- for k in self.metrics.keys():
697
- log_func(k, self.get(k))
698
-
699
-
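
A short sketch of RollingAvg: it keeps the last N values per metric name and reports their running mean, which is how the training code smooths its logs:

avg = RollingAvg(3)
for value in [1.0, 2.0, 3.0, 4.0]:
    avg.add("loss/total", value)

print(avg.get("loss/total"))                       # tensor(3.) -- mean of the last 3 values
avg.logall(lambda name, v: print(name, float(v)))  # emit every tracked metric
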
700
- def gaussian_kernel(k, sigma):
701
- kernel = torch.tensor([math.exp(-0.5 * (x - (k // 2)) ** 2 / sigma ** 2) for x in range(k)], dtype=torch.float32)
702
- kernel /= kernel.sum() # Normalize the kernel
703
- return kernel
704
-
705
-
706
- def blur_dim(t, window=5, std_dev=1, dim=-1):
707
- shape = t.shape
708
- n_dims = len(shape)
709
-
710
- # Create the Gaussian kernel
711
- with torch.no_grad():
712
- x = torch.linspace(-2, 2, window, device=t.device, dtype=torch.float32)
713
- k = torch.exp(-x ** 2 / (2 * std_dev ** 2))
714
- k = k / k.sum()
715
- k = k.view(1, 1, window).detach()
716
-
717
- # Calculate padding
718
- pad = window // 2
719
-
720
- # Move the target dimension to the end
721
- permute_order = list(range(n_dims))
722
- permute_order.append(permute_order.pop(dim))
723
- t_permuted = t.permute(permute_order)
724
-
725
- # Flatten all dimensions except the last one
726
- new_shape = (-1, t_permuted.size(-1))
727
- t_flattened = t_permuted.reshape(new_shape)
728
-
729
- # Pad the tensor
730
- t_padded = F.pad(t_flattened.unsqueeze(1), (pad, pad), mode="replicate")
731
-
732
- # Apply convolution
733
- blurred = F.conv1d(t_padded, k)
734
-
735
- # Reshape back to original
736
- blurred = blurred.squeeze(1).reshape(*t_permuted.shape)
737
- blurred = blurred.permute([permute_order.index(i) for i in range(n_dims)])
738
-
739
- return blurred
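
A quick sketch of blur_dim, applying the Gaussian smoothing along the time axis of a (batch, heads, time) tensor:

import torch

sims = torch.randn(2, 8, 100)                   # e.g. per-head similarity scores over time
smoothed = blur_dim(sims, window=5, std_dev=1, dim=-1)
print(smoothed.shape)                           # torch.Size([2, 8, 100]), same shape as the input
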
 
DenseAV/denseav/train.py DELETED
@@ -1,1213 +0,0 @@
1
- import os
2
- from collections import deque
3
- from itertools import combinations
4
- from os.path import join
5
-
6
- import hydra
7
- import numpy as np
8
- import pytorch_lightning as pl
9
- import torch
10
- import torch.distributed as dist
11
- import torch.nn.functional as F
12
- from omegaconf import DictConfig, OmegaConf
13
- from peft import get_peft_model, LoraConfig
14
- from pytorch_lightning import Trainer
15
- from pytorch_lightning import seed_everything
16
- from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint
17
- from pytorch_lightning.loggers import TensorBoardLogger
18
- from pytorch_lightning.utilities import grad_norm
19
- from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts, SequentialLR, LambdaLR
20
- from torchmetrics.functional.classification import binary_average_precision
21
-
22
- from huggingface_hub import PyTorchModelHubMixin
23
-
24
- from denseav.aggregators import get_aggregator
25
- from denseav.aligners import get_aligner, ProgressiveGrowing
26
- from denseav.constants import *
27
- from denseav.data.AVDatasets import AVDataModule
28
- from denseav.shared import flatten_preds, GatherLayer, \
29
- get_image_featurizer, get_audio_featurizer, RollingAvg, create_model_from_cfg
30
-
31
- torch.multiprocessing.set_sharing_strategy('file_system')
32
-
33
-
34
- def _imposter_indices_helper(true_indices: torch.Tensor, samples: torch.Tensor):
35
- mask = (true_indices == samples).to(torch.int64)
36
- n = mask.shape[0]
37
-
38
- if not mask.any():
39
- return samples
40
- else:
41
- new_samples = torch.randint(0, n, size=(n,), device=true_indices.device)
42
- comb_samples = mask * new_samples + (1 - mask) * samples
43
- return _imposter_indices_helper(true_indices, comb_samples)
44
-
45
-
46
- def imposter_indices(n, device):
47
- return _imposter_indices_helper(
48
- torch.arange(0, n, device=device),
49
- torch.randint(0, n, size=(n,), device=device))
50
-
51
-
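
A small sketch of imposter_indices: it draws, for each batch element, a random index that is never its own position, which the ranking loss below uses to pick impostor pairs:

import torch

idx = imposter_indices(8, device=torch.device("cpu"))
print(idx)                                   # 8 random indices in [0, 8)
assert not (idx == torch.arange(8)).any()    # never the "true" pairing
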
52
- def get_sim_per_row(image_outputs, audio_outputs, n_frames, sim_type):
53
- max_t = audio_outputs.shape[-1]
54
- oh = F.one_hot(n_frames - 1, num_classes=max_t)
55
- audio_mask = 1 - torch.cumsum(oh, dim=1)
56
- audio_mask = F.pad(audio_mask, [1, 0], value=1)[:, :max_t].to(audio_outputs.dtype)
57
-
58
- full_sim = torch.einsum("bct,bchw->bthw", audio_outputs, image_outputs)
59
- expanded_am = audio_mask.unsqueeze(-1).unsqueeze(-1)
60
-
61
- if sim_type.endswith("mi"):
62
- offset = 10 * (full_sim.max() - full_sim.min())
63
- full_sim = (full_sim - ((1 - expanded_am) * offset)).max(1, keepdim=True).values
64
-
65
- if sim_type.startswith("mi"):
66
- full_sim = full_sim.max(-1, keepdim=True).values.max(-2, keepdim=True).values
67
-
68
- if sim_type.endswith("sa"):
69
- full_sim = (full_sim * (expanded_am / expanded_am.sum(1, keepdim=True).clamp_min(1))).sum(1, keepdim=True)
70
-
71
- return full_sim.mean(dim=[1, 2, 3])
72
-
73
-
74
- def sampled_margin_rank_loss(image_outputs, audio_outputs, n_frames, sim_type, margin=1.):
75
- """
76
- Computes the triplet margin ranking loss for each anchor image/caption pair
77
- The impostor image/caption is randomly sampled from the minibatch
78
- """
79
- assert (image_outputs.dim() == 4)
80
- assert (audio_outputs.dim() == 3)
81
- n = image_outputs.size(0)
82
- imp_ind_i = imposter_indices(n, image_outputs.device)
83
- imp_ind_a = imposter_indices(n, image_outputs.device)
84
- true_sim = get_sim_per_row(image_outputs, audio_outputs, n_frames, sim_type)
85
- imp_sim_i = get_sim_per_row(image_outputs[imp_ind_i], audio_outputs, n_frames, sim_type)
86
- imp_sim_a = get_sim_per_row(image_outputs, audio_outputs[imp_ind_a], n_frames[imp_ind_a], sim_type)
87
- a2i_loss = (margin + imp_sim_i - true_sim).clamp_min(0)
88
- i2a_loss = (margin + imp_sim_a - true_sim).clamp_min(0)
89
- return (a2i_loss + i2a_loss).mean() / 2
90
-
91
-
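
A minimal sketch of calling the ranking loss on random features (shapes follow the asserts above; "misa" is just an example sim_type that maxes over image locations and averages over audio frames):

import torch

image_outputs = torch.randn(4, 512, 14, 14)   # B, C, H, W
audio_outputs = torch.randn(4, 512, 64)       # B, C, T
n_frames = torch.tensor([64, 50, 32, 64])     # valid audio frames per clip

loss = sampled_margin_rank_loss(image_outputs, audio_outputs, n_frames, sim_type="misa")
print(loss)  # scalar margin loss averaged over the audio->image and image->audio directions
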
92
- class SimilarityCalibrator(torch.nn.Module):
93
-
94
- def __init__(self, cal_init, max_w=100, min_w=.01, subtract_mean=True, use_bias=False):
95
- super().__init__()
96
- self.max_w = max_w
97
- self.min_w = min_w
98
- self.w = torch.nn.Parameter(torch.tensor([cal_init]).log())
99
-
100
- self.use_bias = use_bias
101
- if self.use_bias:
102
- self.b = torch.nn.Parameter(torch.tensor([0.0]))
103
-
104
- self.subtract_mean = subtract_mean
105
-
106
- def get_w(self):
107
- return torch.exp(self.w).clamp_max(self.max_w).clamp_min(self.min_w)
108
-
109
- def forward(self, x):
110
- sims = self.get_w() * x
111
-
112
- if self.use_bias:
113
- sims = sims + self.b
114
-
115
- if self.subtract_mean:
116
- return sims - sims.mean()
117
- else:
118
- return sims
119
-
120
-
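
A tiny sketch of the calibrator: it learns a positive temperature (and optional bias) that rescales raw similarity logits before the contrastive loss; cal_init=7.0 is only an example value:

import torch

cal = SimilarityCalibrator(cal_init=7.0, subtract_mean=True, use_bias=False)
raw_sims = torch.randn(4, 4)      # audio-vs-image similarity matrix
calibrated = cal(raw_sims)        # scaled by exp(w), clamped to [0.01, 100], then mean-centered
print(cal.get_w())                # starts near 7.0 and is learned during training
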
121
- class SpatialDropout(torch.nn.Module):
122
-
123
- def __init__(self, p, *args, **kwargs):
124
- super().__init__(*args, **kwargs)
125
- self.p = p
126
-
127
- def forward(self, x):
128
- b, c, h, w = x.shape
129
- dropout = torch.rand((b, 1, h, w), dtype=x.dtype, device=x.device) > self.p
130
-
131
- if self.training:
132
- return x * dropout
133
- else:
134
- return x
135
-
136
-
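
A short sketch of SpatialDropout: during training it zeroes entire spatial positions (all channels at once) with probability p, and is an identity in eval mode:

import torch

drop = SpatialDropout(p=0.2)
feats = torch.randn(2, 64, 14, 14)

drop.train()
dropped = drop(feats)   # roughly 20% of the 14x14 positions are zeroed across all 64 channels

drop.eval()
assert torch.equal(drop(feats), feats)  # no-op at inference time
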
137
- class LitAVAligner(pl.LightningModule, PyTorchModelHubMixin, repo_url="https://github.com/mhamilton723/DenseAV", license="mit", tags=["denseav"]):
138
- def __init__(self,
139
- code_dim,
140
- image_model_type,
141
- image_model_token_type,
142
- image_aligner_type,
143
- image_pool_width,
144
- audio_model_type,
145
- audio_aligner_type,
146
- audio_pool_width,
147
- audio_lora,
148
- audio_lora_rank,
149
- image_lora,
150
- image_lora_rank,
151
- gradient_clipping,
152
- learn_audio_cls,
153
- silence_l1,
154
- silence_l2,
155
- tv_weight,
156
- nonneg_sim,
157
- nonneg_pressure,
158
- pretrain_lr,
159
- lr,
160
- lr_warmup,
161
- lr_schedule,
162
- lr_cycle_length,
163
- optimizer,
164
- gather_tensors,
165
- sim_agg_type,
166
- sim_agg_heads,
167
- sim_use_cls,
168
- disentangle_weight,
169
- norm_vectors,
170
- cal_init,
171
- cal_balance_weight,
172
- loss_type,
173
- loss_margin,
174
- mask_silence,
175
- finetune_image_model,
176
- finetune_audio_model,
177
- use_cached_embs,
178
- output_root,
179
- neg_audio,
180
- neg_audio_weight,
181
- head_agg,
182
- adaptive_clipping,
183
- specialization_weight,
184
- spatial_dropout,
185
- channel_dropout,
186
- mixup_weight,
187
- memory_buffer_size,
188
- loss_leak,
189
- ):
190
- super().__init__()
191
-
192
- self.code_dim = code_dim
193
- self.image_model_type = image_model_type
194
- self.image_model_token_type = image_model_token_type
195
- self.image_aligner_type = image_aligner_type
196
- self.image_pool_width = image_pool_width
197
- self.audio_model_type = audio_model_type
198
- self.audio_aligner_type = audio_aligner_type
199
- self.audio_pool_width = audio_pool_width
200
-
201
- self.gradient_clipping = gradient_clipping
202
- self.learn_audio_cls = learn_audio_cls
203
- self.silence_l1 = silence_l1
204
- self.silence_l2 = silence_l2
205
-
206
- self.tv_weight = tv_weight
207
- self.nonneg_sim = nonneg_sim
208
- self.nonneg_pressure = nonneg_pressure
209
- self.pretrain_lr = pretrain_lr
210
- self.lr = lr
211
- self.lr_warmup = lr_warmup
212
- self.lr_schedule = lr_schedule
213
- self.lr_cycle_length = lr_cycle_length
214
- self.optimizer = optimizer
215
- self.gather_tensors = gather_tensors
216
- self.sim_agg_type = sim_agg_type
217
- self.sim_agg_heads = sim_agg_heads
218
- self.sim_use_cls = sim_use_cls
219
- self.disentangle_weight = disentangle_weight
220
-
221
- self.norm_vectors = norm_vectors
222
- self.cal_init = cal_init
223
- self.cal_balance_weight = cal_balance_weight
224
- self.loss_type = loss_type
225
- self.loss_margin = loss_margin
226
- self.mask_silence = mask_silence
227
- self.finetune_image_model = finetune_image_model
228
- self.finetune_audio_model = finetune_audio_model
229
- self.use_cached_embs = use_cached_embs
230
- self.output_root = output_root
231
- self.audio_lora = audio_lora
232
- self.audio_lora_rank = audio_lora_rank
233
- self.image_lora = image_lora
234
- self.image_lora_rank = image_lora_rank
235
- self.neg_audio = neg_audio
236
- self.neg_audio_weight = neg_audio_weight
237
- self.head_agg = head_agg
238
-
239
- self.adaptive_clipping = adaptive_clipping
240
- self.specialization_weight = specialization_weight
241
- self.spatial_dropout = spatial_dropout
242
- self.channel_dropout = channel_dropout
243
- self.mixup_weight = mixup_weight
244
-
245
- self.memory_buffer_size = memory_buffer_size
246
- self.memory_buffer = deque(maxlen=self.memory_buffer_size)
247
- self.loss_leak = loss_leak
248
-
249
- if self.audio_model_type in {"audiomae", "audiomae-finetuned", "cavmae", "cavmae-mixed", "imagebind"}:
250
- self.audio_input = "spec"
251
- elif self.audio_model_type == "davenet":
252
- self.audio_input = "davenet_spec"
253
- elif self.audio_model_type == "fnac":
254
- self.audio_input = "fnac_spec"
255
- else:
256
- self.audio_input = "audio"
257
-
258
- extra_model_args = dict(output_root=output_root)
259
-
260
- self.image_model, _, self.image_feat_dim = get_image_featurizer(
261
- image_model_type, token_type=self.image_model_token_type, **extra_model_args)
262
-
263
- self.image_model.eval()
264
- if not self.finetune_image_model:
265
- for param in self.image_model.parameters():
266
- param.requires_grad = False
267
-
268
- if image_model_type in {"cavmae", "cavmae-mixed", "imagebind", "fnac"}:
269
- extra_model_args["model"] = self.image_model.model
270
-
271
- if use_cached_embs:
272
- _, self.audio_feat_dim = get_audio_featurizer(audio_model_type, **extra_model_args)
273
- else:
274
- self.audio_model, self.audio_feat_dim = get_audio_featurizer(audio_model_type, **extra_model_args)
275
-
276
- self.audio_model.eval()
277
- if not self.finetune_audio_model:
278
- for param in self.audio_model.parameters():
279
- param.requires_grad = False
280
-
281
- if self.image_lora:
282
- if self.image_model_type in {"sam", "dino8", "dinov2", "cavmae", "cavmae-mixed"}:
283
- target_modules = ["qkv"]
284
- elif self.image_model_type == "clip":
285
- target_modules = ["out_proj"]
286
- elif self.image_model_type == "imagebind":
287
- target_modules = ["out_proj", "fc1", "fc2"]
288
- else:
289
- target_modules = ["q", "k", "v"]
290
-
291
- peft_config = LoraConfig(
292
- target_modules=target_modules,
293
- inference_mode=False,
294
- r=image_lora_rank,
295
- lora_alpha=32,
296
- lora_dropout=0.1
297
- )
298
- self.image_model = get_peft_model(self.image_model, peft_config)
299
- self.image_model.print_trainable_parameters()
300
-
301
- if self.audio_lora:
302
- if self.audio_model_type == "hubert":
303
- target_modules = ["q_proj", "k_proj", "v_proj"]
304
- else:
305
- target_modules = ["q", "k", "v"]
306
-
307
- peft_config = LoraConfig(
308
- inference_mode=False,
309
- target_modules=target_modules,
310
- r=audio_lora_rank,
311
- lora_alpha=32,
312
- lora_dropout=0.1
313
- )
314
- self.audio_model = get_peft_model(self.audio_model, peft_config)
315
- self.audio_model.print_trainable_parameters()
316
-
317
- shared_aligner_args = dict(out_dim=self.code_dim)
318
-
319
- self.audio_aligner = get_aligner(
320
- self.audio_aligner_type, self.audio_feat_dim, **shared_aligner_args)
321
- self.image_aligner = get_aligner(
322
- self.image_aligner_type, self.image_feat_dim, **shared_aligner_args)
323
-
324
- if self.loss_type == "nce":
325
- self.sim_cal = SimilarityCalibrator(self.cal_init, subtract_mean=True, use_bias=False)
326
- else:
327
- self.sim_cal = SimilarityCalibrator(self.cal_init, subtract_mean=False, use_bias=True)
328
-
329
- if self.learn_audio_cls:
330
- self.audio_cls = torch.nn.Parameter(torch.randn(self.audio_feat_dim))
331
-
332
- if self.spatial_dropout > 0.0:
333
- self.spatial_dropout_layer = SpatialDropout(self.spatial_dropout)
334
-
335
- if self.channel_dropout > 0.0:
336
- self.channel_dropout_layer = torch.nn.Dropout2d(self.channel_dropout)
337
-
338
- self.sim_agg = get_aggregator(
339
- self.sim_agg_type,
340
- self.nonneg_sim,
341
- self.mask_silence,
342
- self.sim_agg_heads,
343
- self.head_agg,
344
- self.sim_use_cls,
345
- dim=self.image_feat_dim
346
- )
347
-
348
- self.hparams_logged = False
349
- self.rolling_avg = RollingAvg(50)
350
- self.grad_avg = RollingAvg(50, nonzero=True)
351
-
352
- self.save_hyperparameters()
353
-
354
- def set_full_train(self, full_train):
355
- self.full_train = full_train
356
-
357
- def prep_feats(self, feats, is_audio):
358
-
359
- if not is_audio and self.training and self.image_pool_width > 1:
360
- feats = torch.nn.AvgPool2d(self.image_pool_width)(feats)
361
-
362
- if is_audio and self.training and self.audio_pool_width > 1:
363
- feats = torch.nn.AvgPool2d((1, self.audio_pool_width))(feats)
364
-
365
- if self.norm_vectors:
366
- feats = F.normalize(feats, dim=1)
367
-
368
- return feats
369
-
370
- def on_before_optimizer_step(self, optimizer, optimizer_idx):
371
- norms = grad_norm(self, norm_type=2)
372
- avg_grads = self.grad_avg.get_all()
373
- params = {
374
- f"grad_2.0_norm/{name}": p
375
- for name, p in self.named_parameters()
376
- if p.grad is not None
377
- }
378
-
379
- if self.adaptive_clipping:
380
- for k in norms.keys():
381
- if k in params:
382
- avg_grad = max(avg_grads.get(k, norms[k]), 1e-5)
383
- if self.global_step > 10 and norms[k] > avg_grad * 5:
384
- print(f"Bad grad for {k}: {norms[k]} scaling to {avg_grad * 5}")
385
- torch.nn.utils.clip_grad_norm_(params[k], avg_grad * 5)
386
- norms[k] = avg_grad * 5
387
-
388
- if norms[k] > self.gradient_clipping:
389
- # print(f"Bad grad for {k}: {norms[k]} scaling to {self.gradient_clipping}")
390
- torch.nn.utils.clip_grad_norm_(params[k], self.gradient_clipping)
391
-
392
- # self.grad_avg.add_all(norms)
393
- # self.log_dict(norms)
394
-
395
- def interpolate_mask(self, mask, target_length, discrete):
396
- b, t = mask.shape
397
-
398
- mask = F.interpolate(mask.reshape(b, 1, 1, t), (1, target_length), mode="bilinear") \
399
- .reshape(b, target_length)
400
-
401
- if discrete:
402
- mask = mask > 0.01
403
- sums = mask.sum(1)
404
- all_zeros = torch.where(sums == 0)[0]
405
- if len(all_zeros) > 0:
406
- print("Fixing a bad mask")
407
- for entry in all_zeros:
408
- mask[entry, torch.randint(0, target_length - 1, size=())] = True
409
- else:
410
- return mask
411
- return mask
412
-
413
- def forward_audio(self, batch):
414
- if self.use_cached_embs:
415
- audio_feats = batch["audio_emb"]
416
- if "audio_cls" in batch:
417
- audio_cls = batch["audio_cls"]
418
- else:
419
- audio_cls = None
420
- else:
421
- audio = batch[self.audio_input]
422
-
423
- if self.full_train:
424
- audio_feats, audio_cls = self.audio_model(audio, include_cls=True)
425
- else:
426
- with torch.no_grad():
427
- audio_feats, audio_cls = self.audio_model(audio, include_cls=True)
428
-
429
- mask = batch[AUDIO_MASK] if AUDIO_MASK in batch else torch.ones_like(audio)
430
- pos_mask = batch[AUDIO_POS_MASK] if AUDIO_POS_MASK in batch else torch.ones_like(audio)
431
-
432
- if self.learn_audio_cls:
433
- assert audio_cls is None
434
- audio_cls = torch.broadcast_to(self.audio_cls.unsqueeze(0), (audio_feats.shape[0], audio_feats.shape[1]))
435
-
436
- aligned_audio_feats, aligned_audio_cls = self.audio_aligner(audio_feats, audio_cls)
437
-
438
- if self.channel_dropout > 0.0:
439
- aligned_audio_feats = self.channel_dropout_layer(aligned_audio_feats)
440
-
441
- aligned_audio_feats = self.prep_feats(aligned_audio_feats, is_audio=True)
442
- audio_mask = self.interpolate_mask(mask, aligned_audio_feats.shape[-1], True)
443
- audio_pos_mask = self.interpolate_mask(pos_mask, aligned_audio_feats.shape[-1], False)
444
-
445
- ret = {
446
- AUDIO_MASK: audio_mask,
447
- AUDIO_POS_MASK: audio_pos_mask,
448
- AUDIO_FEATS: aligned_audio_feats,
449
- }
450
-
451
- if aligned_audio_cls is not None:
452
- ret[AUDIO_CLS] = aligned_audio_cls
453
-
454
- return ret
455
-
456
- # @autocast(device_type="cuda", enabled=False)
457
- def forward_image(self, batch, max_batch_size=None):
458
-
459
- with torch.no_grad():
460
- image = batch[IMAGE_INPUT]
461
- b, nf, c, h, w = image.shape
462
- image = image.reshape(b * nf, c, h, w)
463
-
464
- if max_batch_size is None:
465
- max_batch_size = image.shape[0]
466
-
467
- chunks = [image[i:i + max_batch_size] for i in range(0, image.shape[0], max_batch_size)]
468
-
469
- all_image_feats = []
470
- all_image_cls = []
471
-
472
- for chunk in chunks:
473
- if self.full_train:
474
- image_feats, image_cls = self.image_model(chunk, include_cls=True)
475
- else:
476
- with torch.no_grad():
477
- image_feats, image_cls = self.image_model(chunk, include_cls=True)
478
-
479
- aligned_image_feats, aligned_image_cls = self.image_aligner(image_feats, image_cls)
480
-
481
- all_image_feats.append(aligned_image_feats)
482
- all_image_cls.append(aligned_image_cls)
483
-
484
- # Stitch the chunks back together
485
- aligned_image_feats = torch.cat(all_image_feats, dim=0)
486
- aligned_image_cls = torch.cat(all_image_cls, dim=0)
487
-
488
- if self.channel_dropout > 0.0:
489
- aligned_image_feats = self.channel_dropout_layer(aligned_image_feats)
490
-
491
- if self.spatial_dropout > 0.0:
492
- aligned_image_feats = self.spatial_dropout_layer(aligned_image_feats)
493
-
494
- aligned_image_feats = self.prep_feats(aligned_image_feats, is_audio=False)
495
- ret = {IMAGE_FEATS: aligned_image_feats}
496
-
497
- if IMAGE_MASK in batch:
498
- with torch.no_grad():
499
- mask = batch[IMAGE_MASK]
500
- mask = mask.reshape(b * nf, 1, h, w)
501
- b, c, h, w = aligned_image_feats.shape
502
- mask = F.adaptive_avg_pool2d(mask.to(aligned_image_feats), output_size=(h, w))
503
- ret[IMAGE_MASK] = mask
504
-
505
- if aligned_image_cls is not None:
506
- ret[IMAGE_CLS] = aligned_image_cls
507
-
508
- return ret
509
-
510
- def forward(self, batch):
511
- audio_feat_dict = self.forward_audio(batch)
512
- image_feat_dict = self.forward_image(batch)
513
- return {**image_feat_dict, **audio_feat_dict}
514
-
515
- def contrast_loss(self, sims):
516
- b = sims.shape[0]
517
- sims = sims - torch.eye(b, b, device=sims.device) * self.loss_margin
518
- sims_1 = sims
519
- sims_2 = sims.permute(1, 0)
520
-
521
- if self.loss_leak > 0.0:
522
- id = torch.eye(sims_1.shape[0], sims_1.shape[1], device=sims.device, dtype=sims.dtype)
523
- label_mask = id * (1 - self.loss_leak)
524
- label_mask += (1 - id) * self.loss_leak / (sims_1.shape[0] - 1)
525
- label_mask /= label_mask.sum(dim=1, keepdim=True)
526
- else:
527
- label_mask = torch.eye(sims_1.shape[0], sims_1.shape[1], device=sims.device, dtype=sims.dtype)
528
-
529
- labels = torch.arange(0, sims.shape[0], device=sims.device)
530
- self.rolling_avg.add(f"acc/1", (sims.argmax(dim=1) == labels).to(sims).mean())
531
- self.rolling_avg.add(f"acc/2", (sims.argmax(dim=0) == labels).to(sims).mean())
532
-
533
- if self.loss_type == "margin":
534
- margin_loss_tensor = (sims - torch.diag(sims)).clamp_min(0)
535
- margin_loss = margin_loss_tensor.mean()
536
- self.rolling_avg.add(f"loss/frac_nonzero", (margin_loss_tensor > 0).to(sims).mean())
537
- self.rolling_avg.add(f"loss/margin", margin_loss)
538
- return margin_loss
539
- elif self.loss_type == "ce":
540
- ce_loss = 1 / 2 * F.cross_entropy(sims_1, labels) + \
541
- 1 / 2 * F.cross_entropy(sims_2, labels)
542
- self.rolling_avg.add(f"loss/ce", ce_loss)
543
- return ce_loss
544
- elif self.loss_type == "bce":
545
- bce_loss = F.binary_cross_entropy_with_logits(sims_1.flatten(), label_mask.flatten())
546
- self.rolling_avg.add(f"loss/bce", bce_loss)
547
- return bce_loss
548
- elif self.loss_type == "nce":
549
- nce_loss = 1 / 2 * (-F.log_softmax(sims_1, dim=-1) * label_mask).sum(1).mean() + \
550
- 1 / 2 * (-F.log_softmax(sims_2, dim=-1) * label_mask).sum(1).mean()
551
- self.rolling_avg.add(f"loss/nce", nce_loss)
552
- return nce_loss
553
- else:
554
- raise ValueError(f"Unknown loss type {self.loss_type}")
555
-
556
- def loss(self, preds):
557
- image_feats = preds[IMAGE_FEATS]
558
- audio_feats = preds[AUDIO_FEATS]
559
- audio_mask = preds[AUDIO_MASK]
560
- image_mask = preds[IMAGE_MASK]
561
- audio_pos_mask = preds[AUDIO_POS_MASK]
562
- if DATA_SOURCE in preds:
563
- source = preds[DATA_SOURCE].to(torch.int64)
564
- else:
565
- source = None
566
-
567
- uncal_sims = self.sim_agg(preds, agg_heads=True)
568
- sims = self.sim_cal(uncal_sims)
569
-
570
- _mask = 1 - torch.eye(sims.shape[0], device=sims.device)
571
- self.log(f"sim/pos", torch.diag(sims).mean())
572
- self.log(f"sim/neg", (sims * _mask).sum() / (_mask.sum()))
573
- self.log(f"sim/uncal_pos", torch.diag(uncal_sims).mean())
574
- self.log(f"sim/uncal_neg", (uncal_sims * _mask).sum() / (_mask.sum()))
575
-
576
- b, c, h, w = image_feats.shape
577
- b, c, f, t = audio_feats.shape
578
- n_samples = 250
579
-
580
- nh = self.sim_agg_heads
581
- image_feats_by_head = image_feats.reshape(b, self.sim_agg_heads, c // nh, h, w)
582
- audio_feats_by_head = audio_feats.reshape(b, self.sim_agg_heads, c // nh, f, t)
583
-
584
- def maybe_clamp(t):
585
- return t.clamp_min(0) if self.nonneg_sim else t
586
-
587
- paired_sim_raw = self.sim_agg.get_pairwise_sims(preds, raw=True, agg_sim=False, agg_heads=False)
588
- paired_sim = maybe_clamp(paired_sim_raw)
589
-
590
- loss = 0.0
591
-
592
- if self.nonneg_pressure:
593
- afb, afk, afc, aff, aft = audio_feats_by_head.shape
594
- ifb, ifk, ifc, ifh, ifw = image_feats_by_head.shape
595
- assert (afb == ifb)
596
-
597
- device = audio_feats_by_head.device
598
- random_b = torch.randint(0, afb, size=(n_samples,), device=device)
599
- random_t = torch.randint(0, aft, size=(n_samples,), device=device)
600
- random_f = torch.randint(0, aff, size=(n_samples,), device=device)
601
- random_h = torch.randint(0, ifh, size=(n_samples,), device=device)
602
- random_w = torch.randint(0, ifw, size=(n_samples,), device=device)
603
-
604
- random_audio_feats = audio_feats_by_head[random_b, :, :, random_f, random_t]
605
- random_image_feats = image_feats_by_head[random_b, :, :, random_h, random_w]
606
- random_sim_raw = torch.einsum("bkc,dkc->bdk", random_audio_feats, random_image_feats)
607
-
608
- nonneg_loss = random_sim_raw.clamp_max(0).square().mean()
609
- self.rolling_avg.add(f"loss/nonneg", nonneg_loss)
610
- loss += nonneg_loss * self.nonneg_pressure
611
-
612
- if self.silence_l1 > 0 or self.silence_l2 > 0:
613
- masked_b, masked_t = torch.where(~audio_mask)
614
- if len(masked_b) > n_samples:
615
- subset = torch.randperm(len(masked_b))[:n_samples]
616
- masked_b = masked_b[subset]
617
- masked_t = masked_t[subset]
618
-
619
- if len(masked_b) == n_samples:
620
- silent_audio_feats = audio_feats_by_head[masked_b, :, :, :, masked_t].mean(-1) # d k c
621
- silence_tensor = maybe_clamp(
622
- torch.einsum("bkchw,dkc->bkdhw", image_feats_by_head, silent_audio_feats))
623
-
624
- silence_l1_loss = silence_tensor.abs().mean()
625
- self.rolling_avg.add(f"loss/silence_l1", silence_l1_loss)
626
- loss += silence_l1_loss * self.silence_l1
627
-
628
- silence_l2_loss = silence_tensor.square().mean()
629
- self.rolling_avg.add(f"loss/silence_l2", silence_l2_loss)
630
- loss += silence_l2_loss * self.silence_l2
631
- else:
632
- pass
633
-
634
- if self.neg_audio_weight > 0 and self.neg_audio:
635
- b, t = audio_pos_mask.shape
636
- negative_weight = ((1 - audio_pos_mask) * audio_mask.to(sims)).reshape(b, 1, 1, 1, 1, t)
637
- negative_weight = torch.broadcast_to(negative_weight, paired_sim.shape)
638
- if negative_weight.sum() > 0:
639
- neg_audio_loss = (paired_sim.square() * negative_weight).sum() \
640
- / negative_weight.sum().clamp_min(0.1)
641
- self.rolling_avg.add(f"loss/neg_audio", neg_audio_loss)
642
- self.rolling_avg.add(f"loss/neg_weight_avg", negative_weight.mean())
643
- loss += neg_audio_loss * self.neg_audio_weight
644
- else:
645
- print("WARNING: No negative samples found in batch")
646
-
647
- if self.tv_weight > 0:
648
- tv_loss = (paired_sim[:, :, :, :, :, 1:] - paired_sim[:, :, :, :, :, :-1]).square().mean()
649
- self.rolling_avg.add(f"loss/tv", tv_loss)
650
- loss += tv_loss * self.tv_weight
651
-
652
- self.log(f"cal/w", self.sim_cal.get_w())
653
- if self.cal_balance_weight > 0.0:
654
- cal_balance = (np.log(self.cal_init) - torch.log(self.sim_cal.get_w().clamp_min(.00000001))) \
655
- .clamp_min(0).square().mean()
656
- self.rolling_avg.add(f"loss/cal_balance", cal_balance)
657
- loss += cal_balance * self.cal_balance_weight
658
-
659
- if self.disentangle_weight > 0.0:
660
- assert source is not None
661
- assert self.sim_agg_heads % 2 == 0
662
-
663
- dilation = self.sim_agg_heads // 2
664
- sources_oh = F.one_hot(source, num_classes=2)
665
- b, h = sources_oh.shape
666
- sources_mask = 1 - torch.broadcast_to(sources_oh.unsqueeze(-1), (b, h, dilation)) \
667
- .reshape(b, h * dilation).to(paired_sim)
668
- disentangle_loss = torch.einsum("bkhwft,bk->bhwft", paired_sim, sources_mask).square().mean()
669
- self.rolling_avg.add(f"loss/disentangle", disentangle_loss)
670
- loss += disentangle_loss * self.disentangle_weight
671
-
672
- if self.specialization_weight > 0.0 and self.sim_agg_heads > 1:
673
- total_specialization_loss = 0.0
674
- combos = list(combinations(range(self.sim_agg_heads), 2))
675
- for i, j in combos:
676
- specialization_loss_pair = (paired_sim[:, i].abs() * paired_sim[:, j].abs()).mean()
677
- total_specialization_loss += specialization_loss_pair
678
- avg_specialization_loss = total_specialization_loss / len(combos)
679
- self.rolling_avg.add(f"loss/specialize", avg_specialization_loss)
680
- loss += avg_specialization_loss * self.specialization_weight
681
-
682
- if self.mixup_weight > 0.0:
683
- b, _, h, w = image_mask.shape
684
- neg_img_mask = torch.broadcast_to(
685
- 1 - image_mask.to(paired_sim).reshape(b, 1, h, w, 1, 1),
686
- paired_sim.shape)
687
- image_mixup_loss = (paired_sim * neg_img_mask).square().sum() / neg_img_mask.sum().clamp_min(0.1)
688
- self.rolling_avg.add(f"loss/image_mixup", image_mixup_loss)
689
- loss += image_mixup_loss * self.mixup_weight
690
-
692
- loss += self.contrast_loss(sims)
693
- self.rolling_avg.add(f"loss/total", loss)
694
-
695
- return loss
696
-
697
- def setup_hparams(self):
698
- recalls = ['A_r1', 'A_r5', 'A_r10', 'I_r1', 'I_r5', 'I_r10']
699
-
700
- if self.trainer.datamodule.use_extra_val_sets:
701
- datasets = ["Places", "AudioSet"]
702
- else:
703
- datasets = ["Val"]
704
-
705
- heads = ["total"]
706
-
707
- metric_names = [
708
- "hp/speech_basic_ap", "hp/speech_advanced_ap", "hp/sound_basic_ap",
709
- "hp/speech_basic_iou", "hp/speech_advanced_iou", "hp/sound_basic_iou",
710
- ]
711
- for dataset in datasets:
712
- for head in heads:
713
- for recall in recalls:
714
- metric_names.append(f"hp/{dataset}/{head}/{recall}")
715
-
716
- if self.sim_agg_heads == 2:
717
- metric_names.extend(["hp/ap_dis", "hp/act_dis"])
718
-
719
- if hasattr(self.trainer, "datamodule"):
720
- all_hparams = {**self.hparams, **self.trainer.datamodule.hparams}
721
- else:
722
- all_hparams = self.hparams
723
-
724
- starting_values = {n: torch.nan for n in metric_names}
725
- self.logger.log_hyperparams(all_hparams, starting_values)
726
-
727
- def on_train_start(self):
728
- self.setup_hparams()
729
- self.hparams_logged = True
730
-
731
- def on_train_batch_start(self, batch, batch_idx):
732
- remake_optimizers = False
733
-
734
- if isinstance(self.image_aligner, ProgressiveGrowing):
735
- should_remake = self.image_aligner.maybe_change_phase(self.global_step)
736
- remake_optimizers = remake_optimizers or should_remake
737
- if isinstance(self.audio_aligner, ProgressiveGrowing):
738
- should_remake = self.audio_aligner.maybe_change_phase(self.global_step)
739
- remake_optimizers = remake_optimizers or should_remake
740
-
741
- if remake_optimizers:
742
- raise NotImplementedError()
743
-
744
- def _combine_preds(self, all_preds):
745
- temp = {}
746
- new_preds = {}
747
-
748
- # Collect tensors for each key into lists
749
- for d in all_preds:
750
- for key, value in d.items():
751
- if isinstance(value, torch.Tensor):
752
- if key not in temp:
753
- temp[key] = []
754
- temp[key].append(value)
755
-
756
- # Concatenate all tensors for each key using a single call to torch.cat
757
- for key, tensor_list in temp.items():
758
- new_preds[key] = torch.cat(tensor_list)
759
- return new_preds
760
-
761
- def training_step(self, batch, batch_idx):
762
- assert batch[IMAGE_INPUT].shape[1] == 1
763
-
764
- preds = self.forward(batch)
765
- if DATA_SOURCE in batch:
766
- preds[DATA_SOURCE] = batch[DATA_SOURCE]
767
-
768
- if self.trainer.world_size > 1 and self.gather_tensors:
769
- for k, v in preds.items():
770
- new_v = v.contiguous()
771
- preds[k] = torch.cat(GatherLayer.apply(new_v), dim=0)
772
-
773
- if self.memory_buffer_size > 0:
774
- new_preds = self._combine_preds(list(self.memory_buffer) + [preds])
775
- else:
776
- new_preds = preds
777
-
778
- loss = self.loss(new_preds)
779
-
780
- if self.memory_buffer_size > 0:
781
- self.memory_buffer.append(self._recursive_detach(preds, gather=False))
782
-
783
- if self.trainer.is_global_zero and self.global_step % 50 == 1:
784
- writer = self.logger.experiment
785
- self.rolling_avg.logall(lambda k, v: writer.add_scalar(k, v, global_step=self.global_step))
786
-
787
- if self.trainer.scaler is not None:
788
- self.log("loss_scale", self.trainer.scaler.get_scale())
789
-
790
- if self.global_step % 10000 == 0 and self.global_step > 0:
791
- print("RESETTING TFEVENT FILE")
792
- self.logger.experiment.close()
793
- self.logger.experiment._get_file_writer()
794
-
795
- return loss
796
-
797
- def on_validation_start(self) -> None:
798
- if not self.hparams_logged:
799
- self.setup_hparams()
800
- self.hparams_logged = True
801
-
802
- def _auto_gather(self, t):
803
- if t.dtype == torch.bool:
804
- t = t.to(torch.float)
805
-
806
- if self.trainer.num_devices == 1:
807
- return t.cpu()
808
-
809
- t = torch.clone(t).contiguous()
810
- if self.trainer.is_global_zero:
811
- gather_list = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
812
- dist.gather(t, gather_list)
813
- return torch.cat(gather_list, dim=0).cpu()
814
- else:
815
- dist.gather(t)
816
-
817
- def validation_step(self, batch, batch_idx, dataloader_idx=0):
818
-
819
- with torch.no_grad():
820
- preds = self.forward(batch)
821
-
822
- ret = {}
823
- for k in preds.keys():
824
- if k in preds:
825
- ret[k] = self._auto_gather(preds[k])
826
-
827
- batch_keys = [IMAGE_INPUT, "spec", "semseg", "num_pixels_per_class", 'total_length']
828
- for k in batch_keys:
829
- if k in batch:
830
- ret[k] = self._auto_gather(batch[k])
831
-
832
- if "metadata" in batch:
833
- if isinstance(batch["metadata"]["id"], torch.Tensor):
834
- ret["id"] = self._auto_gather(batch["metadata"]["id"])
835
- ret["index"] = self._auto_gather(batch["metadata"]["index"])
836
-
837
- return ret
838
-
839
- def _calc_recalls(self, sim):
840
- top_10_a = sim.topk(10, 0).indices == torch.arange(sim.shape[0]).unsqueeze(0)
841
- top_10_i = (sim.topk(10, 1).indices == torch.arange(sim.shape[0]).unsqueeze(1)).permute(1, 0)
842
- a_recall = lambda p: top_10_a[0:p].any(0).to(sim).mean()
843
- i_recall = lambda p: top_10_i[0:p].any(0).to(sim).mean()
844
- return {'A_r1': a_recall(1),
845
- 'A_r5': a_recall(5),
846
- 'A_r10': a_recall(10),
847
- 'I_r1': i_recall(1),
848
- 'I_r5': i_recall(5),
849
- 'I_r10': i_recall(10)}
850
-
851
- def calc_recalls(self, preds, dataset):
852
- sim = self.sim_agg.forward_batched(
853
- preds=preds,
854
- agg_heads=False,
855
- batch_size=4,
856
- ).cpu()
857
-
858
- all_metrics = dict()
859
- for k, v in self._calc_recalls(sim.sum(-1)).items():
860
- all_metrics[f"hp/{dataset}/total/" + k] = v
861
-
862
- return all_metrics
863
-
864
- def retrieval_validation(self, outputs, dataset_name):
865
- if len(outputs) == 0:
866
- return
867
-
868
- if self.trainer.is_global_zero:
869
- results = flatten_preds(outputs)
870
- if not self.trainer.sanity_checking:
871
- print(results[IMAGE_FEATS].shape[0])
872
- # assert (results[IMAGE_FEATS].shape[0] == 1000)
873
- results[IMAGE_FEATS] = results[IMAGE_FEATS].cpu()
874
- results[AUDIO_FEATS] = results[AUDIO_FEATS].cuda()
875
- if self.sim_use_cls:
876
- results[AUDIO_CLS] = results[AUDIO_CLS].cuda()
877
- results[AUDIO_CLS] = results[AUDIO_CLS].cuda()
878
-
879
- results[AUDIO_MASK] = results[AUDIO_MASK].cuda()
880
-
881
- recalls = self.calc_recalls(results, dataset_name)
882
-
883
- results[IMAGE_FEATS] = results[IMAGE_FEATS].cuda()
884
-
885
- writer = self.logger.experiment
886
- print("here")
887
- for name, v in recalls.items():
888
- writer.add_scalar(f"{name}", v, self.global_step + 1)
889
-
890
- def semseg_validation(self, speech_preds, sound_preds):
891
-
892
- if self.trainer.is_global_zero:
893
- from eval_utils import get_paired_heatmaps
894
- def prep_preds(preds, loader):
895
- results = flatten_preds(preds)
896
- metadata = loader.dataset.metadata
897
- ordered_metadata = metadata.iloc[results["index"].numpy(), :].copy()
898
- ordered_metadata["order"] = range(len(ordered_metadata))
899
- return results, ordered_metadata
900
-
901
- [_, _, speech_loader, sound_loader] = self.trainer.val_dataloaders
902
- speech_results, speech_metadata = prep_preds(speech_preds, speech_loader)
903
- sound_results, sound_metadata = prep_preds(sound_preds, sound_loader)
904
-
905
- self.sound_metrics, unique_sound_indices = get_paired_heatmaps(
906
- self, sound_results, sound_metadata["ade_class_id"], None)
907
-
908
- self.speech_metrics, unique_word_indices = get_paired_heatmaps(
909
- self, speech_results, speech_metadata["ade_class_id"], speech_metadata["timing"])
910
-
911
- writer = self.logger.experiment
912
-
913
- all_metrics = {
914
- **{"sound_" + k: v for k, v in self.sound_metrics.items()},
915
- **{"speech_" + k: v for k, v in self.speech_metrics.items()},
916
- }
917
-
918
- for k, v in all_metrics.items():
919
- writer.add_scalar(f"hp/{k}", torch.tensor(v).mean(), self.global_step + 1)
920
-
921
- def disentangle_validation(self, word_preds, sound_preds):
922
-
923
- if len(word_preds) == 0 or len(sound_preds) == 0:
924
- return
925
-
926
- if self.trainer.is_global_zero:
927
- word_preds = flatten_preds(word_preds)
928
- sound_preds = flatten_preds(sound_preds)
929
-
930
- word_scores = self.sim_agg.get_pairwise_sims(
931
- word_preds,
932
- raw=False,
933
- agg_sim=True,
934
- agg_heads=False,
935
- )
936
-
937
- sound_scores = self.sim_agg.get_pairwise_sims(
938
- sound_preds,
939
- raw=False,
940
- agg_sim=True,
941
- agg_heads=False,
942
- )
943
-
944
- all_scores = torch.cat([word_scores, sound_scores], dim=0)
945
- all_scores -= all_scores.min(dim=0, keepdim=True).values
946
- all_scores /= all_scores.max(dim=0, keepdim=True).values.clamp_min(.0001)
947
-
948
- is_words = torch.cat([
949
- torch.ones(word_scores.shape[0]),
950
- torch.zeros(sound_scores.shape[0])], dim=0).to(torch.bool)
951
-
952
- assert all_scores.shape[1] == 2
953
- ap_matrix = torch.zeros(2, 2)
954
- act_matrix = torch.zeros(2, 2)
955
-
956
- for head in range(2):
957
- # writer.add_histogram(f"h{head}_all_scores", all_scores[:, head])
958
- for dataset_num in range(2):
959
- if dataset_num == 0:
960
- labels = is_words
961
- else:
962
- labels = ~is_words
963
-
964
- ap_matrix[head, dataset_num] = binary_average_precision(
965
- all_scores[:, head].cpu(), labels.to(torch.int64).cpu())
966
-
967
- act_matrix[head, dataset_num] = 1 - (all_scores[:, head][labels]).mean()
968
-
969
- ap_dis = max(.5 * (ap_matrix[0, 0] + ap_matrix[1, 1]),
970
- .5 * (ap_matrix[0, 1] + ap_matrix[1, 0]))
971
-
972
- act_dis = max(.5 * (act_matrix[0, 0] + act_matrix[1, 1]),
973
- .5 * (act_matrix[0, 1] + act_matrix[1, 0]))
974
-
975
- print("AP", ap_matrix)
976
- print("AP dis", ap_dis)
977
- print("Act", act_matrix)
978
- print("Act dis", act_dis)
979
-
980
- writer = self.logger.experiment
981
- writer.add_scalar("hp/ap_dis", ap_dis, self.global_step + 1)
982
- writer.add_scalar("hp/act_dis", act_dis, self.global_step + 1)
983
-
984
- def validation_epoch_end(self, outputs) -> None:
985
- print("Val end")
986
- with torch.no_grad():
987
- if self.trainer.datamodule.use_extra_val_sets:
988
- if self.sim_agg_heads == 2:
989
- self.disentangle_validation(outputs[0], outputs[1])
990
- self.retrieval_validation(outputs[0], "Places")
991
- self.retrieval_validation(outputs[1], "AudioSet")
992
- self.semseg_validation(outputs[2], outputs[3])
993
-
994
- else:
995
- print("HERE!")
996
- self.retrieval_validation(outputs, "Val")
997
-
998
- writer = self.logger.experiment
999
- writer.flush()
1000
-
1001
- def _recursive_detach(self, obj, gather=True):
1002
- if isinstance(obj, torch.Tensor):
1003
- if gather:
1004
- return self._auto_gather(obj)
1005
- else:
1006
- return obj.detach()
1007
- elif isinstance(obj, dict):
1008
- return {k: self._recursive_detach(v, gather) for k, v in obj.items()}
1009
- elif isinstance(obj, list):
1010
- return [self._recursive_detach(v, gather) for v in obj]
1011
- else:
1012
- return obj
1013
-
1014
- def predict_step(self, batch, batch_idx: int, dataloader_idx: int = 0):
1015
- with torch.no_grad():
1016
- predictions = {}
1017
- for k, v in batch.items():
1018
- predictions[k] = self._recursive_detach(v)
1019
- for k, v in self.forward(batch).items():
1020
- predictions[k] = self._auto_gather(v)
1021
-
1022
- return predictions
1023
-
1024
- def _configure_optimizers(self, full_train, lr):
1025
- params = [
1026
- *self.audio_aligner.parameters(),
1027
- *self.image_aligner.parameters(),
1028
- *self.sim_cal.parameters(),
1029
- *self.sim_agg.parameters()
1030
- ]
1031
-
1032
- if (self.finetune_image_model or self.image_lora) and full_train:
1033
- params.extend(self.image_model.parameters())
1034
-
1035
- if (self.finetune_audio_model or self.audio_lora) and full_train:
1036
- params.extend(self.audio_model.parameters())
1037
-
1038
- if self.learn_audio_cls:
1039
- params.append(self.audio_cls)
1040
-
1041
- last_epoch = self.global_step - 1
1042
- if self.optimizer == "adam":
1043
- opt = torch.optim.Adam(params, lr=lr, eps=1e-7)
1044
- elif self.optimizer == "nadam":
1045
- opt = torch.optim.NAdam(params, lr=lr, eps=1e-7)
1046
- else:
1047
- raise ValueError(f"Unknown optimizer {self.optimizer}")
1048
-
1049
- if self.lr_schedule == "sgdr":
1050
- scheduler = CosineAnnealingWarmRestarts(
1051
- opt, self.lr_cycle_length, 2, eta_min=lr * 2e-2, last_epoch=last_epoch)
1052
- else:
1053
- scheduler = LambdaLR(opt, lr_lambda=lambda step: 1.0, last_epoch=last_epoch)
1054
-
1055
- if self.lr_warmup > 0:
1056
- warmup = LambdaLR(
1057
- opt,
1058
- lr_lambda=lambda step: min(max(float(step), 0.0) / self.lr_warmup, 1.0),
1059
- last_epoch=last_epoch,
1060
- )
1061
- scheduler = SequentialLR(
1062
- opt,
1063
- schedulers=[warmup, scheduler],
1064
- milestones=[self.lr_warmup],
1065
- last_epoch=last_epoch)
1066
-
1067
- scheduler = {"scheduler": scheduler, "interval": "step"}
1068
-
1069
- return [opt], [scheduler]
1070
-
1071
- def configure_optimizers(self):
1072
- if self.full_train:
1073
- return self._configure_optimizers(self.full_train, self.lr)
1074
- else:
1075
- return self._configure_optimizers(self.full_train, self.pretrain_lr)
1076
-
1077
-
1078
- @hydra.main(config_path="configs", config_name="av_align.yaml", version_base=None)
1079
- def my_app(cfg: DictConfig) -> None:
1080
- print(OmegaConf.to_yaml(cfg))
1081
- seed_everything(cfg.seed, workers=True)
1082
-
1083
- exp_name = f"{cfg.resume_prefix}"
1084
-
1085
- if cfg.image_model_type == "dino8":
1086
- patch_size = 8 * cfg.image_pool_width
1087
- elif cfg.image_model_type == "cavmae":
1088
- patch_size = 16 * cfg.image_pool_width
1089
- elif cfg.image_model_type == "imagebind":
1090
- patch_size = 16 * cfg.image_pool_width
1091
- elif cfg.image_model_type == "clip":
1092
- patch_size = 16 * cfg.image_pool_width
1093
- elif cfg.image_model_type == "cavmae-mixed":
1094
- patch_size = 16 * cfg.image_pool_width
1095
- elif cfg.image_model_type == "dinov2":
1096
- patch_size = 14 * cfg.image_pool_width
1097
- else:
1098
- raise ValueError(f"Unknown patch size for model {cfg.image_model_type}")
1099
-
1100
- datamodule = AVDataModule(
1101
- dataset_name=cfg.dataset_name,
1102
- load_size=cfg.load_size,
1103
- image_aug=cfg.image_aug,
1104
- audio_aug=cfg.audio_aug,
1105
- extra_audio_masking=cfg.extra_audio_masking,
1106
- audio_model_type=cfg.audio_model_type,
1107
- pytorch_data_dir=cfg.pytorch_data_dir,
1108
- use_cached_embs=cfg.use_cached_embs,
1109
- batch_size=cfg.batch_size,
1110
- num_workers=cfg.num_workers,
1111
- audio_level=cfg.audio_level,
1112
- neg_audio=cfg.neg_audio,
1113
- use_original_val_set=not cfg.use_extra_val_sets,
1114
- use_extra_val_sets=cfg.use_extra_val_sets,
1115
- data_for_plotting=False,
1116
- quad_mixup=cfg.quad_mixup,
1117
- bg_mixup=cfg.bg_mixup,
1118
- patch_mixup=cfg.patch_mixup,
1119
- patch_size=patch_size
1120
- )
1121
- datamodule.maybe_unpack(remove_source=cfg.submitting_to_aml)
1122
-
1123
- aligner = create_model_from_cfg(LitAVAligner, cfg, {})
1124
-
1125
- if cfg.starting_weights is not None:
1126
- loaded = torch.load(join(cfg.output_root, cfg.starting_weights), map_location='cpu')
1127
- state = loaded["state_dict"]
1128
- aligner.load_state_dict(state, strict=cfg.load_strict)
1129
- del state
1130
- del loaded
1131
-
1132
- if cfg.num_gpus > 1:
1133
- # strategy = "ddp_sharded" # _find_unused_parameters_true"
1134
- strategy = "ddp" # _find_unused_parameters_true"
1135
- else:
1136
- strategy = "auto"
1137
-
1138
- if cfg.dataset_name in {"places-audio", "mixed", "audio-set", "mixed-full"}:
1139
- val_args = dict(check_val_every_n_epoch=2)
1140
- elif cfg.dataset_name in {"dolphin"}:
1141
- val_args = dict(check_val_every_n_epoch=5)
1142
- else:
1143
- val_args = dict(val_check_interval=10000)
1144
-
1145
- # val_args = dict(val_check_interval=1000)
1146
-
1147
- def maybe_get_ckpt(ckpt_dir):
1148
- if cfg.auto_resume and os.path.exists(ckpt_dir):
1149
- print(f"Attempting to resume from {ckpt_dir}")
1150
- candidates = os.listdir(ckpt_dir)
1151
- assert (len(candidates) == 1)
1152
- return join(ckpt_dir, candidates[0])
1153
- elif cfg.auto_resume:
1154
- print(f"Could not find checkpoint at {ckpt_dir}")
1155
- return None
1156
- else:
1157
- return None
1158
-
1159
- log_dir = join(cfg.output_root, "logs", cfg.grouping_name, exp_name)
1160
- ckpt_dir = join(cfg.output_root, "checkpoints", cfg.grouping_name, exp_name)
1161
-
1162
- import gc
1163
- torch.cuda.empty_cache()
1164
- gc.collect()
1165
-
1166
- def run_exp(aligner, full_train):
1167
- trainer_args = dict(
1168
- accelerator='gpu',
1169
- strategy=strategy,
1170
- devices=cfg.num_gpus,
1171
- num_sanity_val_steps=cfg.num_sanity_val_steps,
1172
- log_every_n_steps=50,
1173
- reload_dataloaders_every_n_epochs=10,
1174
- precision="16",
1175
- # profiler="simple",
1176
- # precision="bf16",
1177
- max_steps=cfg.max_steps,
1178
- **val_args)
1179
-
1180
- aligner.set_full_train(full_train)
1181
- if full_train:
1182
- suffix = "train"
1183
- else:
1184
- suffix = "pretrain"
1185
- trainer_args["max_steps"] = cfg.pretrain_steps
1186
-
1187
- print(f"Starting {suffix} phase")
1188
-
1189
- logger = TensorBoardLogger(join(log_dir, suffix), default_hp_metric=False)
1190
- callbacks = [
1191
- ModelCheckpoint(join(ckpt_dir, suffix), every_n_epochs=1),
1192
- LearningRateMonitor(logging_interval='step'),
1193
- ]
1194
- Trainer(logger=logger,
1195
- callbacks=callbacks,
1196
- **trainer_args).fit(
1197
- aligner,
1198
- datamodule=datamodule,
1199
- ckpt_path=maybe_get_ckpt(join(ckpt_dir, suffix)))
1200
-
1201
- train_chkpt = maybe_get_ckpt(join(ckpt_dir, "train"))
1202
-
1203
- gc.collect()
1204
- if torch.cuda.is_available():
1205
- torch.cuda.empty_cache()
1206
-
1207
- if cfg.pretrain_steps > 0 and train_chkpt is None:
1208
- run_exp(aligner, full_train=False)
1209
- run_exp(aligner, full_train=True)
1210
-
1211
-
1212
- if __name__ == "__main__":
1213
- my_app()
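For reference, the learning-rate policy assembled in `_configure_optimizers` above (a linear warmup handed off to cosine annealing with warm restarts through `SequentialLR`, stepped once per training step) can be reproduced in isolation as a small sketch. The dummy parameter and the warmup/cycle lengths below are placeholders, not the project's configured values:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts, LambdaLR, SequentialLR

# Standalone sketch of the schedule wired up in _configure_optimizers above.
# The single dummy parameter and the step counts are illustrative only.
param = torch.nn.Parameter(torch.zeros(1))
lr, lr_warmup, lr_cycle_length = 3e-4, 1_000, 10_000

opt = torch.optim.Adam([param], lr=lr, eps=1e-7)
warmup = LambdaLR(opt, lr_lambda=lambda step: min(max(float(step), 0.0) / lr_warmup, 1.0))
cosine = CosineAnnealingWarmRestarts(opt, lr_cycle_length, 2, eta_min=lr * 2e-2)
schedule = SequentialLR(opt, schedulers=[warmup, cosine], milestones=[lr_warmup])

for step in range(3):
    opt.step()
    schedule.step()  # stepped every training step, matching {"scheduler": ..., "interval": "step"}
```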
DenseAV/gradio_app.py DELETED
@@ -1,196 +0,0 @@
1
- import csv
2
- import os
3
- import tempfile
4
-
5
- import gradio as gr
6
- import requests
7
- import torch
8
- import torchvision
9
- import torchvision.transforms as T
10
- from PIL import Image
11
- from featup.util import norm
12
- from torchaudio.functional import resample
13
-
14
- from denseav.train import LitAVAligner
15
- from denseav.plotting import plot_attention_video, plot_2head_attention_video, plot_feature_video
16
- from denseav.shared import norm, crop_to_divisor, blur_dim
17
- from os.path import join
18
-
19
- if __name__ == "__main__":
20
-
21
- mode = "local"
22
-
23
- if mode == "local":
24
- sample_videos_dir = "samples"
25
- else:
26
- os.environ['TORCH_HOME'] = '/tmp/.cache'
27
- os.environ['HF_HOME'] = '/tmp/.cache'
28
- os.environ['HF_DATASETS_CACHE'] = '/tmp/.cache'
29
- os.environ['TRANSFORMERS_CACHE'] = '/tmp/.cache'
30
- os.environ['GRADIO_EXAMPLES_CACHE'] = '/tmp/gradio_cache'
31
- sample_videos_dir = "/tmp/samples"
32
-
33
-
34
- def download_video(url, save_path):
35
- response = requests.get(url)
36
- with open(save_path, 'wb') as file:
37
- file.write(response.content)
38
-
39
-
40
- base_url = "https://marhamilresearch4.blob.core.windows.net/denseav-public/samples/"
41
- sample_videos_urls = {
42
- "puppies.mp4": base_url + "puppies.mp4",
43
- "peppers.mp4": base_url + "peppers.mp4",
44
- "boat.mp4": base_url + "boat.mp4",
45
- "elephant2.mp4": base_url + "elephant2.mp4",
46
-
47
- }
48
-
49
- # Ensure the directory for sample videos exists
50
- os.makedirs(sample_videos_dir, exist_ok=True)
51
-
52
- # Download each sample video
53
- for filename, url in sample_videos_urls.items():
54
- save_path = os.path.join(sample_videos_dir, filename)
55
- # Download the video if it doesn't already exist
56
- if not os.path.exists(save_path):
57
- print(f"Downloading {filename}...")
58
- download_video(url, save_path)
59
- else:
60
- print(f"{filename} already exists. Skipping download.")
61
-
62
- csv.field_size_limit(100000000)
63
- options = ['language', "sound-language", "sound"]
64
- load_size = 224
65
- plot_size = 224
66
-
67
- video_input = gr.Video(label="Choose a video to featurize", height=480)
68
- model_option = gr.Radio(options, value="language", label='Choose a model')
69
-
70
- video_output1 = gr.Video(label="Audio Video Attention", height=480)
71
- video_output2 = gr.Video(label="Multi-Head Audio Video Attention (Only available for the sound-language model)",
72
- height=480)
73
- video_output3 = gr.Video(label="Visual Features", height=480)
74
-
75
- models = {o: LitAVAligner.from_pretrained(f"mhamilton723/DenseAV-{o}") for o in options}
76
-
77
-
78
- def process_video(video, model_option):
79
- model = models[model_option].cuda()
80
-
81
- original_frames, audio, info = torchvision.io.read_video(video, end_pts=10, pts_unit='sec')
82
- sample_rate = 16000
83
-
84
- if info["audio_fps"] != sample_rate:
85
- audio = resample(audio, info["audio_fps"], sample_rate)
86
- audio = audio[0].unsqueeze(0)
87
-
88
- img_transform = T.Compose([
89
- T.Resize(load_size, Image.BILINEAR),
90
- lambda x: crop_to_divisor(x, 8),
91
- lambda x: x.to(torch.float32) / 255,
92
- norm])
93
-
94
- frames = torch.cat([img_transform(f.permute(2, 0, 1)).unsqueeze(0) for f in original_frames], axis=0)
95
-
96
- plotting_img_transform = T.Compose([
97
- T.Resize(plot_size, Image.BILINEAR),
98
- lambda x: crop_to_divisor(x, 8),
99
- lambda x: x.to(torch.float32) / 255])
100
-
101
- frames_to_plot = plotting_img_transform(original_frames.permute(0, 3, 1, 2))
102
-
103
- with torch.no_grad():
104
- audio_feats = model.forward_audio({"audio": audio.cuda()})
105
- audio_feats = {k: v.cpu() for k, v in audio_feats.items()}
106
- image_feats = model.forward_image({"frames": frames.unsqueeze(0).cuda()}, max_batch_size=2)
107
- image_feats = {k: v.cpu() for k, v in image_feats.items()}
108
-
109
- sim_by_head = model.sim_agg.get_pairwise_sims(
110
- {**image_feats, **audio_feats},
111
- raw=False,
112
- agg_sim=False,
113
- agg_heads=False
114
- ).mean(dim=-2).cpu()
115
-
116
- sim_by_head = blur_dim(sim_by_head, window=3, dim=-1)
117
- print(sim_by_head.shape)
118
-
119
- temp_video_path_1 = tempfile.mktemp(suffix='.mp4')
120
-
121
- plot_attention_video(
122
- sim_by_head,
123
- frames_to_plot,
124
- audio,
125
- info["video_fps"],
126
- sample_rate,
127
- temp_video_path_1)
128
-
129
- if model_option == "sound-language":
130
- temp_video_path_2 = tempfile.mktemp(suffix='.mp4')
131
-
132
- plot_2head_attention_video(
133
- sim_by_head,
134
- frames_to_plot,
135
- audio,
136
- info["video_fps"],
137
- sample_rate,
138
- temp_video_path_2)
139
-
140
- else:
141
- temp_video_path_2 = None
142
-
143
- temp_video_path_3 = tempfile.mktemp(suffix='.mp4')
144
- temp_video_path_4 = tempfile.mktemp(suffix='.mp4')
145
-
146
- plot_feature_video(
147
- image_feats["image_feats"].cpu(),
148
- audio_feats['audio_feats'].cpu(),
149
- frames_to_plot,
150
- audio,
151
- info["video_fps"],
152
- sample_rate,
153
- temp_video_path_3,
154
- temp_video_path_4,
155
- )
156
- # return temp_video_path_1, temp_video_path_2, temp_video_path_3, temp_video_path_4
157
-
158
- return temp_video_path_1, temp_video_path_2, temp_video_path_3
159
-
160
-
161
- with gr.Blocks() as demo:
162
- with gr.Column():
163
- gr.Markdown("## Visualizing Sound and Language with DenseAV")
164
- gr.Markdown(
165
- "This demo allows you to explore the inner attention maps of DenseAV's dense multi-head contrastive operator.")
166
- with gr.Row():
167
- with gr.Column(scale=1):
168
- model_option.render()
169
- with gr.Column(scale=3):
170
- video_input.render()
171
- with gr.Row():
172
- submit_button = gr.Button("Submit")
173
- with gr.Row():
174
- gr.Examples(
175
- examples=[
176
- [join(sample_videos_dir, "puppies.mp4"), "sound-language"],
177
- [join(sample_videos_dir, "peppers.mp4"), "language"],
178
- [join(sample_videos_dir, "elephant2.mp4"), "language"],
179
- [join(sample_videos_dir, "boat.mp4"), "language"]
180
-
181
- ],
182
- inputs=[video_input, model_option]
183
- )
184
- with gr.Row():
185
- video_output1.render()
186
- video_output2.render()
187
- video_output3.render()
188
-
189
- submit_button.click(fn=process_video, inputs=[video_input, model_option],
190
- outputs=[video_output1, video_output2, video_output3])
191
-
192
-
193
- if mode == "local":
194
- demo.launch(server_name="0.0.0.0", server_port=6006, debug=True)
195
- else:
196
- demo.launch(server_name="0.0.0.0", server_port=7860, debug=True)
DenseAV/hubconf.py DELETED
@@ -1,25 +0,0 @@
1
- # hubconf.py
2
- from denseav.train import LitAVAligner
3
-
4
- dependencies = ['torch', 'torchvision', 'PIL', 'denseav'] # List any dependencies here
5
-
6
-
7
- def _load_base(model_name):
8
- model = LitAVAligner.load_from_checkpoint(
9
- f"https://marhamilresearch4.blob.core.windows.net/denseav-public/hub/{model_name}.ckpt",
10
- **{'loss_leak': 0.0, 'use_cached_embs': False},
11
- strict=True)
12
- model.set_full_train(True)
13
- return model
14
-
15
-
16
- def sound_and_language():
17
- return _load_base("denseav_2head")
18
-
19
-
20
- def language():
21
- return _load_base("denseav_language")
22
-
23
-
24
- def sound():
25
- return _load_base("denseav_sound")
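The three functions above are standard `torch.hub` entrypoints. A minimal sketch of how they would typically be consumed, assuming this `hubconf.py` is published on the default branch of the `mhamilton723/DenseAV` GitHub repository (the URL listed in `setup.py` below):

```python
import torch

# Hypothetical hub usage; this downloads the checkpoint referenced by _load_base,
# so it assumes network access and that the repo and entrypoint names are unchanged.
model = torch.hub.load("mhamilton723/DenseAV", "language")
model.eval()
```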
DenseAV/samples/puppies.mp4 DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:d4bc5049010142b9a4364afea7da15d4e9736d95cfc9a365c2658c69ba409d56
3
- size 7534432
@@ -1,37 +0,0 @@
1
- from setuptools import setup, find_packages
2
-
3
- setup(
4
- name='denseav',
5
- version='0.1.0',
6
- packages=find_packages(),
7
- install_requires=[
8
- 'torch',
9
- 'kornia',
10
- 'omegaconf',
11
- 'pytorch-lightning',
12
- 'torchvision',
13
- 'tqdm',
14
- 'torchmetrics',
15
- 'scikit-learn',
16
- 'numpy',
17
- 'matplotlib',
18
- 'timm==0.4.12',
19
- 'moviepy',
20
- 'hydra-core',
21
- 'peft==0.5.0',
22
- 'av',
23
- 'audioread'
24
- ],
25
- author='Mark Hamilton',
26
- author_email='[email protected]',
27
- description='Official code for the CVPR 2024 Paper: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language',
28
- long_description=open('README.md').read(),
29
- long_description_content_type='text/markdown',
30
- url='https://github.com/mhamilton723/DenseAV',
31
- classifiers=[
32
- 'Programming Language :: Python :: 3',
33
- 'License :: OSI Approved :: MIT License',
34
- 'Operating System :: OS Independent',
35
- ],
36
- python_requires='>=3.6'
37
- )
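Once the package is installed (for example with an editable install from the repository root), the classes referenced throughout this diff import under the `denseav` namespace. A minimal smoke test, assuming the checkpoint name used in `gradio_app.py` above is still available on the Hugging Face Hub:

```python
# Hypothetical smoke test; "mhamilton723/DenseAV-language" is the identifier used in
# gradio_app.py above and is assumed to remain downloadable.
from denseav.train import LitAVAligner

model = LitAVAligner.from_pretrained("mhamilton723/DenseAV-language")
print(type(model).__name__)
```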