# DenseAV Demonstration Notebook

> ⚠️ Change your Colab runtime to a T4 GPU before running this notebook.

In this notebook we will walk through how to load, visualize, and work with our catalog of pre-trained models.

## Set up Google Colab
> ⚠️ Skip this section if you are not on Google Colab.
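
If you are unsure whether the notebook is running on Colab, a small check can tell you whether the setup cells apply. The sketch below is an illustration, not part of DenseAV: `running_in_colab` is a hypothetical helper, and it relies on the common convention that Colab kernels have the `google.colab` module loaded.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab module is loaded in this interpreter."""
    # Assumption: a Colab kernel has google.colab in sys.modules; a local
    # Jupyter or plain Python session normally does not.
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: run the setup cells in this section.")
else:
    print("Not on Colab: skip ahead to the imports section.")
```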


In [1]:
!git clone https://github.com/mhamilton723/DenseAV

fatal: destination path 'DenseAV' already exists and is not an empty directory.


In [2]:
!pip install av



In [3]:
import os
os.chdir("DenseAV/")
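
Re-running the setup cells can fail: the clone output above shows the directory already existing, and a second `os.chdir("DenseAV/")` from inside the checkout would look for a nested `DenseAV` folder. A minimal sketch of a re-runnable variant follows; `enter_repo` is a hypothetical helper, and it assumes the checkout directory is named exactly `DenseAV`.

```python
import os


def enter_repo(repo_dir: str = "DenseAV") -> str:
    """Change into repo_dir unless the current directory already is that checkout."""
    cwd = os.getcwd()
    if os.path.basename(cwd) == repo_dir:
        # Already inside the checkout, e.g. because this cell was run before.
        return cwd
    os.chdir(repo_dir)
    return os.getcwd()


# Usage (in place of the plain os.chdir call):
# enter_repo("DenseAV")
```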

In [4]:
!pip install -e .

Obtaining file:///content/DenseAV
 Preparing metadata (setup.py) ... done
Installing collected packages: denseav
 Attempting uninstall: denseav
 Found existing installation: denseav 0.1.0
 Uninstalling denseav-0.1.0:
 Successfully uninstalled denseav-0.1.0
 Running setup.py develop for denseav
Successfully installed denseav-0.1.0


## Import dependencies and load a pretrained DenseAV Model


In [5]:
from os.path import join

import torch
import torchvision
import torchvision.transforms as T
from PIL import Image
from torchaudio.functional import resample

from denseav.plotting import plot_attention_video, plot_2head_attention_video, plot_feature_video, display_video_in_notebook
from denseav.shared import norm, crop_to_divisor, blur_dim

In [6]:
model_name = "sound_and_language"
video_path = "samples/puppies.mp4"
result_dir = "results"
load_size = 224
plot_size = 224
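
The configuration defines `result_dir`, while the plotting cells later in this notebook write to literal `results/...` paths. Whether the plotting helpers create that directory themselves is not shown here, so one cautious option is to build output paths from `result_dir` and create the folder up front. `output_path` below is a hypothetical helper for illustration.

```python
import os
from os.path import join

result_dir = "results"  # same value as in the configuration cell


def output_path(filename: str, directory: str = result_dir) -> str:
    """Create directory if it is missing and return the path of filename inside it."""
    os.makedirs(directory, exist_ok=True)
    return join(directory, filename)


# Usage: output_path("attention.mp4") returns the path of attention.mp4
# inside result_dir and makes sure that directory exists.
```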

In [7]:
model = torch.hub.load('mhamilton723/DenseAV', model_name).cuda()

Using cache found in /root/.cache/torch/hub/mhamilton723_DenseAV_main
INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.9.4 to v2.2.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint https:/marhamilresearch4.blob.core.windows.net/denseav-public/hub/denseav_2head.ckpt`
Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

Some weights of HubertModel were not initialized from the model checkpoint at facebook/hubert-large-ls960-ft and are newly

trainable params: 147,456 || all params: 21,817,728 || trainable%: 0.6758540577644016


## Load a sample video and prepare it for DenseAV

In [8]:
# read_video returns frames as (T, H, W, C) uint8 and audio as (channels, samples)
original_frames, audio, info = torchvision.io.read_video(video_path, pts_unit='sec')
sample_rate = 16000

# Resample the audio to 16 kHz if needed, then keep only the first channel
if info["audio_fps"] != sample_rate:
    audio = resample(audio, info["audio_fps"], sample_rate)
audio = audio[0].unsqueeze(0)

# Model input: resize, crop to a size divisible by 8, scale to [0, 1], normalize
img_transform = T.Compose([
    T.Resize(load_size, Image.BILINEAR),
    lambda x: crop_to_divisor(x, 8),
    lambda x: x.to(torch.float32) / 255,
    norm])

frames = torch.cat([img_transform(f.permute(2, 0, 1)).unsqueeze(0) for f in original_frames], axis=0)

# Plotting input: the same steps without the final normalization
plotting_img_transform = T.Compose([
    T.Resize(plot_size, Image.BILINEAR),
    lambda x: crop_to_divisor(x, 8),
    lambda x: x.to(torch.float32) / 255])

frames_to_plot = plotting_img_transform(original_frames.permute(0, 3, 1, 2))
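
`T.Resize` with a single integer matches the shorter side of each frame to that size, so the longer side is not guaranteed to be a multiple of 8. The `crop_to_divisor(x, 8)` step handles that. Its implementation is not shown in this notebook; the sketch below is only one plausible version of the arithmetic, assuming a centered crop, and `divisible_crop_box` is a hypothetical name.

```python
def divisible_crop_box(height: int, width: int, divisor: int = 8):
    """Return (top, left, new_height, new_width) of a centered crop whose
    sides are the largest multiples of divisor that fit in the input."""
    new_height = height - height % divisor
    new_width = width - width % divisor
    # Assumption: the crop is centered; the real helper may anchor it elsewhere.
    top = (height - new_height) // 2
    left = (width - new_width) // 2
    return top, left, new_height, new_width


print(divisible_crop_box(224, 398))  # (0, 3, 224, 392)
```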

## Use DenseAV to obtain dense AV-aligned features

In [9]:
with torch.no_grad():
    audio_feats = model.forward_audio({"audio": audio.cuda()})
    audio_feats = {k: v.cpu() for k, v in audio_feats.items()}
    image_feats = model.forward_image({"frames": frames.unsqueeze(0).cuda()}, max_batch_size=2)
    image_feats = {k: v.cpu() for k, v in image_feats.items()}

    sim_by_head = model.sim_agg.get_pairwise_sims(
        {**image_feats, **audio_feats},
        raw=False,
        agg_sim=False,
        agg_heads=False
    ).mean(dim=-2).cpu()

    sim_by_head = blur_dim(sim_by_head, window=3, dim=-1)
    print(sim_by_head.shape)

torch.Size([181, 2, 14, 14, 33])
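
The `blur_dim(sim_by_head, window=3, dim=-1)` call smooths the similarities along the last of the five axes printed above. The kernel and padding that `blur_dim` uses are not shown in this notebook. As a stand-in, the sketch below applies a plain moving average to a 1-D list to show what a window of 3 does to an isolated spike; `box_blur` is a hypothetical name and expects an odd window.

```python
def box_blur(values, window=3):
    """Moving average over a 1-D list, repeating the edge values as padding."""
    half = window // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    # Each output is the mean of `window` neighbours centered on that position.
    return [sum(padded[i:i + window]) / window for i in range(len(values))]


print(box_blur([0.0, 0.0, 3.0, 0.0, 0.0]))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

The spike is spread over its neighbours while the total is preserved, which reduces flicker when the values are rendered as one heatmap per step.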


## Visualize Cross-Modal Attention

In [28]:
plot_attention_video(
    sim_by_head,
    frames_to_plot,
    audio,
    info["video_fps"],
    sample_rate,
    "results/attention.mp4")
display_video_in_notebook("results/attention.mp4")

Moviepy - Building video results/attention.mp4.
MoviePy - Writing audio in attentionTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video results/attention.mp4
Moviepy - Done !
Moviepy - video ready results/attention.mp4
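
The plotting call receives both `info["video_fps"]` and `sample_rate`, which is the information needed to line video frames up with the audio track. How `plot_attention_video` does this internally is not shown here. The sketch below only illustrates the underlying arithmetic; `audio_span_for_frame` is a hypothetical name, and the 30 fps in the example is an assumed value rather than the actual rate of the sample video.

```python
def audio_span_for_frame(frame_index: int, video_fps: float, sample_rate: int = 16000):
    """Return the [start, end) audio sample indices that play while one frame is shown."""
    start = round(frame_index / video_fps * sample_rate)
    end = round((frame_index + 1) / video_fps * sample_rate)
    return start, end


# Frame 30 of a 30 fps video starts exactly one second in.
print(audio_span_for_frame(30, 30.0))  # (16000, 16533)
```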


## Visualize Cross-Modal Attention by Head to Disentangle Sound and Language

In [29]:
if model_name == "sound_and_language":
    plot_2head_attention_video(
        sim_by_head,
        frames_to_plot,
        audio,
        info["video_fps"],
        sample_rate,
        "results/2head_attention.mp4")
    display_video_in_notebook("results/2head_attention.mp4")

Moviepy - Building video results/2head_attention.mp4.
MoviePy - Writing audio in 2head_attentionTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video results/2head_attention.mp4
Moviepy - Done !
Moviepy - video ready results/2head_attention.mp4


## Plot Deep Features

In [30]:
plot_feature_video(
    image_feats["image_feats"].cpu(),
    audio_feats["audio_feats"].cpu(),
    frames_to_plot,
    audio,
    info["video_fps"],
    sample_rate,
    "results/visual_features.mp4",
    "results/audio_features.mp4",
)
display_video_in_notebook("results/visual_features.mp4")
display_video_in_notebook("results/audio_features.mp4")

Moviepy - Building video results/visual_features.mp4.
MoviePy - Writing audio in visual_featuresTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video results/visual_features.mp4
Moviepy - Done !
Moviepy - video ready results/visual_features.mp4
Moviepy - Building video results/audio_features.mp4.
MoviePy - Writing audio in audio_featuresTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video results/audio_features.mp4
Moviepy - Done !
Moviepy - video ready results/audio_features.mp4
