Datasets and Evaluation
SSv2-ST (SSv2 Spatio-Temporal dataset)
Pre-processing
The pre-processing pipeline works as follows. We first extract the first noun chunk of the caption using spaCy. This subject is then passed to OWL-ViT-L to obtain bounding boxes. If no bounding boxes are found for a subject, we use the next caption from the dataset. If there are at least two bounding boxes, we linearly interpolate bounding boxes for the missing frames.
Downloading the dataset is somewhat involved; follow the instructions here. Once the dataset is downloaded, run generate_ssv2_st.py.
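The linear interpolation of bounding boxes for missing frames can be sketched as below. This is a minimal illustration and not the code in generate_ssv2_st.py: the function name interpolate_boxes, the (x_min, y_min, x_max, y_max) box layout, and the choice to hold the nearest detected box constant before the first and after the last detection are all assumptions.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def interpolate_boxes(detections: Dict[int, Box], num_frames: int) -> List[Box]:
    """Fill in a box for every frame from a sparse set of detections.

    `detections` maps a frame index to the box detected in that frame.
    Frames between two detections are interpolated linearly; frames
    outside the detected range reuse the nearest detected box
    (an assumption made for this sketch).
    """
    if len(detections) < 2:
        raise ValueError("need at least two detected boxes to interpolate")

    known = sorted(detections)  # frame indices that have a detection
    boxes: List[Box] = []
    for t in range(num_frames):
        if t <= known[0]:
            boxes.append(detections[known[0]])
        elif t >= known[-1]:
            boxes.append(detections[known[-1]])
        else:
            # Locate the detected frames on either side of frame t.
            i = bisect_right(known, t)
            lo, hi = known[i - 1], known[i]
            w = (t - lo) / (hi - lo)  # 0 at frame lo, approaching 1 at frame hi
            a, b = detections[lo], detections[hi]
            boxes.append(tuple(p + w * (q - p) for p, q in zip(a, b)))
    return boxes


if __name__ == "__main__":
    # Hypothetical detections in frames 0 and 4 of a 6-frame clip.
    sparse = {0: (0, 0, 10, 10), 4: (8, 8, 18, 18)}
    for frame, box in enumerate(interpolate_boxes(sparse, num_frames=6)):
        print(frame, box)
```

Each coordinate is interpolated independently, so a box that both moves and changes size between two detections is handled without extra logic.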
Interactive Motion Control - IMC
We generate bounding boxes for this dataset using the generate_imc.py file. The prompts are in custom_prompts.csv and filtered_prompts.csv.
For more details regarding the datasets and evaluation strategy, please refer to the Peekaboo paper.