Datasets and Evaluation

SSv2-ST (SSv2 Spatio-Temporal dataset)

Pre-processing

Our pre-processing pipeline works as follows. We first extract the first noun chunk of the caption using spaCy and treat it as the subject. The subject is then fed into OWL-ViT-L to obtain bounding boxes. If no bounding boxes are found for a subject, we move on to the next caption in the dataset. If there are at least two bounding boxes, we linearly interpolate bounding boxes for the missing frames. Downloading the dataset is somewhat involved, so you need to follow the instructions here. Once the dataset is downloaded, run generate_ssv2_st.py.
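The linear interpolation step described above can be sketched as follows. This is a minimal illustration, not the code in generate_ssv2_st.py: the function name interpolate_boxes, the (x1, y1, x2, y2) box format, and the choice to leave frames before the first or after the last detection unfilled are all assumptions made for the example.

```python
def interpolate_boxes(keyframes):
    """Linearly interpolate boxes between detected frames.

    keyframes: dict mapping frame index -> (x1, y1, x2, y2) for frames
               where a detection exists (hypothetical format).
    Returns a dict with a box for every frame between the first and
    last detected frame, inclusive.
    """
    if len(keyframes) < 2:
        raise ValueError("need at least two detected boxes to interpolate")

    frames = sorted(keyframes)
    boxes = {}
    # Walk over consecutive pairs of detected frames and fill the gap.
    for start, end in zip(frames, frames[1:]):
        box_start, box_end = keyframes[start], keyframes[end]
        for frame in range(start, end + 1):
            t = (frame - start) / (end - start)  # 0 at start, 1 at end
            boxes[frame] = tuple(
                (1 - t) * a + t * b for a, b in zip(box_start, box_end)
            )
    return boxes


# Detections on frames 0 and 4; frames 1-3 are filled in.
example = interpolate_boxes({0: (0, 0, 10, 10), 4: (8, 8, 18, 18)})
print(example[2])  # (4.0, 4.0, 14.0, 14.0)
```

Each coordinate is interpolated independently, so a box that both moves and changes size between two detections is handled the same way as one that only moves.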

IMC (Interactive Motion Control dataset)

We generate bounding boxes for this dataset using generate_imc.py. The prompts are provided in custom_prompts.csv and filtered_prompts.csv.

For more details regarding the datasets and evaluation strategy, please refer to the Peekaboo paper.