---
license: apache-2.0
---
# Video-LLaVA-Seg
This is the official baseline implementation for the ViCaS dataset. This repository hosts the pretrained model, which was optimized on subsets of WebVid10M and Panda70M for video captioning. The final model, finetuned on ViCaS, is hosted here.
For details about setting up the model, refer to the Video-LLaVA-Seg GitHub repo.
For details about downloading the dataset and running the benchmark evaluation, refer to the ViCaS GitHub repo.