buxiangzhiren/VD-IT

referring-video-object-segmentation


	---
	license: ecl-2.0
	---
	VD-IT model

	This is the pre-trained checkpoint for our paper [Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation](https://arxiv.org/abs/2403.12042).

	We use a video diffusion model ([ModelScopeT2V](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)) as our base model, applying prompt tuning to adapt it as a visual backbone for downstream video understanding tasks.

	### Model training
	We first pre-train our model on Ref-COCO and then fine-tune it on Ref-YouTube-VOS. Training uses
	two NVIDIA A100 GPUs and processes 5 frames per clip for 9 epochs. The initial learning rate is 5e-5 and is reduced by a factor of 10 at the 6th and 8th epochs.
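
The step schedule described above can be sketched in plain Python. This is a minimal illustration, not code from the VD-IT repository: the helper name `lr_at_epoch` and the constants are hypothetical, and it assumes that epochs are counted from 1 and that each reduction takes effect at the start of the named epoch.

```python
# Hypothetical helper illustrating the stated schedule; not from the VD-IT code base.
BASE_LR = 5e-5          # initial learning rate stated in the model card
DECAY_EPOCHS = (6, 8)   # epochs at which the rate drops (assumed 1-indexed)
DECAY_FACTOR = 0.1      # "reduced by a factor of 10"
NUM_EPOCHS = 9          # total number of training epochs


def lr_at_epoch(epoch: int) -> float:
    """Return the learning rate in effect during the given 1-indexed epoch."""
    if not 1 <= epoch <= NUM_EPOCHS:
        raise ValueError(f"epoch must be in 1..{NUM_EPOCHS}, got {epoch}")
    # Count how many decay points have been reached by this epoch.
    drops = sum(1 for decay_epoch in DECAY_EPOCHS if epoch >= decay_epoch)
    return BASE_LR * DECAY_FACTOR ** drops


if __name__ == "__main__":
    for epoch in range(1, NUM_EPOCHS + 1):
        print(f"epoch {epoch}: lr = {lr_at_epoch(epoch):.1e}")
```

Under these assumptions the rate is 5e-5 for epochs 1 to 5, 5e-6 for epochs 6 and 7, and 5e-7 for epochs 8 and 9. In a PyTorch training loop, a similar schedule is commonly expressed with `torch.optim.lr_scheduler.MultiStepLR` and `gamma=0.1`; whether the authors used that scheduler is not stated here.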