VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
Abstract
Video-based multimodal large language models (Video-LLMs) hold significant potential for video understanding tasks. However, most Video-LLMs treat a video as a sequence of individual frames, which results in insufficient temporal-spatial interaction that hinders fine-grained comprehension, and in difficulty processing longer videos due to limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing a Q-Former and integrating temporal contexts into query embeddings with cross-attention. Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.
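The following is a minimal PyTorch sketch of the dual-compressor idea as described in the abstract. All module names, layer counts, dimensions, and the exact wiring are illustrative assumptions, not the authors' implementation: the abstract only states that (1) a memory-enhanced compressor applies a multiscale transformer with a memory-cache mechanism over visual tokens, and (2) a text-perceived compressor uses Q-Former-style learnable queries with temporal/text context injected via cross-attention.

```python
# Hypothetical sketch of VidCompress's dual-compressor design (not the official code).
import torch
import torch.nn as nn


class MemoryEnhancedCompressor(nn.Module):
    """Compress per-clip visual tokens while carrying a memory cache across clips (assumed design)."""

    def __init__(self, dim: int = 768, num_heads: int = 8, memory_size: int = 64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.memory = None                             # long-term cache, updated clip by clip
        self.pool = nn.AdaptiveAvgPool1d(memory_size)  # crude stand-in for multiscale downsampling

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_tokens, dim) for the current clip.
        if self.memory is not None:
            # Prepend the cached summary so the encoder sees long-term context.
            frame_tokens = torch.cat([self.memory, frame_tokens], dim=1)
        encoded = self.encoder(frame_tokens)
        # Keep a fixed-size summary as both the compressed output and the new memory.
        compressed = self.pool(encoded.transpose(1, 2)).transpose(1, 2)
        self.memory = compressed.detach()
        return compressed


class TextPerceivedCompressor(nn.Module):
    """Q-Former-style learnable queries cross-attending to compressed visual tokens (assumed design)."""

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
        # Inject text/temporal context into the query embeddings before cross-attention.
        queries = self.queries.expand(visual_tokens.size(0), -1, -1) + text_embed.mean(dim=1, keepdim=True)
        condensed, _ = self.cross_attn(queries, visual_tokens, visual_tokens)
        return condensed  # condensed visual tokens to be fed to the LLM


if __name__ == "__main__":
    B, T, D = 2, 196, 768
    mem_comp = MemoryEnhancedCompressor(dim=D)
    txt_comp = TextPerceivedCompressor(dim=D)
    clip_tokens = torch.randn(B, T, D)   # visual tokens for one clip
    text_embed = torch.randn(B, 16, D)   # placeholder question embedding
    out = txt_comp(mem_comp(clip_tokens), text_embed)
    print(out.shape)                     # torch.Size([2, 32, 768])
```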