arxiv:2410.11417

VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Published on Oct 15, 2024
Abstract

Video-based multimodal large language models (Video-LLMs) hold significant potential for video understanding tasks. However, most Video-LLMs treat a video as a sequence of individual frames, which leads to insufficient temporal-spatial interaction that hinders fine-grained comprehension, and to difficulty in processing longer videos due to limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing a Q-Former and integrating temporal contexts into query embeddings with cross attention. Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.

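The abstract only names the two compressors, so the following is a minimal PyTorch sketch of how such a dual-compressor stage could be wired, not the authors' implementation: all class names (MemoryEnhancedCompressor, TextPerceivedCompressor), dimensions, token budgets, and the way temporal context is injected into the queries are assumptions made for illustration.

```python
# Illustrative sketch of a dual-compressor front-end, assuming frame features
# from a vision encoder. Not the VidCompress implementation.
import torch
import torch.nn as nn


class MemoryEnhancedCompressor(nn.Module):
    """Assumed design: learnable memory slots attend jointly with frame tokens
    in a transformer, then tokens are pooled down to a fixed budget."""

    def __init__(self, dim=768, num_memory=32, num_layers=2, num_heads=8, out_tokens=64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(1, num_memory, dim) * 0.02)  # memory cache (assumed form)
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.pool = nn.AdaptiveAvgPool1d(out_tokens)  # compress to a fixed number of visual tokens

    def forward(self, frame_tokens):                  # (B, T*N, dim) flattened frame tokens
        mem = self.memory.expand(frame_tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([mem, frame_tokens], dim=1))
        return self.pool(x.transpose(1, 2)).transpose(1, 2)  # (B, out_tokens, dim)


class TextPerceivedCompressor(nn.Module):
    """Assumed Q-Former-style module: learnable queries, offset by a temporal
    context summary, cross-attend to the compressed visual tokens."""

    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, visual_tokens, temporal_context):
        # temporal_context: (B, num_queries, dim), a placeholder for whatever
        # temporal summary is added to the query embeddings (assumed).
        q = self.queries.expand(visual_tokens.size(0), -1, -1) + temporal_context
        attn_out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return attn_out + self.ffn(attn_out)          # (B, num_queries, dim) condensed tokens


if __name__ == "__main__":
    frames = torch.randn(2, 8 * 256, 768)             # e.g. 8 frames x 256 patch tokens (example sizes)
    compact = MemoryEnhancedCompressor()(frames)      # (2, 64, 768)
    ctx = torch.zeros(2, 32, 768)                     # placeholder temporal context
    condensed = TextPerceivedCompressor()(compact, ctx)  # (2, 32, 768), e.g. fed to the LLM
```

Under these assumptions, the first module is responsible for temporal modeling and token reduction across frames, while the second conditions the final condensed tokens on query embeddings via cross attention, mirroring the division of labor the abstract describes.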