- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 71
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 48
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  Paper • 2311.10122 • Published • 26
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
  Paper • 2311.16103 • Published • 1
Collections including paper arxiv:2311.10122
- MM-VID: Advancing Video Understanding with GPT-4V(ision)
  Paper • 2310.19773 • Published • 19
- Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
  Paper • 2310.05863 • Published • 1
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
  Paper • 2311.06242 • Published • 79
- I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization
  Paper • 2311.10126 • Published • 7
- GLaMM: Pixel Grounding Large Multimodal Model
  Paper • 2311.03356 • Published • 33
- CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
  Paper • 2311.03354 • Published • 4
- CogVLM: Visual Expert for Pretrained Language Models
  Paper • 2311.03079 • Published • 23
- UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework
  Paper • 2311.10125 • Published • 4