EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents Paper • 2501.11858 • Published 5 days ago • 2
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Paper • 2501.13826 • Published 3 days ago • 18
Temporal Preference Optimization for Long-Form Video Understanding Paper • 2501.13919 • Published 3 days ago • 17
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published 4 days ago • 66
MSTS: A Multimodal Safety Test Suite for Vision-Language Models Paper • 2501.10057 • Published 9 days ago • 8
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model Paper • 2501.12368 • Published 5 days ago • 35
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published 5 days ago • 75
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation Paper • 2501.09755 • Published 10 days ago • 33
FAST: Efficient Action Tokenization for Vision-Language-Action Models Paper • 2501.09747 • Published 10 days ago • 22
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks Paper • 2501.08326 • Published 12 days ago • 31
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot Paper • 2501.09012 • Published 11 days ago • 10
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding Paper • 2501.07783 • Published 12 days ago • 7
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents Paper • 2501.08828 • Published 11 days ago • 28
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding Paper • 2501.07888 • Published 12 days ago • 13
A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following Paper • 2501.08187 • Published 12 days ago • 24
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction Paper • 2501.06282 • Published 16 days ago • 40
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published 16 days ago • 59
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper • 2501.05510 • Published 17 days ago • 38
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis Paper • 2501.04377 • Published 18 days ago • 14
An Empirical Study of Autoregressive Pre-training from Videos Paper • 2501.05453 • Published 17 days ago • 37