rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Paper • 2501.04519 • Published 18 days ago • 248
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Paper • 2501.03895 • Published 19 days ago • 48
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Paper • 2501.01957 • Published 23 days ago • 41
ProgCo: Program Helps Self-Correction of Large Language Models Paper • 2501.01264 • Published 24 days ago • 25
Cosmos World Foundation Model Platform for Physical AI Paper • 2501.03575 • Published 19 days ago • 66
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper • 2501.05510 • Published 17 days ago • 38
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published 26 days ago • 41
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints Paper • 2501.03841 • Published 19 days ago • 51
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control Paper • 2501.01427 • Published 24 days ago • 49
VideoRAG: Retrieval-Augmented Generation over Video Corpus Paper • 2501.05874 • Published 16 days ago • 66
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation Paper • 2501.01895 • Published 23 days ago • 50
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published 25 days ago • 98
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Paper • 2412.19723 • Published 30 days ago • 81