2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published 22 days ago • 97
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Paper • 2412.19723 • Published 27 days ago • 81
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published Dec 5, 2024 • 105
PaliGemma 2: A Family of Versatile VLMs for Transfer Paper • 2412.03555 • Published Dec 4, 2024 • 124
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 79
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives Paper • 2501.04003 • Published 16 days ago • 24
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published 23 days ago • 41
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published 13 days ago • 59
DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests Paper • 2501.04671 • Published 15 days ago