Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models Paper • 2503.06749 • Published 16 days ago • 24
Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers? Paper • 2503.10632 • Published 12 days ago • 12
WebArena: A Realistic Web Environment for Building Autonomous Agents Paper • 2307.13854 • Published Jul 25, 2023 • 26
Qwen2-VL Collection Vision-language model series based on Qwen2 • 16 items • Updated Dec 6, 2024 • 209
openai/clip-vit-large-patch14 Zero-Shot Image Classification • Updated Sep 15, 2023 • 38.6M • 1.68k