Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models • Paper • 2503.06749 • Published 17 days ago • 24
Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers? • Paper • 2503.10632 • Published 13 days ago • 12
WebArena: A Realistic Web Environment for Building Autonomous Agents • Paper • 2307.13854 • Published Jul 25, 2023 • 26
Qwen2-VL • Collection • Vision-language model series based on Qwen2 • 16 items • Updated Dec 6, 2024 • 209
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters • Paper • 2408.03314 • Published Aug 6, 2024 • 59
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation • Paper • 2402.10210 • Published Feb 15, 2024 • 35
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation • Paper • 2401.05675 • Published Jan 11, 2024 • 25
Video ReCap: Recursive Captioning of Hour-Long Videos • Paper • 2402.13250 • Published Feb 20, 2024 • 26
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? • Paper • 2307.16368 • Published Jul 31, 2023 • 12