Words or Vision: Do Vision-Language Models Have Blind Faith in Text? (arXiv:2503.02199)
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (arXiv:2503.06749)
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers (arXiv:2502.15894)
RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers (arXiv:2502.14377)
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above (arXiv:2502.14127)
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning (arXiv:2502.11271)
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models (arXiv:2502.09696)
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU (arXiv:2502.08910)
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation (arXiv:2502.08690)
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding (arXiv:2502.08946)