SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Paper • 2412.09604 • Published 13 days ago • 35
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published 19 days ago • 121
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published Nov 15 • 68
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation Paper • 2410.13861 • Published Oct 17 • 52
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models Paper • 2408.02718 • Published Aug 5 • 60
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams Paper • 2406.08085 • Published Jun 12 • 13
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 3 • 93
Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments Paper • 2403.13803 • Published Mar 20
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling Paper • 2401.15977 • Published Jan 29 • 37
ControlLLM: Augment Language Models with Tools by Searching on Graphs Paper • 2310.17796 • Published Oct 26, 2023 • 17
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World Paper • 2308.01907 • Published Aug 3, 2023 • 11
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory Paper • 2305.17144 • Published May 25, 2023 • 2
InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language Paper • 2305.05662 • Published May 9, 2023 • 4
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks Paper • 2305.11175 • Published May 18, 2023 • 3