InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning Paper • 2505.13888 • Published May 20
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models Paper • 2410.00363 • Published Oct 1, 2024
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT Paper • 2406.18583 • Published Jun 5, 2024
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers Paper • 2405.05945 • Published May 9, 2024
Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation Paper • 2508.06426 • Published 9 days ago
CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model Paper • 2403.08350 • Published Mar 13, 2024
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis Paper • 2501.04561 • Published Jan 8
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct Paper • 2409.05840 • Published Sep 9, 2024
Text-Video Retrieval with Global-Local Semantic Consistent Learning Paper • 2405.12710 • Published May 21, 2024
Channel Importance Matters in Few-Shot Image Classification Paper • 2206.08126 • Published Jun 16, 2022
Rectifying the Shortcut Learning of Background for Few-Shot Learning Paper • 2107.07746 • Published Jul 16, 2021