Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents Paper • 2508.05954 • Published 10 days ago • 6 • 2
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Paper • 2507.13344 • Published Jul 17 • 55 • 2
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning Paper • 2506.03525 • Published Jun 4 • 6 • 2
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance Paper • 2505.21876 • Published May 28 • 9 • 2
CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting Paper • 2504.15485 • Published Apr 21 • 5 • 2
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding Paper • 2411.04952 • Published Nov 7, 2024 • 30 • 4
VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement Paper • 2411.15115 • Published Nov 22, 2024 • 9 • 3