Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis Paper • 2412.15322 • Published 7 days ago • 16
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published 13 days ago • 131
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers Paper • 2412.09611 • Published 14 days ago • 9
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models Paper • 2412.07674 • Published 16 days ago • 20
LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation Paper • 2412.05148 • Published 20 days ago • 11
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics Paper • 2412.07774 • Published 16 days ago • 25
Mind the Time: Temporally-Controlled Multi-Event Video Generation Paper • 2412.05263 • Published 20 days ago • 10
AIMv2 Collection • A collection of AIMv2 vision encoders that supports a number of resolutions, native resolution, and a distilled checkpoint. • 19 items • Updated Nov 22 • 67
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity Paper • 2411.15411 • Published Nov 23 • 7
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE Paper • 2411.16856 • Published about 1 month ago • 11
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator Paper • 2411.15466 • Published Nov 23 • 34
Material Anything: Generating Materials for Any 3D Object via Diffusion Paper • 2411.15138 • Published Nov 22 • 42