Autoregressive Video Generation without Vector Quantization Paper • 2412.14169 • Published 7 days ago • 13
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Paper • 2412.13663 • Published 7 days ago • 103
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published 13 days ago • 74
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding Paper • 2412.10302 • Published 12 days ago • 7
Large Concept Models: Language Modeling in a Sentence Representation Space Paper • 2412.08821 • Published 14 days ago • 7
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published 19 days ago • 121
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints Paper • 2412.07760 • Published 15 days ago • 49
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions Paper • 2412.09596 • Published 13 days ago • 90
LOGO -- Long cOntext aliGnment via efficient preference Optimization Paper • 2410.18533 • Published Oct 24 • 42
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper • 2410.13848 • Published Oct 17 • 31
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures Paper • 2410.13754 • Published Oct 17 • 74