-
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
Paper • 2411.06558 • Published • 34 -
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
Paper • 2411.09944 • Published • 12 -
Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
Paper • 2411.19460 • Published • 10 -
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Paper • 2412.05237 • Published • 46
Collections
Discover the best community collections!
Collections including paper arxiv:2412.09871
-
Differential Transformer
Paper • 2410.05258 • Published • 168 -
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 119 -
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 104 -
o1-Coder: an o1 Replication for Coding
Paper • 2412.00154 • Published • 41
-
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
Paper • 2409.08513 • Published • 11 -
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Paper • 2409.08264 • Published • 43 -
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 74 -
LLMs + Persona-Plug = Personalized LLMs
Paper • 2409.11901 • Published • 31
-
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Paper • 2407.08083 • Published • 27 -
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper • 2408.11039 • Published • 58 -
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Paper • 2408.15237 • Published • 37 -
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Paper • 2409.11355 • Published • 28
-
Human-like Episodic Memory for Infinite Context LLMs
Paper • 2407.09450 • Published • 59 -
MUSCLE: A Model Update Strategy for Compatible LLM Evolution
Paper • 2407.09435 • Published • 20 -
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Paper • 2407.09121 • Published • 5 -
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
Paper • 2407.14482 • Published • 25
-
Depth Anything V2
Paper • 2406.09414 • Published • 95 -
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 50 -
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
Paper • 2406.04338 • Published • 34 -
SAM 2: Segment Anything in Images and Videos
Paper • 2408.00714 • Published • 109
-
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Paper • 2311.17049 • Published • 1 -
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 14 -
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Paper • 2303.17376 • Published -
Sigmoid Loss for Language Image Pre-Training
Paper • 2303.15343 • Published • 5
-
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Paper • 2404.15653 • Published • 26 -
MoDE: CLIP Data Experts via Clustering
Paper • 2404.16030 • Published • 12 -
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Paper • 2405.12130 • Published • 46 -
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
Paper • 2405.12981 • Published • 28
-
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
Paper • 2404.14047 • Published • 44 -
Reasoning in Large Language Models: A Geometric Perspective
Paper • 2407.02678 • Published • 1 -
Natural Language Reinforcement Learning
Paper • 2411.14251 • Published • 27 -
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 76