- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 39
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 20
Collections
Collections including paper arxiv:2412.04424
- Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
  Paper • 2412.15213 • Published • 25
- No More Adam: Learning Rate Scaling at Initialization is All You Need
  Paper • 2412.11768 • Published • 40
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
  Paper • 2412.13663 • Published • 103
- Autoregressive Video Generation without Vector Quantization
  Paper • 2412.14169 • Published • 13

- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
  Paper • 2410.13861 • Published • 52
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
  Paper • 2411.07975 • Published • 27
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Paper • 2411.10442 • Published • 67
- Multimodal Autoregressive Pre-training of Large Vision Encoders
  Paper • 2411.14402 • Published • 42

- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
  Paper • 2411.02959 • Published • 64
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
  Paper • 2411.02355 • Published • 46
- CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation
  Paper • 2410.23090 • Published • 54
- RARe: Retrieval Augmented Retrieval with In-Context Examples
  Paper • 2410.20088 • Published • 5

- A Comparative Study on Automatic Coding of Medical Letters with Explainability
  Paper • 2407.13638 • Published • 5
- Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
  Paper • 2407.07061 • Published • 26
- AgentInstruct: Toward Generative Teaching with Agentic Flows
  Paper • 2407.03502 • Published • 49
- Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
  Paper • 2407.06723 • Published • 10

- nvidia/Nemotron-4-340B-Base
  Updated • 266 • 145
- cognitivecomputations/dolphin-2.9.3-mistral-7B-32k
  Text Generation • Updated • 14.6k • 46
- cognitivecomputations/dolphin-2.9.3-Yi-1.5-34B-32k
  Text Generation • Updated • 2.75k • 18
- cognitivecomputations/dolphin-2.9-llama3-8b
  Text Generation • Updated • 22.3k • 425

- VoCo-LLaMA: Towards Vision Compression with Large Language Models
  Paper • 2406.12275 • Published • 29
- TroL: Traversal of Layers for Large Language and Vision Models
  Paper • 2406.12246 • Published • 34
- Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
  Paper • 2406.15334 • Published • 8
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
  Paper • 2406.12742 • Published • 14

- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 12
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
  Paper • 2406.18521 • Published • 28
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
  Paper • 2408.12590 • Published • 35
- Law of Vision Representation in MLLMs
  Paper • 2408.16357 • Published • 92