Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2412.04424

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6 • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6 • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7 • 39
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7 • 20

Paper - Multimodal

Paper related to Multimodal Model - Research for a : Modular, Multimodal, Multi-Stream, Mixture of Expert, Universal Transformer, Matryoshka embedding

about 16 hours ago

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Paper • 2412.15213 • Published 6 days ago • 25
No More Adam: Learning Rate Scaling at Initialization is All You Need

Paper • 2412.11768 • Published 9 days ago • 40
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published 7 days ago • 103
Autoregressive Video Generation without Vector Quantization

Paper • 2412.14169 • Published 7 days ago • 13

Computer Vision in Focus: Advances in Computer Vision and Video Understanding

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Paper • 2412.04424 • Published 20 days ago • 55

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Paper • 2412.04424 • Published 20 days ago • 55

MutiModal_Paper

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Paper • 2410.13861 • Published Oct 17 • 52
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Paper • 2411.07975 • Published Nov 12 • 27
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Paper • 2411.10442 • Published Nov 15 • 67
Multimodal Autoregressive Pre-training of Large Vision Encoders

Paper • 2411.14402 • Published Nov 21 • 42

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

Paper • 2411.02959 • Published Nov 5 • 64
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Paper • 2411.02355 • Published Nov 4 • 46
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation

Paper • 2410.23090 • Published Oct 30 • 54
RARe: Retrieval Augmented Retrieval with In-Context Examples

Paper • 2410.20088 • Published Oct 26 • 5

A Comparative Study on Automatic Coding of Medical Letters with Explainability

Paper • 2407.13638 • Published Jul 18 • 5
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Paper • 2407.07061 • Published Jul 9 • 26
AgentInstruct: Toward Generative Teaching with Agentic Flows

Paper • 2407.03502 • Published Jul 3 • 49
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Paper • 2407.06723 • Published Jul 9 • 10

fundamental-wave

nvidia/Nemotron-4-340B-Base

Updated Jun 28 • 266 • 145
cognitivecomputations/dolphin-2.9.3-mistral-7B-32k

Text Generation • Updated Oct 30 • 14.6k • 46
cognitivecomputations/dolphin-2.9.3-Yi-1.5-34B-32k

Text Generation • Updated Jun 23 • 2.75k • 18
cognitivecomputations/dolphin-2.9-llama3-8b

Text Generation • Updated May 20 • 22.3k • 425

Multi-modality LVM

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Paper • 2406.12275 • Published Jun 18 • 29
TroL: Traversal of Layers for Large Language and Vision Models

Paper • 2406.12246 • Published Jun 18 • 34
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Paper • 2406.15334 • Published Jun 21 • 8
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Paper • 2406.12742 • Published Jun 18 • 14

General Multimodal Learning

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

Paper • 2401.14405 • Published Jan 25 • 12
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Paper • 2406.18521 • Published Jun 26 • 28
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Paper • 2408.12590 • Published Aug 22 • 35
Law of Vision Representation in MLLMs

Paper • 2408.16357 • Published Aug 29 • 92

Previous
1
2
Next

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs