-
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
Paper • 2411.02959 • Published • 70 -
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
Paper • 2411.03047 • Published • 9 -
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
Paper • 2411.02336 • Published • 24 -
GenXD: Generating Any 3D and 4D Scenes
Paper • 2411.02319 • Published • 20
Collections
Discover the best community collections!
Collections including paper arxiv:2411.10442
-
Rethinking Data Selection at Scale: Random Selection is Almost All You Need
Paper • 2410.09335 • Published • 17 -
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Paper • 2410.06456 • Published • 37 -
Emergent properties with repeated examples
Paper • 2410.07041 • Published • 8 -
Personalized Visual Instruction Tuning
Paper • 2410.07113 • Published • 70
-
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Paper • 2410.17637 • Published • 36 -
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 80 -
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper • 2411.18203 • Published • 36 -
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Paper • 2411.14432 • Published • 25
-
LLM Pruning and Distillation in Practice: The Minitron Approach
Paper • 2408.11796 • Published • 58 -
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Paper • 2408.09174 • Published • 52 -
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper • 2408.10914 • Published • 42 -
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Paper • 2408.11878 • Published • 57
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 16 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 89 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 32
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 26 -
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 30 -
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 31 -
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • Published • 30