Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes? Mar 5 • 4
STIV: Scalable Text and Image Conditioned Video Generation Paper • 2412.07730 • Published Dec 2024 • 69
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory Paper • 2410.10813 • Published Oct 14 • 9
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models Paper • 2410.05269 • Published Oct 7 • 3
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Paper • 2406.09411 • Published Jun 13 • 18
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Paper • 2403.14624 • Published Mar 21 • 51
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models Paper • 2401.13311 • Published Jan 24 • 10
VideoCon: Robust Video-Language Alignment via Contrast Captions Paper • 2311.10111 • Published Nov 15, 2023 • 7
Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs Paper • 2311.05657 • Published Nov 9, 2023 • 27
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment Paper • 2307.12950 • Published Jul 24, 2023 • 9
DesCo: Learning Object Recognition with Rich Language Descriptions Paper • 2306.14060 • Published Jun 24, 2023 • 1
AVIS: Autonomous Visual Information Seeking with Large Language Models Paper • 2306.08129 • Published Jun 13, 2023 • 5