MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

[📖 Project] [📄 Paper] [💻 Code] [📝 Dataset] [🤖 Evaluation Model] [🏆 Leaderboard]

We introduce MMIE, a large-scale, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). With 20,103 examples spanning 12 fields and 102 subfields, MMIE provides a rigorous testbed for measuring how deeply models understand, and generate, interleaved text and images.
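For a quick first look at the data, here is a minimal loading sketch. The repository id `MMIE/MMIE` and the split name are assumptions; consult the dataset card linked above for the actual values and schema.

```python
from datasets import load_dataset

# Assumed repository id and split name -- check the dataset card
# linked above for the actual values.
ds = load_dataset("MMIE/MMIE", split="test")

print(len(ds))   # expected on the order of 20K examples
print(ds[0])     # inspect one interleaved question
```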

🔑 Key Features:

  • 🗂 Comprehensive Dataset: With 20,103 interleaved multimodal questions, MMIE provides a rich foundation for evaluating models across diverse domains.
  • 🔍 Ground Truth Reference: Each query includes a reliable reference, ensuring model outputs are measured accurately.
  • ⚙ Automated Scoring with MMIE-Score: Our scoring model correlates strongly with human judgments, outperforming prior approaches, including GPT-4o-based scoring, on multimodal tasks (see the scoring sketch after this list).
  • 🔎 Bias Mitigation: The scoring model is fine-tuned to reduce systematic scoring bias, enabling more objective model evaluations.
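As a rough sketch of how automated scoring with MMIE-Score could be invoked, the snippet below loads a scorer with `transformers` and asks it to grade one answer against its reference. The repository id `MMIE/MMIE-Score` and the prompt format are assumptions rather than the project's documented interface, and the sketch simplifies to a text-only scorer; the actual evaluation model handles interleaved images as well. See the evaluation-model link above for real usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id -- replace with the evaluation model linked above.
MODEL_ID = "MMIE/MMIE-Score"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Hypothetical prompt format: the scorer sees the question, the
# ground-truth reference, and the candidate answer, and emits a score.
prompt = (
    "Question: {q}\n"
    "Reference answer: {ref}\n"
    "Model answer: {ans}\n"
    "Score the model answer from 0 to 100:"
).format(q="...", ref="...", ans="...")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)

# Decode only the newly generated tokens (the score).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```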

🔍 Key Insights:

  1. 🧠 In-depth Evaluation: Covering 12 major fields (mathematics, coding, literature, and more) with 102 subfields for a comprehensive test across competencies.
  2. 📈 Challenging the Best: Even top models like GPT-4o + SDXL peak at 65.47%, highlighting room for growth in LVLMs.
  3. 🌐 Designed for Interleaved Tasks: The benchmark evaluates interleaved text-and-image comprehension and generation in both multiple-choice and open-ended formats (see the record sketch below).
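To make the two formats concrete, here is a sketch of what one record of each kind might look like. All field names here are illustrative assumptions, not the dataset's actual schema; check the dataset card for the real structure.

```python
# Illustrative (assumed) record shapes -- not the dataset's actual schema.

multiple_choice_example = {
    "id": "math-0001",
    "question": [  # interleaved text and image segments
        {"type": "text", "value": "Which figure completes the sequence?"},
        {"type": "image", "value": "images/math-0001-q.png"},
    ],
    "options": ["A", "B", "C", "D"],
    "answer": "B",  # single ground-truth choice
}

open_ended_example = {
    "id": "literature-0042",
    "question": [
        {"type": "text",
         "value": "Illustrate the opening scene and describe its mood."},
    ],
    "answer": [  # interleaved ground-truth reference used for scoring
        {"type": "image", "value": "images/literature-0042-a.png"},
        {"type": "text",
         "value": "A dim, rain-streaked street sets a melancholic tone..."},
    ],
}
```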