Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Abstract
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code are released at https://github.com/open-compass/Creation-MMBench.
Community
Creation-MMBench is a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs. It features three main aspects:
Comprehensive Creation Benchmark for MLLM and LLM. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. Each case provides an MLLM with images and context, including role, background information, and instructions (a schematic test case is sketched after this list). To further explore the impact of visual instruction tuning, we transformed Creation-MMBench into a text-only variant, Creation-MMBench-TO, by replacing image inputs with corresponding textual descriptions.
Robust Evaluation Methodology. Creation-MMBench includes carefully crafted instance-specific criteria for each test case, enabling assessment of both general response quality and visual-factual alignment in model-generated content. The evaluation strategy for Creation-MMBench pairs this Dual Evaluation setting with GPT-4o as the judge model, as illustrated in the sketch after this list.
Insightful Experimental Findings. The results highlight the current limitations of MLLMs in context-aware creativity and vision-based language generation, offering valuable guidance for future research and development. Comparisons between each open-source MLLM and its base LLM indicate that, after visual instruction tuning, the MLLM consistently performs worse than the corresponding base LLM on Creation-MMBench-TO. More detailed results are presented in the paper.
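For illustration, the sketch below shows one way a Creation-MMBench test case and the Dual Evaluation protocol could be wired together in Python. The dataclass fields, judge prompts, and scoring scale are assumptions made for exposition; they do not reproduce the official evaluation code in the repository.

```python
# Illustrative sketch only -- field names, prompts, and scoring scales are
# hypothetical and do not mirror the official Creation-MMBench pipeline.
from dataclasses import dataclass, field

from openai import OpenAI  # GPT-4o serves as the judge model

client = OpenAI()


@dataclass
class CreationCase:
    """One of the 765 test cases (hypothetical schema)."""
    task: str                    # one of the 51 fine-grained tasks
    role: str                    # persona the model should adopt
    background: str              # contextual information
    instruction: str             # the creative request
    images: list[str] = field(default_factory=list)               # image paths
    image_descriptions: list[str] = field(default_factory=list)   # used by Creation-MMBench-TO
    criteria: str = ""           # instance-specific evaluation criteria


def judge_with_gpt4o(case: CreationCase, response: str, reference: str) -> dict:
    """Dual Evaluation sketch: criteria-based factuality scoring plus pairwise comparison."""
    # (a) Visual factuality: score the response against the instance-specific criteria.
    # The actual judge also receives the image content; omitted here for brevity.
    factuality = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Criteria:\n{case.criteria}\n\nResponse:\n{response}\n\n"
                "Rate the factual consistency with the visual input on a 1-10 scale."
            ),
        }],
    ).choices[0].message.content

    # (b) General quality: compare the response against a reference answer.
    comparison = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Role:\n{case.role}\n\nBackground:\n{case.background}\n\n"
                f"Instruction:\n{case.instruction}\n\n"
                f"Response A:\n{response}\n\nResponse B:\n{reference}\n\n"
                "Which response better fulfils the role, background, and instruction? "
                "Answer 'A', 'B', or 'Tie'."
            ),
        }],
    ).choices[0].message.content

    return {"factuality": factuality, "comparison": comparison}
```

In this hypothetical setup, dropping the `images` field and prompting with `image_descriptions` instead yields the text-only Creation-MMBench-TO variant described above.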
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges (2025)
- MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems (2025)
- WritingBench: A Comprehensive Benchmark for Generative Writing (2025)
- A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models (2025)
- MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding (2025)
- InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models (2025)
- VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity (2025)