Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Existing benchmarks fail to comprehensively evaluate MLLMs' reasoning abilities due to the lack of an explicit categorization of logical reasoning types and an unclear understanding of what constitutes reasoning.
In this paper, we introduce MME-Reasoning, a comprehensive benchmark specifically designed to evaluate the reasoning capability of MLLMs. MME-Reasoning consists of 1,188 carefully curated questions that systematically cover the three types of logical reasoning (inductive, deductive, and abductive) while spanning a range of difficulty levels.
Experiments were conducted on state-of-the-art MLLMs, covering both Chat and Thinking model types across open-source and closed-source families. Evaluations with MME-Reasoning reveal the following key findings:
(1) MLLMs exhibit significant limitations and pronounced imbalances in reasoning capabilities. Even the most advanced MLLMs achieve only limited results under holistic logical reasoning evaluation: Gemini-Pro-2.5-Thinking scores only 60.19%, followed by Seed1.5-VL (59.85%) and o4-mini (57.49%);
(2) Abductive reasoning remains a major bottleneck for current MLLMs. Closed-source models exhibit an average gap of 5.38 points between deductive and abductive tasks, and this gap widens to 9.81 points among open-source models;
(3) Reasoning length scales with task difficulty and benefits performance, but with diminishing returns and decreasing token efficiency.
We hope MME-Reasoning serves as a foundation for advancing multimodal reasoning in MLLMs.