MME-Reasoning

A Comprehensive Benchmark for Logical Reasoning in MLLMs

1FDU, 2CUHK MMLab, 3Shanghai AI Laboratory, 4USTC, 5NJU
*Equal contribution, Corresponding author

Introduction

Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Existing benchmarks fail to comprehensively evaluate the reasoning abilities of MLLMs because they lack an explicit categorization of logical reasoning types and rest on an unclear understanding of reasoning.

In this paper, we introduce MME-Reasoning, a comprehensive benchmark specifically designed to evaluate the reasoning capability of MLLMs. MME-Reasoning consists of 1,188 carefully curated questions that systematically cover the three types of logical reasoning (inductive, deductive, and abductive) while spanning a range of difficulty levels.

Experiments were conducted on state-of-the-art MLLMs, covering both Chat and Thinking model types across open-source and closed-source families. Evaluations with MME-Reasoning reveal the following key findings:

(1) MLLMs exhibit significant limitations and pronounced imbalances in reasoning capabilities. Even the most advanced MLLMs achieve only limited results under holistic logical reasoning evaluation, with Gemini-2.5-Pro-Thinking scoring only 60.19%, followed by Seed1.5-VL-Thinking (59.85%) and o4-mini (57.49%);

(2) Abductive reasoning remains a major bottleneck for current MLLMs. Closed-source models show an average gap of 5.38 points between deductive and abductive tasks, and this gap widens to 9.81 points among open-source models;

(3) Reasoning length scales with task difficulty; longer reasoning improves performance, but with diminishing returns and decreasing token efficiency. We hope MME-Reasoning serves as a foundation for advancing multimodal reasoning in MLLMs.
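A note on the terminology in finding (3): the exact definition of token efficiency is not reproduced here, but a natural reading, stated only as an assumption, is overall accuracy normalized by the average length of the generated reasoning:

```latex
% Assumed formulation, not necessarily the benchmark's exact metric:
% accuracy obtained per generated reasoning token.
\[
  \text{Token efficiency} = \frac{\text{Accuracy}}{\text{average number of reasoning tokens}}
\]
```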


Performance comparison between thinking and chat models on MME-Reasoning.

Leaderboard

Capability columns (Calculation through Causal Chain Analysis) and Reasoning Type columns (Deductive, Inductive, Abductive) report per-split accuracy (%); Acc. is overall accuracy.

| # | Model | Calculation | Planning & Exploring | Pattern Analysis | Spatial & Temporal | Causal Chain Analysis | Deductive | Inductive | Abductive | Acc. |
|---|-------|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro-T 🥇 | 68.0 | 64.4 | 53.7 | 52.1 | 90.3 | 64.0 | 51.7 | 62.8 | 60.2 |
| 2 | Seed1.5-VL-T 🥈 | 67.2 | 62.7 | 56.0 | 47.2 | 82.6 | 64.5 | 52.3 | 60.8 | 59.9 |
| 3 | o4-mini 🥉 | 63.1 | 58.3 | 57.2 | 50.4 | 59.0 | 60.6 | 51.4 | 59.0 | 57.5 |
| 4 | Seed1.5-VL | 52.0 | 42.0 | 38.4 | 44.0 | 72.9 | 54.9 | 45.0 | 41.0 | 47.5 |
| 5 | Claude-4-Sonnet-T | 33.3 | 35.9 | 33.0 | 36.2 | 47.9 | 39.4 | 32.0 | 35.7 | 36.1 |
| 6 | VL-Rethinker-72B | 33.6 | 28.4 | 31.4 | 37.2 | 59.7 | 39.0 | 36.0 | 31.9 | 35.8 |
| 7 | QvQ-72B-Preview | 37.4 | 27.1 | 28.8 | 35.8 | 57.6 | 41.6 | 33.5 | 29.1 | 35.2 |
| 8 | Claude-3.7-Sonnet-T | 30.4 | 27.6 | 32.3 | 38.3 | 46.5 | 34.6 | 36.2 | 31.7 | 34.1 |
| 9 | Qwen2.5-VL-72B | 31.7 | 25.1 | 27.2 | 37.9 | 53.5 | 39.0 | 32.3 | 29.9 | 34.1 |
| 10 | Claude-3.7-Sonnet | 29.0 | 24.6 | 32.8 | 35.5 | 46.5 | 35.7 | 38.7 | 26.1 | 33.3 |
| 11 | Qwen2.5-VL-32B | 32.2 | 26.8 | 24.4 | 39.0 | 52.1 | 40.5 | 27.5 | 29.6 | 33.2 |
| 12 | InternVL3-78B | 26.0 | 24.0 | 26.5 | 41.8 | 50.0 | 35.1 | 33.8 | 27.1 | 32.1 |
| 13 | Virgo-72B | 30.4 | 22.9 | 26.1 | 36.2 | 47.2 | 37.7 | 32.6 | 24.4 | 31.8 |
| 14 | MM-Eureka-Qwen-32B | 23.0 | 25.7 | 25.6 | 36.2 | 50.7 | 32.9 | 30.5 | 28.1 | 30.6 |
| 15 | GPT-4o | 21.4 | 22.1 | 30.5 | 38.6 | 36.8 | 29.0 | 34.7 | 27.9 | 30.2 |
| 16 | VL-Rethinker-7B | 24.7 | 17.7 | 23.5 | 39.4 | 42.4 | 34.4 | 29.9 | 22.9 | 29.3 |
| 17 | InternVL3-38B | 23.0 | 18.5 | 23.0 | 38.3 | 41.7 | 33.5 | 29.0 | 22.1 | 28.4 |
| 18 | MM-Eureka-Qwen-7B | 27.1 | 19.3 | 22.3 | 31.9 | 50.0 | 32.7 | 28.7 | 22.6 | 28.2 |
| 19 | Qwen2-VL-72B | 19.2 | 19.3 | 24.9 | 36.2 | 44.4 | 28.8 | 32.3 | 22.1 | 27.5 |
| 20 | LMM-R1-MGT-PerceReason | 22.2 | 16.0 | 23.7 | 37.9 | 34.0 | 30.3 | 32.3 | 20.1 | 27.4 |
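As a quick check on finding (2), the per-model gap between the Deductive and Abductive columns can be read directly from the table; the minimal Python sketch below recomputes it for a few rows, with scores copied verbatim from the leaderboard.

```python
# Recompute the deductive-vs-abductive gap for a few leaderboard rows.
# Scores are copied from the table above (per-type accuracy, %).
scores = {
    # model: (deductive, abductive)
    "Gemini-2.5-Pro-T": (64.0, 62.8),
    "Seed1.5-VL-T": (64.5, 60.8),
    "Qwen2.5-VL-72B": (39.0, 29.9),
    "InternVL3-78B": (35.1, 27.1),
}

for model, (deductive, abductive) in scores.items():
    print(f"{model:18s} gap = {deductive - abductive:+.1f}")
```

The top closed-source thinking models show gaps of only a few points, while the open-source rows shown here sit much closer to the 9.81-point open-source average cited in finding (2).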

MME-Reasoning Dataset

Overview


The overall construction process of MME-Reasoning.

MME-Reasoning consists of 1,188 questions, including 1,008 newly collected items. MME-Reasoning comprehensively covers three types of reasoning (inductive, deductive, and abductive) and includes three question types (multiple-choice, free-form, and rule-based). We further divide MME-Reasoning into three difficulty levels (easy, medium, and hard) and summarize five types of abilities that it tests: pattern analysis, planning and exploring, spatial and temporal, calculation, and causal chain analysis. The key statistics and construction pipeline of MME-Reasoning are shown below:

[Figure: key statistics and construction pipeline of MME-Reasoning]
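For concreteness, one way to organize a single benchmark item along the dimensions described above is sketched below; the field names and example values are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for one MME-Reasoning item, based on the
# dimensions described above; the actual fields in the released dataset
# may differ.
@dataclass
class MMEReasoningItem:
    question: str
    image_path: str
    answer: str
    reasoning_type: str    # "inductive" | "deductive" | "abductive"
    question_type: str     # "multiple-choice" | "free-form" | "rule-based"
    difficulty: str        # "easy" | "medium" | "hard"
    capability: str        # e.g. "pattern analysis", "causal chain analysis"
    choices: Optional[list[str]] = None  # only for multiple-choice items

# Example instance (contents are made up for illustration).
item = MMEReasoningItem(
    question="Which figure completes the pattern?",
    image_path="images/0001.png",
    answer="C",
    reasoning_type="inductive",
    question_type="multiple-choice",
    difficulty="medium",
    capability="pattern analysis",
    choices=["A", "B", "C", "D"],
)
```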

For rule-based questions, we first use GPT to extract answers and convert them into an intermediate format, which is then judged by task-specific scripts (a minimal sketch follows the figure below).


Evaluation of rule-based questions.
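The two-stage protocol described above can be sketched as follows; the extraction prompt, intermediate JSON format, and example rule are assumptions made for illustration, not the benchmark's actual implementation.

```python
import json

# Minimal sketch of the two-stage rule-based judging described above:
# (1) an LLM (e.g. GPT) extracts the model's final answer into an
#     intermediate, machine-readable format, and (2) a deterministic
#     task-specific script checks it against the rule.
EXTRACTION_PROMPT = (
    "Extract the final answer from the response below and return it as JSON "
    'of the form {{"answer": ...}} with no extra text.\n\nResponse:\n{response}'
)

def extract_answer(response: str, llm_call) -> dict:
    """Stage 1: convert a free-form response into the intermediate format."""
    raw = llm_call(EXTRACTION_PROMPT.format(response=response))
    return json.loads(raw)

def judge(intermediate: dict, rule) -> bool:
    """Stage 2: a task-specific script verifies the intermediate answer."""
    return rule(intermediate["answer"])

# Toy example: a rule-based item whose answer must be a permutation of 1..4.
rule = lambda ans: sorted(ans) == [1, 2, 3, 4]
fake_llm = lambda prompt: '{"answer": [2, 4, 1, 3]}'  # stand-in for a GPT call

intermediate = extract_answer("...so the final row is 2, 4, 1, 3.", fake_llm)
print(judge(intermediate, rule))  # True
```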

Experiment Results

Visualization Examples