MME-Reasoning

A Comprehensive Benchmark for Logical Reasoning in MLLMs

1FDU, 2CUHK MMLab, 3Shanghai AI Laboratory, 4USTC, 5NJU
*Equal contribution, Corresponding author

Introduction

Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Existing benchmarks fail to comprehensively evaluate the reasoning abilities of MLLMs because they lack an explicit categorization of logical reasoning types and rest on an unclear understanding of reasoning.

In this paper, we introduce MME-Reasoning, a comprehensive benchmark specifically designed to evaluate the reasoning capability of MLLMs. MME-Reasoning consists of 1,188 carefully curated questions that systematically cover the three types of logical reasoning (inductive, deductive, and abductive) while spanning a range of difficulty levels.

Experiments were conducted on state-of-the-art MLLMs, covering both Chat and Thinking model types across open-source and closed-source families. Evaluations with MME-Reasoning reveal the following key findings:

(1) MLLMs exhibit significant limitations and pronounced imbalances in reasoning capabilities. Even the most advanced MLLMs achieve only limited results under holistic logical reasoning evaluation, with Gemini-2.5-Pro-Thinking scoring only 60.19%, followed by Seed1.5-VL-Thinking (59.85%) and o4-mini (57.49%);

(2) Abductive reasoning remains a major bottleneck for current MLLMs. Closed-source models show an average gap of 5.38 points between deductive and abductive tasks, and this gap widens to 9.81 points among open-source models;

(3) Reasoning length scales with task difficulty; longer reasoning improves performance, but with diminishing returns and decreasing token efficiency. We hope MME-Reasoning serves as a foundation for advancing multimodal reasoning in MLLMs.
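A note on the terminology in finding (3): the exact definition of token efficiency is not reproduced here, but a natural reading, stated only as an assumption, is overall accuracy normalized by the average length of the generated reasoning:

```latex
% Assumed formulation, not necessarily the benchmark's exact metric:
% accuracy obtained per generated reasoning token.
\[
  \text{Token efficiency} = \frac{\text{Accuracy}}{\text{average number of reasoning tokens}}
\]
```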


Performance comparison between thinking and chat models on MME-Reasoning.

Leaderboard

Capability columns (Calculation through Causal Chain Analysis) and Reasoning Type columns (Deductive, Inductive, Abductive) report per-split accuracy (%); Acc. is overall accuracy.

| # | Model | Calculation | Planning & Exploring | Pattern Analysis | Spatial & Temporal | Causal Chain Analysis | Deductive | Inductive | Abductive | Acc. |
|---|-------|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro-T 🥇 | 68.0 | 64.4 | 53.7 | 52.1 | 90.3 | 64.0 | 51.7 | 62.8 | 60.2 |
| 2 | Seed1.5-VL-T 🥈 | 67.2 | 62.7 | 56.0 | 47.2 | 82.6 | 64.5 | 52.3 | 60.8 | 59.9 |
| 3 | o4-mini 🥉 | 63.1 | 58.3 | 57.2 | 50.4 | 59.0 | 60.6 | 51.4 | 59.0 | 57.5 |
| 4 | Seed1.5-VL | 52.0 | 42.0 | 38.4 | 44.0 | 72.9 | 54.9 | 45.0 | 41.0 | 47.5 |
| 5 | Claude-4-Sonnet-T | 33.3 | 35.9 | 33.0 | 36.2 | 47.9 | 39.4 | 32.0 | 35.7 | 36.1 |
| 6 | VL-Rethinker-72B | 33.6 | 28.4 | 31.4 | 37.2 | 59.7 | 39.0 | 36.0 | 31.9 | 35.8 |
| 7 | QvQ-72B-Preview | 37.4 | 27.1 | 28.8 | 35.8 | 57.6 | 41.6 | 33.5 | 29.1 | 35.2 |
| 8 | Claude-3.7-Sonnet-T | 30.4 | 27.6 | 32.3 | 38.3 | 46.5 | 34.6 | 36.2 | 31.7 | 34.1 |
| 9 | Qwen2.5-VL-72B | 31.7 | 25.1 | 27.2 | 37.9 | 53.5 | 39.0 | 32.3 | 29.9 | 34.1 |
| 10 | Claude-3.7-Sonnet | 29.0 | 24.6 | 32.8 | 35.5 | 46.5 | 35.7 | 38.7 | 26.1 | 33.3 |
| 11 | Qwen2.5-VL-32B | 32.2 | 26.8 | 24.4 | 39.0 | 52.1 | 40.5 | 27.5 | 29.6 | 33.2 |
| 12 | InternVL3-78B | 26.0 | 24.0 | 26.5 | 41.8 | 50.0 | 35.1 | 33.8 | 27.1 | 32.1 |
| 13 | Virgo-72B | 30.4 | 22.9 | 26.1 | 36.2 | 47.2 | 37.7 | 32.6 | 24.4 | 31.8 |
| 14 | MM-Eureka-Qwen-32B | 23.0 | 25.7 | 25.6 | 36.2 | 50.7 | 32.9 | 30.5 | 28.1 | 30.6 |
| 15 | GPT-4o | 21.4 | 22.1 | 30.5 | 38.6 | 36.8 | 29.0 | 34.7 | 27.9 | 30.2 |
| 16 | VL-Rethinker-7B | 24.7 | 17.7 | 23.5 | 39.4 | 42.4 | 34.4 | 29.9 | 22.9 | 29.3 |
| 17 | InternVL3-38B | 23.0 | 18.5 | 23.0 | 38.3 | 41.7 | 33.5 | 29.0 | 22.1 | 28.4 |
| 18 | MM-Eureka-Qwen-7B | 27.1 | 19.3 | 22.3 | 31.9 | 50.0 | 32.7 | 28.7 | 22.6 | 28.2 |
| 19 | Qwen2-VL-72B | 19.2 | 19.3 | 24.9 | 36.2 | 44.4 | 28.8 | 32.3 | 22.1 | 27.5 |
| 20 | LMM-R1-MGT-PerceReason | 22.2 | 16.0 | 23.7 | 37.9 | 34.0 | 30.3 | 32.3 | 20.1 | 27.4 |
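As a quick check on finding (2), the per-model gap between the Deductive and Abductive columns can be read directly from the table; the minimal Python sketch below recomputes it for a few rows, with scores copied verbatim from the leaderboard.

```python
# Recompute the deductive-vs-abductive gap for a few leaderboard rows.
# Scores are copied from the table above (per-type accuracy, %).
scores = {
    # model: (deductive, abductive)
    "Gemini-2.5-Pro-T": (64.0, 62.8),
    "Seed1.5-VL-T": (64.5, 60.8),
    "Qwen2.5-VL-72B": (39.0, 29.9),
    "InternVL3-78B": (35.1, 27.1),
}

for model, (deductive, abductive) in scores.items():
    print(f"{model:18s} gap = {deductive - abductive:+.1f}")
```

The top closed-source thinking models show gaps of only a few points, while the open-source rows shown here sit much closer to the 9.81-point open-source average cited in finding (2).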

MME-Reasoning Dataset

Overview


The overall construction process of MME-Reasoning.

MME-Reasoning consists of 1,188 questions, including 1,008 newly collected items. MME-Reasoning comprehensively covers three types of reasoning (inductive, deductive, and abductive) and includes three question types (multiple-choice, free-form, and rule-based). We further divide MME-Reasoning into three difficulty levels (easy, medium, and hard) and summarize five types of abilities that it tests: pattern analysis, planning and exploring, spatial and temporal, calculation, and causal chain analysis. The key statistics and construction pipeline of MME-Reasoning are shown below:

[Figure: key statistics and construction pipeline of MME-Reasoning]
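For concreteness, one way to organize a single benchmark item along the dimensions described above is sketched below; the field names and example values are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for one MME-Reasoning item, based on the
# dimensions described above; the actual fields in the released dataset
# may differ.
@dataclass
class MMEReasoningItem:
    question: str
    image_path: str
    answer: str
    reasoning_type: str    # "inductive" | "deductive" | "abductive"
    question_type: str     # "multiple-choice" | "free-form" | "rule-based"
    difficulty: str        # "easy" | "medium" | "hard"
    capability: str        # e.g. "pattern analysis", "causal chain analysis"
    choices: Optional[list[str]] = None  # only for multiple-choice items

# Example instance (contents are made up for illustration).
item = MMEReasoningItem(
    question="Which figure completes the pattern?",
    image_path="images/0001.png",
    answer="C",
    reasoning_type="inductive",
    question_type="multiple-choice",
    difficulty="medium",
    capability="pattern analysis",
    choices=["A", "B", "C", "D"],
)
```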

For rule-based questions, we first use GPT to extract answers and convert them into an intermediate format, which is then judged by task-specific scripts (a minimal sketch follows the figure below).


Evaluation of rule-based questions.
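The two-stage protocol described above can be sketched as follows; the extraction prompt, intermediate JSON format, and example rule are assumptions made for illustration, not the benchmark's actual implementation.

```python
import json

# Minimal sketch of the two-stage rule-based judging described above:
# (1) an LLM (e.g. GPT) extracts the model's final answer into an
#     intermediate, machine-readable format, and (2) a deterministic
#     task-specific script checks it against the rule.
EXTRACTION_PROMPT = (
    "Extract the final answer from the response below and return it as JSON "
    'of the form {{"answer": ...}} with no extra text.\n\nResponse:\n{response}'
)

def extract_answer(response: str, llm_call) -> dict:
    """Stage 1: convert a free-form response into the intermediate format."""
    raw = llm_call(EXTRACTION_PROMPT.format(response=response))
    return json.loads(raw)

def judge(intermediate: dict, rule) -> bool:
    """Stage 2: a task-specific script verifies the intermediate answer."""
    return rule(intermediate["answer"])

# Toy example: a rule-based item whose answer must be a permutation of 1..4.
rule = lambda ans: sorted(ans) == [1, 2, 3, 4]
fake_llm = lambda prompt: '{"answer": [2, 4, 1, 3]}'  # stand-in for a GPT call

intermediate = extract_answer("...so the final row is 2, 4, 1, 3.", fake_llm)
print(judge(intermediate, rule))  # True
```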

Experiment Results

Visualization Examples