Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

Johns Hopkins University, DEVCOM Army Research Laboratory
CVPR 2025 (Highlight, top 13.5% of accepted papers)
Teaser figure: We propose Spatial457 for spatial reasoning with 4 key capabilities. It evaluates models through 5 difficulty levels and 7 question types, from simple object recognition to complex 6D spatial reasoning tasks. We benchmark SoTA models and reveal a gap between current models and human reasoning in spatial VQA.

Abstract

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities.

To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed with 4 key capabilities for spatial reasoning: Multiple Objects, 2D Locations, 3D Locations, and 3D Orientations.

We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks.

We evaluate various large multimodal models (LMMs) on Spatial457 and observe a general decline in performance as task complexity increases, particularly on 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), which highlights key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.

Benchmark Design

4 Key Capabilities for 3D Spatial Reasoning

We first define 4 basic capabilities that a model requires for 3D spatial reasoning.

  • Multiple Objects: The foundation of spatial reasoning is recognizing multiple objects and understanding their relationships. Tasks include comparing attributes and counting, such as verifying how many objects share the same color.
  • 2D Locations: Reasoning within the 2D camera plane by identifying relative positions such as left, right, front, or back. These are basic but crucial cues, especially for grounding object relationships in a single-view image.
  • 3D Locations: Understanding object depth and position in 3D space, which enables reasoning about occlusion, distance, and the spatial hierarchy between objects beyond the image plane.
  • 3D Orientations: Inferring object rotation, facing direction, and alignment. Tasks test whether a model can determine if an object faces forward, backward, or parallel to another from the object's own perspective, not just from the camera view (a minimal geometric sketch follows this list).
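To make the 3D orientation capability concrete, the following is a minimal geometric sketch, in Python, of deciding whether two objects' facing directions are parallel given their yaw angles. It is illustrative only and not code from the Spatial457 toolkit; in particular, treating opposite-facing objects as parallel is an assumed convention here.

```python
import math

# Illustrative sketch (not from the dataset toolkit): deciding whether two objects'
# facing directions are parallel, given their yaw angles in the world frame.

def facing_vector(yaw_deg: float) -> tuple[float, float]:
    """Unit vector of an object's facing direction in the ground plane."""
    rad = math.radians(yaw_deg)
    return (math.cos(rad), math.sin(rad))

def are_parallel(yaw_a: float, yaw_b: float, tol_deg: float = 10.0) -> bool:
    """True if the two facing directions are parallel (same or opposite), within a tolerance."""
    ax, ay = facing_vector(yaw_a)
    bx, by = facing_vector(yaw_b)
    cos_angle = abs(ax * bx + ay * by)  # |cos| is 1 for parallel or antiparallel directions
    return cos_angle >= math.cos(math.radians(tol_deg))

print(are_parallel(0.0, 180.0))  # True: opposite facing directions, still parallel
print(are_parallel(0.0, 90.0))   # False: perpendicular
```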

5 Difficulty Levels and 7 Subsets

To systematically evaluate spatial reasoning in large multimodal models (LMMs), we propose a five-level difficulty roadmap that progressively incorporates our four core spatial capabilities—multiple objects, 2D location, 3D location, and 3D orientation. This design allows for a fine-grained analysis of model performance as task complexity increases.

  • Level 1–3 (Basic Spatial Reasoning): These levels begin with fundamental perception tasks. L1 involves single-object recognition, L2 introduces multi-object comparisons (e.g., color matching), and L3 tests 2D spatial relationships within the image plane.
  • Level 4 (3D Pose and Occlusion): These subsets assess a model’s ability to reason about 3D orientation (e.g., object alignment and rotation) and 3D location (e.g., identifying occluded objects), requiring depth-aware perception.
  • Level 5 (Advanced 6D Reasoning): The most complex level involves full 6D spatial reasoning—including both 3D location and orientation—and physical interaction tasks such as 6D spatial relationship and collision prediction. These questions demand comprehensive scene understanding and future state inference.

This cascading structure reveals distinct failure points as tasks increase in complexity. Examples for each level are shown below.

Example Questions

Example questions across the five difficulty levels:

  • L1 (Single Object). Q: What color is the double bus? A: Cyan.
  • L2 (Multiple Objects). Q: Is there another object of the same color as the double bus? A: Yes.
  • L3 (2D Spatial Relationship). Q: There is a cyan object to the left of the chopper; what is its shape? A: Double bus.
  • L4 (Orientation). Q: What shape is the cyan object parallel to the brown one? A: Double bus.
  • L4 (Occlusion). Q: What size is the thing occluded by the double bus? A: Small.
  • L5 (6D Spatial Relationship). Q: Is there a double bus to the left of the yellow object? A: No.
  • L5 (Collision). Q: What color is the object the double bus will collide with if it moves backward? A: Yellow.

Benchmark Features

  • Controllable generation: All scenes are rendered by a fully programmatic engine, enabling fine-grained control over object placement, shape, and color to support targeted evaluation of reasoning skills.
  • Rich 3D annotations: Every sample includes precise object metadata, such as 2D bounding boxes, 2D instance masks, world-space 3D locations, and 3D orientations.
  • Compositional reasoning templates: As a neural-symbolic benchmark, each question is grounded in a step-by-step reasoning program that guides the model toward the correct answer (a hypothetical sketch is shown below).
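To illustrate what such a reasoning program could look like, here is a hypothetical sketch for the L3 example above ("There is a cyan object to the left of the chopper; what is its shape?"). The program steps, function names, and scene fields are illustrative assumptions rather than the dataset's actual schema.

```python
# Hypothetical sketch of a step-by-step reasoning program (names and fields are
# illustrative, not the actual Spatial457 schema).

# A toy scene: each object record carries attributes and a 2D position.
scene = [
    {"shape": "double bus", "color": "cyan",   "x": 120},
    {"shape": "chopper",    "color": "brown",  "x": 340},
    {"shape": "sedan",      "color": "yellow", "x": 500},
]

# Program for: "There is a cyan object to the left of the chopper; what is its shape?"
program = [
    ("filter_shape", "chopper"),  # locate the anchor object
    ("relate_left",  None),       # objects to its left in the image plane
    ("filter_color", "cyan"),     # keep only cyan candidates
    ("query_shape",  None),       # read out the answer
]

def execute(program, scene):
    """Run each program step over the scene, narrowing the candidate set."""
    candidates = scene
    for op, arg in program:
        if op == "filter_shape":
            candidates = [o for o in candidates if o["shape"] == arg]
        elif op == "filter_color":
            candidates = [o for o in candidates if o["color"] == arg]
        elif op == "relate_left":
            anchor_x = candidates[0]["x"]  # assume a unique anchor object
            candidates = [o for o in scene if o["x"] < anchor_x]
        elif op == "query_shape":
            return candidates[0]["shape"]
    return candidates

print(execute(program, scene))  # -> "double bus"
```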

Benchmark Results

We evaluate a range of models on all 7 question types across 5 difficulty levels. As the complexity increases—from single-object perception to 6D spatial reasoning and collision prediction—model performance generally drops, highlighting challenges in multi-object understanding, 3D orientation, and predictive spatial reasoning.

Accuracy (%) on each of the 7 question types:

| Model | Single Object | Multi-Obj. | 2D Spatial | Occlusion | 3D Pose | Collisions | 6D Spatial |
|---|---|---|---|---|---|---|---|
| Random | 33.05 | 32.77 | 33.47 | 22.04 | 18.99 | 21.02 | 19.41 |
| GPT-4o | 74.46 | 62.88 | 56.14 | 48.40 | 42.41 | 38.41 | 37.01 |
| Gemini-Pro 1.5 | 73.26 | 62.54 | 54.49 | 47.65 | 43.67 | 41.19 | 39.36 |
| Claude 3.5 Sonnet | 68.24 | 57.40 | 54.19 | 30.84 | 38.40 | 35.34 | 33.48 |
| Qwen2-VL-7B-Instruct | 71.96 | 61.44 | 55.34 | 27.87 | 34.29 | 36.58 | 33.75 |
| InternVL2-8B | 58.11 | 58.76 | 57.40 | 33.25 | 32.32 | 34.30 | 34.30 |
| LLaVA-v1.5-7B | 44.87 | 44.72 | 42.20 | 24.34 | 24.55 | 23.63 | 23.86 |
| LLaVA-NeXT-vicuna-7B | 50.72 | 49.47 | 46.01 | 29.68 | 29.35 | 31.95 | 31.95 |
| LLaVA-NeXT-llama3-8B | 52.15 | 49.73 | 45.92 | 30.31 | 29.77 | 32.12 | 32.12 |
| PO3D-VQA | 86.46 | 82.55 | 80.64 | 70.49 | 81.40 | 68.12 | 71.06 |
| Human | 89.97 | 86.83 | 84.95 | 82.76 | 84.95 | 81.82 | 79.94 |
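As an illustration of how such drops can be quantified, the sketch below computes a relative performance drop between an easier and a harder question type from the table above. The column pairing and the formula here are assumptions for illustration; the paper's Relative Performance Dropping Rate (RPDR) is defined there and may be aggregated differently.

```python
# Illustrative sketch: relative performance drop between an easier and a harder task,
# computed as (acc_easy - acc_hard) / acc_easy. The column pairing below is an
# assumption for illustration; see the paper for the exact RPDR definition.

scores = {  # accuracy (%) from the results table above
    "GPT-4o":         {"Single Object": 74.46, "2D Spatial": 56.14, "6D Spatial": 37.01},
    "Gemini-Pro 1.5": {"Single Object": 73.26, "2D Spatial": 54.49, "6D Spatial": 39.36},
    "PO3D-VQA":       {"Single Object": 86.46, "2D Spatial": 80.64, "6D Spatial": 71.06},
}

def relative_drop(acc_easy: float, acc_hard: float) -> float:
    """Fraction of performance lost when moving from the easier to the harder task."""
    return (acc_easy - acc_hard) / acc_easy

for model, s in scores.items():
    drop_2d = relative_drop(s["Single Object"], s["2D Spatial"])
    drop_6d = relative_drop(s["Single Object"], s["6D Spatial"])
    print(f"{model}: drop to 2D Spatial = {drop_2d:.1%}, drop to 6D Spatial = {drop_6d:.1%}")
```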

Download

You can access the full Spatial457 dataset and evaluation toolkit via the following links:
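Once the data are downloaded, iterating over the question annotations might look like the sketch below. The file name and field names are purely illustrative assumptions; see the released evaluation toolkit for the actual format.

```python
import json

# Hypothetical example: iterate over downloaded question annotations.
# The file name and field names below are illustrative assumptions, not the actual
# Spatial457 release format; consult the official toolkit for the real schema.
with open("spatial457_questions.json") as f:
    questions = json.load(f)

for q in questions[:5]:
    print(q.get("level"), q.get("question"), "->", q.get("answer"))
```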

BibTeX

@inproceedings{wang2025spatial457,
  title     = {Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models},
  author    = {Wang, Xingrui and Ma, Wufei and Zhang, Tiezheng and de Melo, Celso M and Chen, Jieneng and Yuille, Alan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.08636}
}