Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3D spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework for comprehensively evaluating 6D spatial reasoning across varying complexities.
To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: **multiple objects**, **2D locations**, **3D locations**, and **3D orientations**.
We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks.
We evaluate various large multimodal models (LMMs) on Spatial457 and observe a general decline in performance as task complexity increases, particularly on 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), which highlights key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
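RPDR measures how much accuracy a model loses as tasks get harder. A minimal sketch, assuming RPDR is defined as the relative accuracy drop from an easier level to a harder one (this exact formula and the function name are our assumption, not the paper's released code):

```python
def rpdr(acc_easy: float, acc_hard: float) -> float:
    """Relative Performance Dropping Rate: fraction of accuracy lost
    when moving from an easier task level to a harder one.
    Assumed definition: (acc_easy - acc_hard) / acc_easy."""
    if acc_easy == 0:
        raise ValueError("easier-level accuracy must be nonzero")
    return (acc_easy - acc_hard) / acc_easy

# Example with GPT-4o accuracies from the results table
# (single-object vs. 6D spatial):
drop = rpdr(74.46, 37.01)
print(f"{drop:.1%}")  # prints "50.3%"
```

Under this definition, a value near 0 means performance is stable across levels, while a value near 1 means the capability collapses on the harder task.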
We first define four basic capabilities that a model requires for 3D spatial reasoning: handling multiple objects and reasoning about 2D locations, 3D locations, and 3D orientations. To systematically evaluate these capabilities in large multimodal models (LMMs), we propose a five-level difficulty roadmap that progressively incorporates them, allowing a fine-grained analysis of model performance as task complexity increases.
This cascading structure reveals distinct failure points as tasks increase in complexity. Examples for each level are shown below.
**Q:** What color is the double bus? **A:** Cyan.

**Q:** Is there another object of the same color as the double bus? **A:** Yes.

**Q:** There is a cyan object to the left of the chopper; what is its shape? **A:** Double bus.

**Q:** What shape is the cyan object parallel to the brown one? **A:** Double bus.

**Q:** What size is the thing occluded by the double bus? **A:** Small.

**Q:** Is there a double bus to the left of the yellow object? **A:** No.

**Q:** What color is the object the double bus will collide with if it moves backward? **A:** Yellow.
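Because answers in Spatial457 are short strings like those above, exact-match accuracy per question type is sufficient for scoring. A minimal sketch of such a scorer (the record field names are our assumption, not the official toolkit's schema):

```python
from collections import defaultdict

def accuracy_by_type(records):
    """records: iterable of dicts with hypothetical keys
    'qtype', 'prediction', and 'answer'.
    Returns exact-match accuracy per question type."""
    def normalize(s):
        # Lowercase and drop trailing periods so "Cyan." matches "cyan".
        return s.strip().lower().rstrip(".")

    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["qtype"]] += 1
        if normalize(r["prediction"]) == normalize(r["answer"]):
            correct[r["qtype"]] += 1
    return {t: correct[t] / total[t] for t in total}

records = [
    {"qtype": "single_object", "prediction": "cyan", "answer": "Cyan."},
    {"qtype": "collision", "prediction": "red", "answer": "Yellow"},
]
print(accuracy_by_type(records))  # {'single_object': 1.0, 'collision': 0.0}
```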
We evaluate a range of models on all 7 question types across 5 difficulty levels. As the complexity increases—from single-object perception to 6D spatial reasoning and collision prediction—model performance generally drops, highlighting challenges in multi-object understanding, 3D orientation, and predictive spatial reasoning.
Model | Single Object | Multi-Obj. | 2D Spatial | Occlusion | 3D Pose | Collisions | 6D Spatial |
---|---|---|---|---|---|---|---|
Random | 33.05 | 32.77 | 33.47 | 22.04 | 18.99 | 21.02 | 19.41 |
GPT-4o | 74.46 | 62.88 | 56.14 | 48.40 | 42.41 | 38.41 | 37.01 |
Gemini-Pro 1.5 | 73.26 | 62.54 | 54.49 | 47.65 | 43.67 | 41.19 | 39.36 |
Claude 3.5 Sonnet | 68.24 | 57.40 | 54.19 | 30.84 | 38.40 | 35.34 | 33.48 |
Qwen2-VL-7B-Instruct | 71.96 | 61.44 | 55.34 | 27.87 | 34.29 | 36.58 | 33.75 |
InternVL2-8B | 58.11 | 58.76 | 57.40 | 33.25 | 32.32 | 34.30 | 34.30 |
LLaVA-v1.5-7B | 44.87 | 44.72 | 42.20 | 24.34 | 24.55 | 23.63 | 23.86 |
LLaVA-NeXT-vicuna-7B | 50.72 | 49.47 | 46.01 | 29.68 | 29.35 | 31.95 | 31.95 |
LLaVA-NeXT-llama3-8B | 52.15 | 49.73 | 45.92 | 30.31 | 29.77 | 32.12 | 32.12 |
PO3D-VQA | 86.46 | 82.55 | 80.64 | 70.49 | 81.40 | 68.12 | 71.06 |
Human | 89.97 | 86.83 | 84.95 | 82.76 | 84.95 | 81.82 | 79.94 |
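The hardest column, 6D Spatial, makes the remaining gap to human performance concrete. A quick sketch that tabulates this gap for a few rows of the table above (the accuracy values are copied from the table; the variable names are ours):

```python
# Accuracies (%) on the 6D Spatial column of the table above.
six_d = {
    "GPT-4o": 37.01,
    "Gemini-Pro 1.5": 39.36,
    "Claude 3.5 Sonnet": 33.48,
    "PO3D-VQA": 71.06,
    "Human": 79.94,
}

human = six_d["Human"]
gaps = {m: round(human - a, 2) for m, a in six_d.items() if m != "Human"}
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:20s} trails humans by {gap:5.2f} points on 6D spatial reasoning")
```

Even the strongest LMMs trail humans by roughly 40 points on this task, while the 3D-aware PO3D-VQA closes most of that gap.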
The full Spatial457 dataset and evaluation toolkit are publicly available. If you use them, please cite:
@inproceedings{wang2025spatial457,
  title     = {Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models},
  author    = {Wang, Xingrui and Ma, Wufei and Zhang, Tiezheng and de Melo, Celso M and Chen, Jieneng and Yuille, Alan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.08636}
}