Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3D spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework for comprehensively evaluating 6D spatial reasoning across varying complexities.
To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: **multiple objects**, **2D locations**, **3D locations**, and **3D orientations**.
We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks.
We evaluate various large multimodal models (LMMs) on Spatial457 and observe a general decline in performance as task complexity increases, particularly on 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), which highlights key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
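RPDR measures how much accuracy a model loses as tasks get harder. A minimal sketch, assuming RPDR is defined as the relative accuracy drop from an easier level to a harder one (this exact formula and the function name are our assumption, not the paper's released code):

```python
def rpdr(acc_easy: float, acc_hard: float) -> float:
    """Relative Performance Dropping Rate: fraction of accuracy lost
    when moving from an easier task level to a harder one.
    Assumed definition: (acc_easy - acc_hard) / acc_easy."""
    if acc_easy == 0:
        raise ValueError("easier-level accuracy must be nonzero")
    return (acc_easy - acc_hard) / acc_easy

# Example with GPT-4o accuracies from the results table
# (single-object vs. 6D spatial):
drop = rpdr(74.46, 37.01)
print(f"{drop:.1%}")  # prints "50.3%"
```

Under this definition, a value near 0 means performance is stable across levels, while a value near 1 means the capability collapses on the harder task.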
We first define four basic capabilities that a model requires for 3D spatial reasoning: handling multiple objects and reasoning about 2D locations, 3D locations, and 3D orientations. To systematically evaluate these capabilities in large multimodal models (LMMs), we propose a five-level difficulty roadmap that progressively incorporates them, allowing a fine-grained analysis of model performance as task complexity increases.
This cascading structure reveals distinct failure points as tasks increase in complexity. Examples for each level are shown below.
**Q:** What color is the double bus? **A:** Cyan.

**Q:** Is there another object of the same color as the double bus? **A:** Yes.

**Q:** There is a cyan object to the left of the chopper; what is its shape? **A:** Double bus.

**Q:** What shape is the cyan object parallel to the brown one? **A:** Double bus.

**Q:** What size is the thing occluded by the double bus? **A:** Small.

**Q:** Is there a double bus to the left of the yellow object? **A:** No.

**Q:** What color is the object the double bus will collide with if it moves backward? **A:** Yellow.
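Because answers in Spatial457 are short strings like those above, exact-match accuracy per question type is sufficient for scoring. A minimal sketch of such a scorer (the record field names are our assumption, not the official toolkit's schema):

```python
from collections import defaultdict

def accuracy_by_type(records):
    """records: iterable of dicts with hypothetical keys
    'qtype', 'prediction', and 'answer'.
    Returns exact-match accuracy per question type."""
    def normalize(s):
        # Lowercase and drop trailing periods so "Cyan." matches "cyan".
        return s.strip().lower().rstrip(".")

    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["qtype"]] += 1
        if normalize(r["prediction"]) == normalize(r["answer"]):
            correct[r["qtype"]] += 1
    return {t: correct[t] / total[t] for t in total}

records = [
    {"qtype": "single_object", "prediction": "cyan", "answer": "Cyan."},
    {"qtype": "collision", "prediction": "red", "answer": "Yellow"},
]
print(accuracy_by_type(records))  # {'single_object': 1.0, 'collision': 0.0}
```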
We evaluate a range of models on all 7 question types across 5 difficulty levels. As the complexity increases—from single-object perception to 6D spatial reasoning and collision prediction—model performance generally drops, highlighting challenges in multi-object understanding, 3D orientation, and predictive spatial reasoning.
Model | Single Object | Multi-Obj. | 2D Spatial | Occlusion | 3D Pose | Collisions | 6D Spatial |
---|---|---|---|---|---|---|---|
Random | 33.05 | 32.77 | 33.47 | 22.04 | 18.99 | 21.02 | 19.41 |
GPT-4o | 74.46 | 62.88 | 56.14 | 48.40 | 42.41 | 38.41 | 37.01 |
Gemini-Pro 1.5 | 73.26 | 62.54 | 54.49 | 47.65 | 43.67 | 41.19 | 39.36 |
Claude 3.5 Sonnet | 68.24 | 57.40 | 54.19 | 30.84 | 38.40 | 35.34 | 33.48 |
Qwen2-VL-7B-Instruct | 71.96 | 61.44 | 55.34 | 27.87 | 34.29 | 36.58 | 33.75 |
InternVL2-8B | 58.11 | 58.76 | 57.40 | 33.25 | 32.32 | 34.30 | 34.30 |
LLaVA-v1.5-7B | 44.87 | 44.72 | 42.20 | 24.34 | 24.55 | 23.63 | 23.86 |
LLaVA-NeXT-vicuna-7B | 50.72 | 49.47 | 46.01 | 29.68 | 29.35 | 31.95 | 31.95 |
LLaVA-NeXT-llama3-8B | 52.15 | 49.73 | 45.92 | 30.31 | 29.77 | 32.12 | 32.12 |
PO3D-VQA | 86.46 | 82.55 | 80.64 | 70.49 | 81.40 | 68.12 | 71.06 |
Human | 89.97 | 86.83 | 84.95 | 82.76 | 84.95 | 81.82 | 79.94 |
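The hardest column, 6D Spatial, makes the remaining gap to human performance concrete. A quick sketch that tabulates this gap for a few rows of the table above (the accuracy values are copied from the table; the variable names are ours):

```python
# Accuracies (%) on the 6D Spatial column of the table above.
six_d = {
    "GPT-4o": 37.01,
    "Gemini-Pro 1.5": 39.36,
    "Claude 3.5 Sonnet": 33.48,
    "PO3D-VQA": 71.06,
    "Human": 79.94,
}

human = six_d["Human"]
gaps = {m: round(human - a, 2) for m, a in six_d.items() if m != "Human"}
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:20s} trails humans by {gap:5.2f} points on 6D spatial reasoning")
```

Even the strongest LMMs trail humans by roughly 40 points on this task, while the 3D-aware PO3D-VQA closes most of that gap.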
The full Spatial457 dataset and evaluation toolkit are publicly available. If you use them, please cite:
@inproceedings{wang2025spatial457,
  title     = {Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models},
  author    = {Wang, Xingrui and Ma, Wufei and Zhang, Tiezheng and de Melo, Celso M and Chen, Jieneng and Yuille, Alan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.08636}
}