XModBench — teaser

For Omni-modality Language Models

Which represents a dog?

model sees →

🐶 Dog ✓

model hears →

“a person talking” ✗

🐶≠🔊

A model that knows a dog by sight can’t hear one.

XModBench probes 6 cross-modal directions

Audio · Vision (Image ∪ Video) · Text — every ordered pair
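
As a quick check of the count, the six directions are simply the ordered pairs of distinct modalities among the three listed. A minimal sketch in Python; the variable names are illustrative and not taken from the benchmark's code, and the arrow notation follows the A→T / T→A labels used in this teaser:

```python
from itertools import permutations

# The three modalities named in the teaser (Vision covers image and video).
MODALITIES = ["Audio", "Vision", "Text"]

# Every ordered pair of distinct modalities, written first→second.
directions = [f"{a}→{b}" for a, b in permutations(MODALITIES, 2)]

print(len(directions))
print(directions)
```

Because the pairs are ordered, Audio→Text and Text→Audio count as separate directions, which is what lets the two be compared against each other.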

5 broad task families · 17 subtasks · 61,320 samples

Perception: what is it?

Spatial: where / direction?

Temporal: order & counting

Linguistic: speech & translation

Knowledge: music, movie, emotion

But spatial & temporal reasoning collapses

Perception 79.7 · Spatial 35.2 · Temporal 41.4 · Linguistic 82.5 · Knowledge 77.4

spatial 35.2 / temporal 41.4 vs linguistic 82.5 — accuracy %, Qwen3-Omni on XModBench-Lite
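
To make the size of the drop concrete, the gap between each task family and the best-scoring one can be computed directly from the five accuracies reported here. A small sketch, assuming only the numbers printed in this teaser; the dictionary and variable names are illustrative:

```python
# Accuracy (%) per task family, copied from the teaser
# (Qwen3-Omni on XModBench-Lite).
accuracy = {
    "Perception": 79.7,
    "Spatial": 35.2,
    "Temporal": 41.4,
    "Linguistic": 82.5,
    "Knowledge": 77.4,
}

# Family with the highest accuracy, used as the reference point.
best = max(accuracy, key=accuracy.get)

# Gap in percentage points between the best family and each family.
gaps = {name: round(accuracy[best] - acc, 1) for name, acc in accuracy.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below {best}")
```

On these figures, spatial trails linguistic by 47.3 points and temporal by 41.1, while perception and knowledge stay within about 5 points of it.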

The same knowledge, unequal across modalities

Audio: low · Text: high

A→T

T→A

modality disparity & directional imbalance

XModBench

Cross-Modal Consistency

ICLR 2026

xingruiwang.github.io/projects/XModBench