For Omni-modality Language Models
Which represents a dog?
model sees → 🐶 Dog ✓
model hears → “a person talking” ✗
🐶 ≠ 🔊
A model that knows a dog by sight can’t hear one.
XModBench probes 6 cross-modal directions
A→V · V→A · A→T · T→A · V→T · T→V
🔊 Audio · 👁 Vision (Image ∪ Video) · T Text: every ordered pair
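The six directions are simply every ordered pair of the three modalities. A minimal sketch of that enumeration, assuming the single-letter codes A, V, and T used in the direction labels above; the variable names are illustrative and not part of XModBench:

```python
from itertools import permutations

# Single-letter modality codes, as used in labels such as A→V or T→A.
MODALITIES = {"A": "Audio", "V": "Vision", "T": "Text"}

# Every ordered pair of distinct modalities: 3 * 2 = 6 directions.
directions = [f"{src}→{dst}" for src, dst in permutations(MODALITIES, 2)]

print(len(directions))
print(directions)
```

Ordered pairs matter here because A→T and T→A are treated as separate directions rather than one symmetric pairing.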
5 broad task families · 17 subtasks · 61,320 samples
Perception: what is it?
Spatial: where / direction?
Temporal: order & counting
Linguistic: speech & translation
Knowledge: music, movie, emotion
But spatial & temporal reasoning collapses
Perception 79.7 · Spatial 35.2 · Temporal 41.4 · Linguistic 82.5 · Knowledge 77.4
spatial 35.2 / temporal 41.4 vs linguistic 82.5 — accuracy %, Qwen3-Omni on XModBench-Lite
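To make the size of the collapse concrete, the gaps can be computed directly from the five accuracies quoted above. A minimal sketch; the dictionary and variable names are illustrative, and the only data used are the Qwen3-Omni XModBench-Lite figures already shown:

```python
# Accuracy (%) per task family, as quoted above for Qwen3-Omni on XModBench-Lite.
accuracy = {
    "Perception": 79.7,
    "Spatial": 35.2,
    "Temporal": 41.4,
    "Linguistic": 82.5,
    "Knowledge": 77.4,
}

# The strongest task family serves as the reference point.
best = max(accuracy, key=accuracy.get)

# Gap in percentage points between the strongest family and each other family.
gaps = {
    name: round(accuracy[best] - acc, 1)
    for name, acc in accuracy.items()
    if name != best
}

# Print the largest gaps first.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {accuracy[name]:5.1f}  ({gap:.1f} points below {best})")
```

With these figures, spatial trails linguistic by 47.3 points and temporal by 41.1, while perception and knowledge stay within about 5 points of it.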
The same knowledge, unequal across modalities
Audio: low · Text: high
A→T · T→A
modality disparity & directional imbalance
XModBench
Cross-Modal Consistency
ICLR 2026
xingruiwang.github.io/projects/XModBench