For Omni-modality Language Models
Which represents a dog?
dog
model sees  →  🐶  Dog  ✓
model hears  →  “a person talking”  ✗
🐶🔊
A model that knows a dog by sight can’t hear one.
XModBench probes 6 cross-modal directions
A→V · V→A · A→T · T→A · V→T · T→V
🔊 Audio · 👁 Vision · T Text
Audio · Vision (Image ∪ Video) · Text  —  every ordered pair
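As a minimal sketch of where the count of six comes from: three modalities give 3 × 2 = 6 ordered pairs of distinct modalities. The names `MODALITIES`, `src`, and `dst` are illustrative only and are not taken from the XModBench code.

```python
from itertools import permutations

# Modality abbreviations as used on the poster: Audio, Vision, Text.
MODALITIES = ["A", "V", "T"]

# Every ordered pair of two distinct modalities (hypothetical src/dst naming).
directions = [f"{src}→{dst}" for src, dst in permutations(MODALITIES, 2)]

print(len(directions))        # number of cross-modal directions
print(" · ".join(directions))
```

Order matters here: A→T and T→A count as separate directions, which is what lets the benchmark compare them.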
5 broad task families · 17 subtasks · 61,320 samples
Perception: what is it?
Spatial: where / direction?
Temporal: order & counting
Linguistic: speech & translation
Knowledge: music, movie, emotion
But spatial & temporal reasoning collapses
Perception 79.7 · Spatial 35.2 · Temporal 41.4 · Linguistic 82.5 · Knowledge 77.4
spatial 35.2 / temporal 41.4  vs  linguistic 82.5  —  accuracy %, Qwen3-Omni on XModBench-Lite
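To make the size of the collapse concrete, the sketch below computes each family's gap to the best-scoring family, using only the five accuracy values shown on the chart. The variable names are illustrative, and the gap is plain subtraction in percentage points, not a metric defined by the benchmark.

```python
# Accuracy (%) per task family, copied from the chart
# (Qwen3-Omni on XModBench-Lite).
accuracy = {
    "Perception": 79.7,
    "Spatial": 35.2,
    "Temporal": 41.4,
    "Linguistic": 82.5,
    "Knowledge": 77.4,
}

# Family with the highest accuracy, used as the reference point.
best = max(accuracy, key=accuracy.get)

# Gap to the best family, in percentage points, rounded to one decimal.
gaps = {family: round(accuracy[best] - score, 1) for family, score in accuracy.items()}

# Print families from largest gap to smallest.
for family, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{family:<11} {gap:5.1f} points below {best}")
```

Spatial and temporal trail the best family by 47.3 and 41.1 points, while perception and knowledge stay within about 5 points of it.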
The same knowledge, unequal across modalities
Audio: low · Text: high
A→T · T→A
modality disparity  &  directional imbalance
Cross-Modal Consistency
ICLR 2026
xingruiwang.github.io/projects/XModBench