For Omni-modality Language Models
Which represents a dog?
model sees →
🐕  Dog  ✓
model hears →
“a person talking”  ✗
🐕🔊
A model that knows a dog by sight can’t hear one.
Three modalities  ·  six cross-modal directions
A→V  ·  V→A  ·  A→T  ·  T→A  ·  V→T  ·  T→V
🔊 Audio (A)  ·  👁 Vision (V)  ·  Text (T)
The template
Same question, six configurations
Each item is a four-choice multiple-choice question — a <context> (the question stem) and four <candidates> (answer options). Swap which modality carries the context and which carries the candidates, and one item becomes six.
V → T  (vision context, text candidates)
Which sound matches what is shown?
A. dog howling
B. chicken clucking
C. crocodile hissing
D. cuckoo bird calling
A <context> + four <candidates>; the modality of each is varied to form V→T, V→A, T→A, T→V, A→V, A→T. Identical semantics, six modality routings — isolating cross-modal consistency.
V→T  ·  🖼 Vision (context) → 🔤 Text (candidates)
V→A  ·  🖼 Vision (context) → 🔊 Audio (candidates)
T→A  ·  🔤 Text (context) → 🔊 Audio (candidates)
T→V  ·  🔤 Text (context) → 🖼 Vision (candidates)
A→V  ·  🔊 Audio (context) → 🖼 Vision (candidates)
A→T  ·  🔊 Audio (context) → 🔤 Text (candidates)
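As a rough illustration of how one item fans out into six configurations, the sketch below enumerates every ordered pair of distinct modalities, with the first carrying the context and the second the candidates. The names `Routing` and `six_routings` are hypothetical and are not taken from the XModBench code.

```python
from dataclasses import dataclass
from itertools import permutations

# A = audio, V = vision, T = text
MODALITIES = ("A", "V", "T")


@dataclass(frozen=True)
class Routing:
    """One configuration: which modality carries the context, which the candidates."""
    context: str
    candidates: str

    @property
    def label(self) -> str:
        return f"{self.context}→{self.candidates}"


def six_routings(modalities=MODALITIES):
    """Return every ordered (context, candidates) pair of distinct modalities."""
    return [Routing(ctx, cand) for ctx, cand in permutations(modalities, 2)]


routings = six_routings()
labels = [r.label for r in routings]
print(len(labels), labels)
```

Three modalities give 3 × 2 = 6 ordered pairs, which matches the six directions listed above; the semantic content of the item stays fixed while only the routing changes.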
Five task families, 17 subtasks
Perception  ·  24,000 items
  • General activities
  • Fine-grained activities
  • Natural environments
  • Musical instruments
  • Instrument compositions
Spatial  ·  7,776 items
  • 2D Arrangement
  • 3D Localization
  • 3D Movement
Temporal  ·  9,000 items
  • Temporal Order
  • Temporal Counting
  • Temporal Calculation
Linguistic  ·  8,244 items
  • Recognition (OCR/ASR)
  • Translation (EN-ZH)
  • Emotion Classification
Knowledge  ·  12,300 items
  • Music Genre
  • Movie Recognition
  • Singer Identification
Detailed statistics
61,320 items  ·  17 subtasks
Perception · 24,000 (39%)
Spatial · 7,776 (13%)
Temporal · 9,000 (15%)
Linguistic · 8,244 (13%)
Knowledge · 12,300 (20%)
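The total and the rounded shares can be re-derived from the per-family counts. The short check below uses only the numbers shown on this slide; the variable names are illustrative.

```python
# Item counts per task family, as listed above.
ITEMS = {
    "Perception": 24_000,
    "Spatial": 7_776,
    "Temporal": 9_000,
    "Linguistic": 8_244,
    "Knowledge": 12_300,
}

total = sum(ITEMS.values())

# Share of each family, rounded to the nearest whole percent.
shares = {name: round(100 * count / total) for name, count in ITEMS.items()}

print(f"{total:,} items")
for name, pct in shares.items():
    print(f"{name}: {ITEMS[name]:,} ({pct}%)")
```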
But spatial & temporal reasoning collapses
Perception  ·  79.7
Spatial  ·  35.2
Temporal  ·  41.4
Linguistic  ·  82.5
Knowledge  ·  77.4
spatial 35.2 / temporal 41.4  vs  linguistic 82.5  —  accuracy %, Qwen3-Omni on XModBench-Lite
The same knowledge, unequal across modalities
Audio  ·  low
Text  ·  high
Qwen3-Omni  ·  directional imbalance (V↔T)  ·  accuracy %
V→T
79.7
T→V
66.0
−13.7 points: capability lost when the same pair is asked the other way
same knowledge, asked V→T vs T→V  —  13.7-pt directional gap (Qwen3-Omni, XModBench-Lite)
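The directional gap is simply the accuracy in one direction minus the accuracy in the reverse direction. A minimal sketch, using the two Qwen3-Omni figures above and a hypothetical helper named `directional_gap`:

```python
# Accuracy (%) per direction, as reported above for Qwen3-Omni on XModBench-Lite.
ACCURACY = {"V→T": 79.7, "T→V": 66.0}


def directional_gap(accuracy, forward, reverse):
    """Return reverse-direction accuracy minus forward-direction accuracy, in points."""
    return round(accuracy[reverse] - accuracy[forward], 1)


gap = directional_gap(ACCURACY, forward="V→T", reverse="T→V")
print(f"{gap:+.1f} points")
```

A gap of zero would mean the model answers the same pair equally well in both directions; a negative value means accuracy drops when the modalities of context and candidates are swapped.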
Cross-modal consistency, measured.
ICLR 2026
xingruiwang.github.io/projects/XModBench