XModBench — teaser

For Omni-modality Language Models

Which represents a dog?

model sees →

🐕 Dog ✓

model hears →

“a person talking” ✗

🐕≠🔊

A model that knows a dog by sight can’t hear one.

Three modalities · six cross-modal directions

The template

Same question, six configurations

Each item is a four-choice multiple-choice question — a <context> (the question stem) and four <candidates> (answer options). Swap which modality carries the context and which carries the candidates, and one item becomes six.

V→T (vision context, text candidates)

Which sound matches what is shown?

A. dog howling

B. chicken clucking

C. crocodile hissing

D. cuckoo bird calling

A <context> + four <candidates>; the modality of each is varied to form V→T, V→A, T→A, T→V, A→V, A→T. Identical semantics, six modality routings — isolating cross-modal consistency.

V→T · 🖼 Vision (context) → 🔤 Text (candidates)

V→A · 🖼 Vision (context) → 🔊 Audio (candidates)

T→A · 🔤 Text (context) → 🔊 Audio (candidates)

T→V · 🔤 Text (context) → 🖼 Vision (candidates)

A→V · 🔊 Audio (context) → 🖼 Vision (candidates)

A→T · 🔊 Audio (context) → 🔤 Text (candidates)
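
The six routings above are the ordered pairs of two different modalities drawn from vision, audio, and text. The following is a minimal sketch of that expansion, not the benchmark's actual data loader: the `item` layout, the `expand` helper, and the file names are all hypothetical placeholders chosen for illustration.

```python
from itertools import permutations

# Modality codes as used in the direction labels above.
MODALITIES = ("V", "A", "T")

def directions(modalities=MODALITIES):
    """All ordered (context, candidates) pairs with two different modalities."""
    return list(permutations(modalities, 2))

def expand(item):
    """Turn one semantic item into one configuration per direction.

    Assumed (hypothetical) layout: item["context"][m] and
    item["candidates"][m] hold the rendering of the context and of the
    four candidates in modality m.
    """
    return {
        f"{ctx}→{cand}": {
            "context": item["context"][ctx],
            "candidates": item["candidates"][cand],
            "answer": item["answer"],
        }
        for ctx, cand in directions()
    }

# Placeholder item; the file names are illustrative, not benchmark assets.
labels = ["dog howling", "chicken clucking", "crocodile hissing", "cuckoo bird calling"]
item = {
    "context": {"V": "dog.jpg", "A": "dog.wav", "T": "dog howling"},
    "candidates": {
        "V": [f"option_{i}.jpg" for i in range(4)],
        "A": [f"option_{i}.wav" for i in range(4)],
        "T": labels,
    },
    "answer": "A",
}

configs = expand(item)
print(sorted(configs))
```

The correct answer is the same in every configuration; only the modality carrying the context and the candidates changes, which is what lets accuracy differences be attributed to the routing rather than to the question.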

Five task families, 17 subtasks

Perception · 24,000 items

General activities
Fine-grained activities
Natural environments
Musical instruments
Instrument compositions

Spatial · 7,776 items

2D Arrangement
3D Localization
3D Movement

Temporal · 9,000 items

Temporal Order
Temporal Counting
Temporal Calculation

Linguistic · 8,244 items

Recognition (OCR/ASR)
Translation (EN-ZH)
Emotion Classification

Knowledge · 12,300 items

Music Genre
Movie Recognition
Singer Identification

Detailed statistics

Perception · 24,000 (39%)

Spatial · 7,776 (13%)

Temporal · 9,000 (15%)

Linguistic · 8,244 (13%)

Knowledge · 12,300 (20%)
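
As a quick check of the shares above, the percentages can be recomputed from the five listed counts. This sketch takes those counts at face value and rounds to the nearest whole percent; the variable names are illustrative only.

```python
# Item counts per task family, copied from the statistics above.
counts = {
    "Perception": 24_000,
    "Spatial": 7_776,
    "Temporal": 9_000,
    "Linguistic": 8_244,
    "Knowledge": 12_300,
}

total = sum(counts.values())

# Share of each family, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name} · {counts[name]:,} ({pct}%)")
print(f"Total · {total:,}")
```

The listed counts sum to 61,320, and the rounded shares reproduce the 39 / 13 / 15 / 13 / 20 split shown above.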

But spatial & temporal reasoning collapses

Perception · 79.7

Spatial · 35.2

Temporal · 41.4

Linguistic · 82.5

Knowledge · 77.4

spatial 35.2 / temporal 41.4 vs linguistic 82.5 — accuracy %, Qwen3-Omni on XModBench-Lite

The same knowledge, unequal across modalities

Audio: low · Text: high

Qwen3-Omni · directional imbalance (V↔T) · accuracy %

V→T · 79.7

T→V · 66.0

−13.7 · capability lost when the same pair is asked the other way

same knowledge, asked V→T vs T→V — 13.7-pt directional gap (Qwen3-Omni, XModBench-Lite)
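
The directional gap is the difference between the accuracies of the two opposite routings of one modality pair. A minimal sketch using the two Qwen3-Omni figures quoted above; the `directional_gap` helper is a hypothetical name, not part of any released XModBench code.

```python
# Accuracy (%) for the two directions of the vision/text pair, from above.
accuracy = {"V→T": 79.7, "T→V": 66.0}

def directional_gap(acc, forward, backward):
    """Accuracy difference, in points, between two opposite routings."""
    return round(acc[forward] - acc[backward], 1)

gap = directional_gap(accuracy, "V→T", "T→V")
print(f"V→T minus T→V: {gap} points")
```

A gap of zero would mean the model answers equally well whichever modality carries the context; here the text-context, vision-candidates direction is the weaker one.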

XModBench

Cross-modal consistency, measured.

ICLR 2026

xingruiwang.github.io/projects/XModBench