🎉 ICLR 2026

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

Can Omni-Language Models Hit a Home Run?


XModBench contains 60K multiple-choice questions across five task families and systematically covers all six cross-modality directions, enabling diagnosis of task competence, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) suffers from modality disparities, with performance dropping by over 20 points on average when audio inputs replace text, and (iii) exhibits directional imbalance, with a 9-point gap when using vision as context versus using text as context.

Abstract

Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks have advanced multimodal evaluation, it remains unclear whether OLLMs achieve modality-invariant reasoning or inherit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency.

XModBench contains 60K multiple-choice questions across five task families and systematically covers all six cross-modality directions, enabling diagnosis of task competence, modality disparity, and directional imbalance. The findings suggest that OLLMs fall short of modality-invariant reasoning, and XModBench provides a fundamental diagnostic tool for evaluating and improving their overall cross-modal competence.

60K+ question-answer pairs · 6 cross-modal directions · 5 task families · 17 subtasks

Benchmark Design

Core Innovation: Modality-Balanced Configuration

The central objective of XModBench is to evaluate whether models preserve cross-modal consistency when the same semantic content appears in different modalities. Each item is a four-choice multiple-choice question consisting of a <context> (question stem) and four <candidates> (answer options).

By systematically permuting text (T), vision (V), and audio (A) across the context and candidates, we generate six modality configurations of the same question:

Figure: Cross-modal configuration diagram. Example configuration V→T: an image context with four text candidates.
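The six configurations follow from choosing an ordered pair of distinct modalities, one for the context and one for the candidates. A minimal sketch of that enumeration, where the letter codes mirror the T/V/A notation above and the variable names are illustrative rather than taken from the benchmark code:

```python
from itertools import permutations

MODALITIES = ["T", "V", "A"]  # text, vision, audio

# Each configuration is an ordered (context, candidates) pair of distinct modalities.
configs = [f"{ctx}→{cand}" for ctx, cand in permutations(MODALITIES, 2)]

print(configs)  # ['T→V', 'T→A', 'V→T', 'V→A', 'A→T', 'A→V']
```

Three modalities give 3 × 2 = 6 ordered pairs, which is why every question appears in exactly six forms.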

Three Diagnostic Properties

This balanced design supports diagnosis of cross-modal consistency along three complementary evaluation dimensions:

  1. Task Competence
    By averaging accuracy across all six modality configurations, we obtain a fair measure of a model's overall capability for each task, independent of modality-specific biases. This reveals which fundamental capabilities models truly possess versus which they fake through modality shortcuts.
  2. Modality Disparity
    By presenting semantically identical questions under different modality configurations, we isolate modality as the only variable. Accuracy differences reveal which modalities models handle best or worst—for example, comparing T→A vs T→V shows whether models understand audio or vision better when given the same text context.
  3. Directional Imbalance
    By examining inverse settings—swapping context and candidate modalities (e.g., V→T vs T→V)—we expose asymmetries in cross-modal grounding. Large gaps indicate that models perform better in certain directions due to training data imbalances, rather than achieving true bidirectional understanding.
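The three properties can all be read off a table of per-configuration accuracies. The sketch below is one plausible formulation, not necessarily the exact metric definitions used in the paper; the function names are hypothetical and the accuracy values are invented placeholders, not benchmark results.

```python
from statistics import mean

# Placeholder accuracies (%) for one task, keyed by "context→candidates".
# These numbers are invented for illustration only.
acc = {
    "T→V": 70.0, "V→T": 80.0,
    "T→A": 50.0, "A→T": 56.0,
    "V→A": 40.0, "A→V": 42.0,
}

def task_competence(acc):
    """Mean accuracy over all six modality configurations."""
    return mean(acc.values())

def modality_disparity(acc, context, cand_a, cand_b):
    """Accuracy gap between two candidate modalities under the same context."""
    return acc[f"{context}→{cand_a}"] - acc[f"{context}→{cand_b}"]

def directional_imbalance(acc, m1, m2):
    """Accuracy gap between a configuration and its inverse."""
    return acc[f"{m1}→{m2}"] - acc[f"{m2}→{m1}"]

print(task_competence(acc))                     # average over six configs
print(modality_disparity(acc, "T", "A", "V"))   # T→A minus T→V
print(directional_imbalance(acc, "V", "T"))     # V→T minus T→V
```

With these placeholder values, a negative disparity means audio candidates are harder than vision candidates for the same text context, and a positive imbalance means V→T is easier than T→V.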

Task Taxonomy

XModBench covers 5 task families with 17 subtasks — perception, spatial, temporal, linguistic and external-knowledge reasoning, all in the modality-balanced multiple-choice format.

Items per task family (61,320 items · 17 subtasks):

  • Perception: 24,000 (39%)
  • Spatial: 7,776 (13%)
  • Temporal: 9,000 (15%)
  • Linguistic: 12,444 (20%)
  • Knowledge: 8,100 (13%)
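The total and the percentage shares can be checked directly from the per-family counts. A small sketch, using only the counts listed above (the variable names are illustrative):

```python
# Item counts per task family, as listed above.
counts = {
    "Perception": 24_000,
    "Spatial": 7_776,
    "Temporal": 9_000,
    "Linguistic": 12_444,
    "Knowledge": 8_100,
}

total = sum(counts.values())
# Share of each family, rounded to the nearest whole percent.
shares = {family: round(100 * n / total) for family, n in counts.items()}

print(total)   # 61320
print(shares)  # {'Perception': 39, 'Spatial': 13, 'Temporal': 15, 'Linguistic': 20, 'Knowledge': 13}
```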
Perception (24,000 items)

Recognition of objects, activities and scenes across modalities

  • General activities
  • Fine-grained recognition
  • Musical instruments
  • Instrument comparison
  • Natural environments

Spatial (7,776 items)

Object positions and motion in 2D / 3D space

  • 2D Arrangement
  • 3D Localization
  • 3D Movement

Temporal (9,000 items)

Event order and frequency across time

  • Temporal Order
  • Temporal Counting
  • Temporal Calculation

Linguistic (12,444 items)

OCR / ASR in cross-modal settings with affective understanding

  • Recognition (OCR/ASR)
  • Translation (EN-ZH)
  • Emotion Classification

Knowledge (8,100 items)

Linking multimodal content with world & cultural knowledge

  • Music Genre
  • Movie Matching
  • Singer ID

Data Construction Pipeline

The benchmark is built through a rigorous three-stage pipeline:

  1. Cross-Modal Data Collection: Combining re-annotated datasets (VGG-Sound, STARSS23), synthetic generation (TTS, rendered text), and targeted web collection (YouTube trailers, singer portraits)
  2. Question Generation: Task-specific templates refined with GPT, semantically challenging distractors, and diversified prompts
  3. Quality Assurance: LLM filtering, human verification, and iterative testing to ensure accuracy and eliminate ambiguities
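To make the question-generation stage concrete, here is a minimal sketch of assembling one four-choice item from a correct answer and three distractors. It is an illustration under stated assumptions, not the authors' pipeline: the `build_item` helper, the dictionary keys, and the example question text are all hypothetical.

```python
import random

def build_item(context, answer, distractors, rng):
    """Assemble a four-choice item: one correct answer plus three distractors, shuffled."""
    if len(distractors) != 3:
        raise ValueError("expected exactly three distractors")
    candidates = [answer, *distractors]
    rng.shuffle(candidates)  # randomize the position of the correct answer
    return {
        "context": context,                      # hypothetical field name
        "candidates": candidates,                # hypothetical field name
        "answer_index": candidates.index(answer),
    }

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
item = build_item(
    context="Which instrument is playing in the clip?",  # hypothetical template text
    answer="violin",
    distractors=["cello", "viola", "double bass"],
    rng=rng,
)
print(item)
```

Choosing distractors from the same category as the answer (here, other string instruments) is one way to obtain the semantically challenging options the pipeline describes.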

Key Findings

We evaluated 13 state-of-the-art omni-modal models including Gemini 2.5 Pro, Qwen2.5-Omni, EchoInk-R1, and others. The results reveal systematic weaknesses across three dimensions:

1. Task Competence Gaps

Models perform well on perception and linguistic tasks (the best model reaches about 76%) but struggle with spatial and temporal reasoning:

  • Gemini 2.5 Pro: 75.9% (Perception), 76.8% (Linguistic), but only 50.1% (Spatial) and 60.8% (Temporal)
  • Spatial & Temporal Reasoning: All models drop 15-25 points compared to perception tasks
  • Open-source Models: Show even larger gaps, with some scoring below 40% on spatial/temporal tasks

Figure: Task competence by task family.

2. Modality Disparity

Performance varies dramatically across modalities, with audio being the most challenging:

  • Audio vs. Text: Models drop 20-49 points when audio replaces text inputs
  • Audio vs. Vision: 33-point average gap, showing difficulty in aligning heterogeneous signals
  • Vision vs. Text: Smaller but still significant ~15-point disparity
  • Consistency (Std. Dev.): Best models show 10-12 point standard deviation across configurations
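The consistency figure is a standard deviation of accuracy across the six configurations. A minimal sketch of that computation, assuming the population standard deviation (the paper may use the sample version instead) and using invented placeholder accuracies rather than benchmark results:

```python
from statistics import mean, pstdev

# Placeholder accuracies (%) across the six configurations; not benchmark results.
acc = {"T→V": 70.0, "V→T": 80.0, "T→A": 50.0, "A→T": 56.0, "V→A": 40.0, "A→V": 42.0}

values = list(acc.values())
avg = mean(values)
std = pstdev(values)  # population std; an assumption about the exact definition

print(f"avg={avg:.1f}, std={std:.1f}")
```

A perfectly modality-invariant model would score the same in every configuration, giving a standard deviation of zero.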

Figure: Modality disparity (a downward bar means lower accuracy on audio).

3. Directional Imbalance

Models exhibit asymmetric performance when context and candidate modalities are swapped:

  • Vision↔Text: 9-17 point gaps between V→T and T→V directions
  • Audio↔Text: 6-8 point asymmetries in bidirectional settings
  • Audio↔Vision: Nearly symmetric but with much lower overall accuracy
  • Root Cause: Training data imbalance—models heavily trained on image-to-text QA, less on inverse directions

Figure: Directional imbalance.

Human Performance

Human evaluation on sampled questions shows consistently high performance across all modalities:

  • Overall Average: 91.5% accuracy (vs. 70.6% for best model)
  • Perception: 91.0% (vs. 75.9%)
  • Spatial: 89.7% (vs. 50.1%)
  • Temporal: 88.9% (vs. 60.8%)
  • Linguistic: 93.9% (vs. 76.8%)
  • Knowledge: 93.9% (vs. 89.3%)

This demonstrates substantial room for improvement, especially in spatial and temporal reasoning, where the human-model gap is about 40 and 28 points, respectively.
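The per-family gaps follow directly from the human and best-model accuracies listed above. A short sketch of the arithmetic (the variable names are illustrative):

```python
# Accuracy (%) per task family, copied from the list above.
human = {"Perception": 91.0, "Spatial": 89.7, "Temporal": 88.9,
         "Linguistic": 93.9, "Knowledge": 93.9}
best_model = {"Perception": 75.9, "Spatial": 50.1, "Temporal": 60.8,
              "Linguistic": 76.8, "Knowledge": 89.3}

# Human-minus-model gap in percentage points, rounded to one decimal.
gaps = {family: round(human[family] - best_model[family], 1) for family in human}

# Print families from largest to smallest gap.
for family, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{family}: {gap} points")
```

Spatial (39.6 points) and Temporal (28.1 points) show the largest gaps, while Knowledge (4.6 points) is closest to human performance.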

  • Audio-Text disparity: -49 points
  • Audio-Vision gap: -33 points
  • V→T vs T→V imbalance: 9 points
  • Gap to human performance: 21 points (91.5% vs. 70.6%)

Model-Specific Insights

Gemini 2.5 Pro (Best Overall: 70.6% avg, 11.7 std)

Qwen2.5-Omni (Best Open-Source: 58.6% avg, 10.1 std)

EchoInk-R1 (Strong Open Alternative: 59.2% avg, 11.3 std)

Key Takeaway: Even the best models fall far short of modality-invariant reasoning, with systematic biases toward text and vision over audio, and asymmetric performance when modality roles are reversed.

Use the Dataset

XModBench is hosted on the Hugging Face Hub at RyanWW/XModBench. Each item is a 4-choice MCQ with media (audio / image / video / text) in both the question stem and the options.

Load the data

    # pip install datasets huggingface_hub
    from datasets import load_dataset

    # one modality config (e.g. audio-condition → text-options)
    ds = load_dataset("json", data_files="hf://datasets/RyanWW/XModBench/data/audio_text.jsonl", split="train")

    # XModBench-Lite (6k balanced split)
    lite = load_dataset("json", data_files="hf://datasets/RyanWW/XModBench/data_lite/a2t.jsonl", split="train")

Evaluate a model with lmms-eval

The lmms-eval fork has XModBench pre-integrated; the dataset auto-downloads on first run.

    git clone https://github.com/XingruiWang/lmms-eval.git
    cd lmms-eval && pip install -e ".[all]"

    # XModBench-Lite, all 6 configs, resource-aware GPU profile
    ./submit_lite.sh qwen2_5_omni_interleave Qwen/Qwen2.5-Omni-7B qwenomni3

    # Level-2 metrics: by-config / by-family / disparity / imbalance
    python lmms_eval/tasks/xmod_bench/summarize.py \
        --logs logs/xmod_bench_lite/results_qwen2_5_omni_interleave/

See the task README and RESULTS.md for the full reproduction guide, per-model environment notes, and updated numbers.