The template
Same question, six configurations
Each item is a four-choice multiple-choice question — a <context> (the question stem) and four <candidates> (answer options). Swap which modality carries the context and which carries the candidates, and one item becomes six.
V → T (vision context, text candidates)
Which sound matches what is shown?

A. dog howling
B. chicken clucking
C. crocodile hissing
D. cuckoo bird calling
A <context> + four <candidates>; the modality of each is varied to form V→T, V→A, T→A, T→V, A→V, A→T. Identical semantics, six modality routings — isolating cross-modal consistency.
V→A: 🖼 Vision context → 🔊 Audio candidates
V→T: 🖼 Vision context → 🔤 Text candidates
T→A: 🔤 Text context → 🔊 Audio candidates
T→V: 🔤 Text context → 🖼 Vision candidates
A→V: 🔊 Audio context → 🖼 Vision candidates
A→T: 🔊 Audio context → 🔤 Text candidates
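
The six routings above are the ordered pairs of distinct modalities drawn from vision, text, and audio, so they can be enumerated rather than listed by hand. The following is a minimal sketch of that idea, not the benchmark's actual code: the names `MODALITIES`, `Configuration`, and `all_configurations` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import permutations

# Hypothetical single-letter codes for the three modalities named in the text.
MODALITIES = {"V": "Vision", "T": "Text", "A": "Audio"}


@dataclass(frozen=True)
class Configuration:
    """One routing: which modality carries the context, which the candidates."""
    context: str      # modality code of the <context>
    candidates: str   # modality code of the four <candidates>

    @property
    def label(self) -> str:
        # Short label in the same style as the legend, e.g. "V→T".
        return f"{self.context}→{self.candidates}"


def all_configurations() -> list[Configuration]:
    # Ordered pairs of two different modalities: 3 * 2 = 6 routings.
    return [Configuration(ctx, cand) for ctx, cand in permutations(MODALITIES, 2)]


if __name__ == "__main__":
    for cfg in all_configurations():
        print(f"{cfg.label}: {MODALITIES[cfg.context]} context, "
              f"{MODALITIES[cfg.candidates]} candidates")
```

Because context and candidates must differ in modality, same-modality pairs such as T→T never appear, which is why three modalities yield six configurations instead of nine.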