WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Xiaojie Xu1, Zhengyuan Lin1,2, Kang He1,3, Yukang Feng1,3

Xiaofeng Mao1, Yuanyang Yin1,3, Kaipeng Zhang1,3,†, Yongtao Ge1,†

1Alaya Studio, Shanda AI Research Tokyo, 2The University of Tokyo, 3Shanghai Innovation Institute

Corresponding authors: kaipeng.zhang@shanda.com, yongtao.ge@shanda.com




Abstract

Interactive video generation models—Genie, YUME, HY-World, Matrix-Game, among others—are advancing rapidly, yet every model is evaluated on its own proprietary benchmark with bespoke scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions—identical scenes, identical action sequences, and a unified control interface—needed to make those metrics comparable across models with heterogeneous inputs.

We introduce WorldMark, the first benchmark that provides such a level playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20–60 s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We release all data, evaluation code, and model outputs to facilitate future research.


Overview

Current interactive world models each define their own control interfaces, scenes, and evaluation protocols, making cross-model comparison impossible. WorldMark resolves this fragmentation through three pillars: (a) a curated test suite of 500 cases spanning first- and third-person views with photorealistic and stylized scenes; (b) a unified action-mapping layer translating a shared WASD vocabulary into each model's native format; (c) a cross-model comparison protocol in which the same reference image and the same action sequence are fed to every model, revealing striking quality divergence. A minimal sketch of pillar (b) follows the overview figure below.

Fig. 1: Overview of WorldMark: diverse scenes, unified action mapping, and cross-model comparison.
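To make the action-mapping pillar concrete, here is a minimal sketch of how such a layer can be structured. The adapter names and the native control formats below are illustrative assumptions, not WorldMark's released implementation:

```python
# Hypothetical sketch of a unified action-mapping layer. Adapter names and
# native control formats are illustrative, not WorldMark's actual code.
from typing import Callable

# Shared vocabulary: WASD translation plus left/right (L/R) yaw, one token per step.
SHARED_VOCAB = ("W", "A", "S", "D", "L", "R")

# One adapter per model, translating a shared token into the model's native control.
ADAPTERS: dict[str, Callable[[str], dict]] = {
    # A model driven by continuous camera velocities (assumed format).
    "velocity_style": lambda a: {
        "W": {"vx": 1.0}, "S": {"vx": -1.0},
        "A": {"vy": -1.0}, "D": {"vy": 1.0},
        "L": {"yaw": -1.0}, "R": {"yaw": 1.0},
    }[a],
    # A model driven by discrete key presses and mouse deltas (assumed format).
    "keyboard_style": lambda a: (
        {"key": a} if a in "WASD" else {"mouse_dx": -10.0 if a == "L" else 10.0}
    ),
}

def translate(sequence: list[str], model: str) -> list[dict]:
    """Map one shared action sequence onto a model's native control stream."""
    return [ADAPTERS[model](tok) for tok in sequence]

# The same trajectory can now drive models with heterogeneous inputs:
trajectory = ["W", "W", "R", "W"]
for name in ADAPTERS:
    print(name, translate(trajectory, name))
```

The property that matters is that every model consumes exactly the same shared trajectory, so any difference in the resulting videos is attributable to the model rather than to its inputs.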

Why WorldMark?

Existing benchmarks lack standardized scenes, unified cross-model action mapping, keyboard-interactive support, or difficulty hierarchies. WorldMark is the first to provide all of these.

| Benchmark | Target Task | Std. Scenes & Actions | Unified Cross-Model | Keyboard Interactive | Difficulty Hierarchy |
|---|---|---|---|---|---|
| VBench | T2V | ✗ | ✗ | ✗ | ✗ |
| VBench++ | I2V | ✗ | ✗ | ✗ | ✗ |
| WorldScore | Traj.-cond. I2V/I2-3D | ~ | ✗ | ✗ | ✗ |
| MIND | Traj.-cond. I2V/V2V | ~ | ✗ | ✗ | ✗ |
| WorldMark (Ours) | Interactive I2V | ✓ | ✓ | ✓ | ✓ |

(✓ = supported, ~ = partial, ✗ = not supported.)

Image Suite

The Image Suite contains 50 diverse reference images spanning three scene categories (Nature, City, Indoor), two visual styles (Real, Stylized), and two viewpoints (First-person, Third-person). The stylized subset covers oil painting, Ukiyo-e, cyberpunk, and Minecraft aesthetics, ensuring evaluation is not biased toward photorealistic domains alone. Pairing each scene across both viewpoints yields 100 test images in total.

Fig. 2: Overview of the Image Suite. Diverse scenes and styles are covered, each shown in first-person and generated third-person views.
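For illustration, the suite's taxonomy can be captured in a small metadata record; the schema below is an assumption for exposition, not the released data format:

```python
# Illustrative metadata schema for the Image Suite (field names are assumed).
from dataclasses import dataclass

@dataclass(frozen=True)
class SuiteImage:
    image_id: int    # 1..100
    scene: str       # "Nature" | "City" | "Indoor"
    style: str       # "Real" | "Stylized"
    viewpoint: str   # "First-person" | "Third-person"
    pair_id: int     # links the two viewpoint versions of the same scene

# One cross-viewpoint pair; 50 such pairs give the 100 test images.
fp = SuiteImage(1, "City", "Stylized", "First-person", pair_id=1)
tp = SuiteImage(2, "City", "Stylized", "Third-person", pair_id=1)
```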

Action Suite

The Action Suite comprises 15 standardized action sequences of increasing complexity, expressed in a shared vocabulary of WASD movement plus left/right (L/R) yaw rotation. A sketch of the sequence encoding follows Fig. 3 below.

Easy (20s, single-segment)

Single primitive motions that test basic action compliance. Actions 1–5: Forward, Backward, Strafe Left, Strafe Right, Pan Right.

Medium (40s, two-segment)

Smooth transitions between movement types. Actions 6–10: Forth & Back, Pan L & R, Strafe L & R, Walk+Pan, Reverse+Pan.

Hard (60s, three-segment)

Complex patrol routes and 360° panoramic rotations. Actions 11–15: Patrol, Sweep, Right Turn, Zigzag, Peek+Retract.

Fig. 3: The 15 standardized action sequences, ranging from elementary translations and rotations to combined and cyclic trajectories.
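Concretely, each sequence can be thought of as a list of (action, duration) segments. The sketch below encodes one example per tier; the segment contents are hedged illustrations consistent with the tier descriptions, not the suite's exact definitions:

```python
# Hedged sketch of the tiered sequence encoding; the segment contents are
# examples consistent with the tier descriptions, not the exact suite data.
Segment = tuple[str, float]   # (shared action token, duration in seconds)
Sequence = list[Segment]

EASY_FORWARD: Sequence = [("W", 20.0)]                           # 1 segment, 20 s
MEDIUM_FORTH_AND_BACK: Sequence = [("W", 20.0), ("S", 20.0)]     # 2 segments, 40 s
HARD_PATROL: Sequence = [("W", 20.0), ("R", 20.0), ("W", 20.0)]  # 3 segments, 60 s

def duration(seq: Sequence) -> float:
    """Total length of a sequence in seconds."""
    return sum(d for _, d in seq)

assert duration(EASY_FORWARD) == 20.0 and duration(HARD_PATROL) == 60.0
```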

Pairing the 15 sequences with the 100 test images gives 1,500 candidate cases; a VLM-based scene-aware filtering step keeps only the physically plausible action assignments for each image, yielding approximately 500 standardized evaluation cases.

Fig. 4: Context-aware action selection. A VLM analyzes the initial image to identify physical constraints and selects plausible action sequences from the predefined library.
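A minimal sketch of this filtering loop is given below; `vlm_is_plausible` is a hypothetical stand-in for the actual VLM query, whose prompt and backbone are not specified on this page:

```python
# Sketch of scene-aware case construction. `vlm_is_plausible` is a stand-in
# for the real VLM query; the prompt and model choice are assumptions.
def vlm_is_plausible(image_path: str, action_name: str) -> bool:
    """Ask a VLM whether the action is physically plausible from this image,
    e.g. reject 'Forward' when the camera faces a nearby wall."""
    raise NotImplementedError  # placeholder for an actual VLM call

def build_cases(images: list[str], actions: list[str]) -> list[tuple[str, str]]:
    """100 images x 15 actions = 1,500 candidates, filtered to ~500 cases."""
    return [(img, act)
            for img in images
            for act in actions
            if vlm_is_plausible(img, act)]
```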

Quantitative Results

WorldMark evaluates generated videos along three complementary dimensions with eight metrics (a sketch of the Control Alignment computation follows the list):

Visual Quality

  • Aesthetic Quality — human-perceived aesthetic appeal via LAION aesthetic predictor
  • Imaging Quality — low-level distortion detection via MUSIQ

Control Alignment

  • Translation Error — 3D camera pose via DROID-SLAM reconstruction
  • Rotation Error — geodesic angular deviation of heading

World Consistency

  • Reprojection Error — 3D spatial coherence via DROID-SLAM with dense bundle adjustment (DBA)
  • State Consistency — spatiotemporal object stability
  • Content Consistency — hallucination detection
  • Style Consistency — global visual uniformity
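
As one illustration, the two Control Alignment errors can be computed from camera poses estimated by a SLAM system against the commanded trajectory. The sketch below assumes poses that are already scale-aligned and expressed in a common reference frame; the released toolkit may differ in normalization details:

```python
# Minimal sketch of the two Control Alignment metrics. Assumes estimated and
# commanded poses are already scale-aligned and in a common reference frame.
import numpy as np

def translation_error(t_est: np.ndarray, t_gt: np.ndarray) -> float:
    """Mean Euclidean deviation between estimated and commanded positions.
    t_est, t_gt: (N, 3) arrays of camera positions along the trajectory."""
    return float(np.linalg.norm(t_est - t_gt, axis=-1).mean())

def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angular deviation between two 3x3 rotation matrices:
    theta = arccos((trace(R_est^T @ R_gt) - 1) / 2), returned in degrees."""
    cos_theta = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
```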

First-Person Real

| Metric | YUME 1.5 | MatrixGame 2.0 | HY-World | HY-Game | Oasis | Genie 3 |
|---|---|---|---|---|---|---|
| Visual Quality | | | | | | |
| Aesthetic Quality ↑ | 56.94 | 49.40 | 54.79 | 46.59 | 29.31 | 45.58 |
| Imaging Quality ↑ | 74.36 | 68.11 | 69.37 | 49.31 | 28.08 | 64.14 |
| Control Alignment | | | | | | |
| Translation Error ↓ | 0.199 | 0.222 | 0.191 | 0.159 | 0.376 | 0.498 |
| Rotation Error ↓ | 2.107 | 1.324 | 2.079 | 6.019 | 4.892 | 4.247 |
| World Consistency | | | | | | |
| Reprojection Error ↓ | 0.549 | 0.688 | 0.702 | 0.447 | 1.938 | 0.441 |
| State Consistency ↑ | 5.344 | 4.151 | 5.913 | 4.073 | 2.585 | 6.416 |
| Content Consistency ↑ | 3.820 | 7.415 | 6.352 | 5.814 | 3.748 | 6.914 |
| Style Consistency ↑ | 7.119 | 3.181 | 5.142 | 3.726 | 1.797 | 8.158 |

First-Person Stylized

| Metric | YUME 1.5 | MatrixGame 2.0 | HY-World | HY-Game | Oasis | Genie 3 |
|---|---|---|---|---|---|---|
| Visual Quality | | | | | | |
| Aesthetic Quality ↑ | 57.03 | 47.74 | 58.50 | 44.02 | 30.84 | 46.84 |
| Imaging Quality ↑ | 69.15 | 64.24 | 64.78 | 40.91 | 28.44 | 53.27 |
| Control Alignment | | | | | | |
| Translation Error ↓ | 0.223 | 0.182 | 0.244 | 0.116 | 0.350 | 0.261 |
| Rotation Error ↓ | 2.732 | 1.561 | 4.316 | 0.932 | 3.808 | 2.835 |
| World Consistency | | | | | | |
| Reprojection Error ↓ | 0.638 | 0.672 | 0.638 | 0.640 | 1.877 | 0.256 |
| State Consistency ↑ | 5.891 | 2.873 | 6.408 | 4.782 | 3.523 | 6.835 |
| Content Consistency ↑ | 5.362 | 6.457 | 7.159 | 5.196 | 3.114 | 7.306 |
| Style Consistency ↑ | 4.216 | 4.934 | 6.817 | 4.051 | 2.435 | 7.523 |

Third-Person

Only three models natively support third-person perspective: MatrixGame 2.0, HY-World, and Genie 3.

| Metric | MatrixGame 2.0 (Real) | HY-World (Real) | Genie 3 (Real) | MatrixGame 2.0 (Stylized) | HY-World (Stylized) | Genie 3 (Stylized) |
|---|---|---|---|---|---|---|
| Visual Quality | | | | | | |
| Aesthetic Quality ↑ | 52.78 | 57.69 | 51.04 | 51.60 | 60.57 | 53.76 |
| Imaging Quality ↑ | 67.26 | 70.76 | 60.20 | 65.24 | 66.45 | 63.98 |
| Control Alignment | | | | | | |
| Translation Error ↓ | 0.284 | 0.206 | 0.212 | 0.230 | 0.220 | 0.129 |
| Rotation Error ↓ | 27.606 | 2.137 | 14.905 | 9.211 | 5.285 | 8.823 |
| World Consistency | | | | | | |
| Reprojection Error ↓ | 0.814 | 0.640 | 0.584 | 0.744 | 0.713 | 1.148 |
| State Consistency ↑ | 5.136 | 6.628 | 7.082 | 3.625 | 5.274 | 7.565 |
| Content Consistency ↑ | 3.405 | 5.707 | 7.424 | 2.083 | 5.147 | 7.109 |
| Style Consistency ↑ | 1.659 | 4.491 | 8.247 | 2.942 | 7.236 | 8.541 |

Key Takeaways

Visual quality and world consistency are largely uncorrelated. YUME 1.5 produces the most appealing frames yet its worlds lack global coherence, while Genie 3 maintains the most consistent worlds with only moderate frame-level fidelity.

Strong control alignment does not imply overall quality. HY-Game follows commands precisely but at the cost of visual fidelity, whereas Genie 3 has higher trajectory error yet preserves a globally coherent world.

Third-person generation exposes severe weaknesses. MatrixGame 2.0's rotation error grows by roughly 20× from first- to third-person (1.324 → 27.606 on real scenes), highlighting the difficulty of maintaining camera control around a visible character.

Domain-specific training does not transfer. Oasis (Open-Oasis), trained on Minecraft, fails across all metrics on real-world and stylized scenes.

BibTeX

@article{worldmark2026,
  title={WorldMark: A Unified Benchmark Suite for Interactive Video World Models},
  author={Xu, Xiaojie and Lin, Zhengyuan and He, Kang and Feng, Yukang and Mao, Xiaofeng and Yin, Yuanyang and Zhang, Kaipeng and Ge, Yongtao},
  journal={arXiv preprint},
  year={2026}
}