WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Xiaojie Xu1, Zhengyuan Lin1,2, Kang He1,3, Yukang Feng1,3

Xiaofeng Mao1, Yuanyang Yin1,3, Kaipeng Zhang1,3,†, Yongtao Ge1,†

1Alaya Studio, Shanda AI Research Tokyo, 2The University of Tokyo, 3Shanghai Innovation Institute

Corresponding authors: kaipeng.zhang@shanda.com, yongtao.ge@shanda.com




Abstract

Interactive video generation models—Genie, YUME, HY-World, Matrix-Game, among others—are advancing rapidly, yet every model is evaluated on its own proprietary benchmark with bespoke scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions—identical scenes, identical action sequences, and a unified control interface—needed to make those metrics comparable across models with heterogeneous inputs.

We introduce WorldMark, the first benchmark that provides such a level playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20–60 s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We release all data, evaluation code, and model outputs to facilitate future research.


Overview

Current interactive world models each define their own control interfaces, scenes, and evaluation protocols, making cross-model comparison impossible. WorldMark resolves this fragmentation through three pillars: (a) a curated test suite of 500 cases spanning first- and third-person views with photorealistic and stylized scenes; (b) a unified action-mapping layer translating a shared WASD vocabulary into each model's native format; (c) a cross-model comparison protocol in which the same reference image and the same action sequence are fed to every model, revealing striking quality divergence. A minimal sketch of pillar (b) follows the overview figure below.

Fig. 1: Overview of WorldMark: diverse scenes, unified action mapping, and cross-model comparison.
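To make the action-mapping pillar concrete, here is a minimal sketch of how such a layer can be structured. The adapter names and the native control formats below are illustrative assumptions, not WorldMark's released implementation:

```python
# Hypothetical sketch of a unified action-mapping layer. Adapter names and
# native control formats are illustrative, not WorldMark's actual code.
from typing import Callable

# Shared vocabulary: WASD translation plus left/right (L/R) yaw, one token per step.
SHARED_VOCAB = ("W", "A", "S", "D", "L", "R")

# One adapter per model, translating a shared token into the model's native control.
ADAPTERS: dict[str, Callable[[str], dict]] = {
    # A model driven by continuous camera velocities (assumed format).
    "velocity_style": lambda a: {
        "W": {"vx": 1.0}, "S": {"vx": -1.0},
        "A": {"vy": -1.0}, "D": {"vy": 1.0},
        "L": {"yaw": -1.0}, "R": {"yaw": 1.0},
    }[a],
    # A model driven by discrete key presses and mouse deltas (assumed format).
    "keyboard_style": lambda a: (
        {"key": a} if a in "WASD" else {"mouse_dx": -10.0 if a == "L" else 10.0}
    ),
}

def translate(sequence: list[str], model: str) -> list[dict]:
    """Map one shared action sequence onto a model's native control stream."""
    return [ADAPTERS[model](tok) for tok in sequence]

# The same trajectory can now drive models with heterogeneous inputs:
trajectory = ["W", "W", "R", "W"]
for name in ADAPTERS:
    print(name, translate(trajectory, name))
```

The property that matters is that every model consumes exactly the same shared trajectory, so any difference in the resulting videos is attributable to the model rather than to its inputs.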

Why WorldMark?

Existing benchmarks lack standardized scenes, unified cross-model action mapping, keyboard-interactive support, or difficulty hierarchies. WorldMark is the first to provide all of these.

| Benchmark | Target Task | Std. Scenes & Actions | Unified Cross-Model | Keyboard Interactive | Difficulty Hierarchy |
|---|---|---|---|---|---|
| VBench | T2V | ✗ | ✗ | ✗ | ✗ |
| VBench++ | I2V | ✗ | ✗ | ✗ | ✗ |
| WorldScore | Traj.-cond. I2V/I2-3D | ~ | ✗ | ✗ | ✗ |
| MIND | Traj.-cond. I2V/V2V | ~ | ✗ | ✗ | ✗ |
| WorldMark (Ours) | Interactive I2V | ✓ | ✓ | ✓ | ✓ |

(✓ = supported, ~ = partial, ✗ = not supported.)

Image Suite

The Image Suite contains 50 diverse reference images spanning three scene categories (Nature, City, Indoor), two visual styles (Real, Stylized), and two viewpoints (First-person, Third-person). The stylized subset covers oil painting, Ukiyo-e, cyberpunk, and Minecraft aesthetics, ensuring evaluation is not biased toward photorealistic domains alone. Pairing each scene across both viewpoints yields 100 test images in total.

Fig. 2: Overview of the Image Suite. Diverse scenes and styles are covered, each shown in first-person and generated third-person views.
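For illustration, the suite's taxonomy can be captured in a small metadata record; the schema below is an assumption for exposition, not the released data format:

```python
# Illustrative metadata schema for the Image Suite (field names are assumed).
from dataclasses import dataclass

@dataclass(frozen=True)
class SuiteImage:
    image_id: int    # 1..100
    scene: str       # "Nature" | "City" | "Indoor"
    style: str       # "Real" | "Stylized"
    viewpoint: str   # "First-person" | "Third-person"
    pair_id: int     # links the two viewpoint versions of the same scene

# One cross-viewpoint pair; 50 such pairs give the 100 test images.
fp = SuiteImage(1, "City", "Stylized", "First-person", pair_id=1)
tp = SuiteImage(2, "City", "Stylized", "Third-person", pair_id=1)
```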

Action Suite

The Action Suite comprises 15 standardized action sequences of increasing complexity, expressed in a shared vocabulary of WASD movement plus left/right (L/R) yaw rotation. A sketch of the sequence encoding follows Fig. 3 below.

Easy (20s, single-segment)

Single primitive motions that test basic action compliance. Actions 1–5: Forward, Backward, Strafe Left, Strafe Right, Pan Right.

Medium (40s, two-segment)

Smooth transitions between movement types. Actions 6–10: Forth & Back, Pan L & R, Strafe L & R, Walk+Pan, Reverse+Pan.

Hard (60s, three-segment)

Complex patrol routes and 360° panoramic rotations. Actions 11–15: Patrol, Sweep, Right Turn, Zigzag, Peek+Retract.

Fig. 3: The 15 standardized action sequences, ranging from elementary translations and rotations to combined and cyclic trajectories.
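Concretely, each sequence can be thought of as a list of (action, duration) segments. The sketch below encodes one example per tier; the segment contents are hedged illustrations consistent with the tier descriptions, not the suite's exact definitions:

```python
# Hedged sketch of the tiered sequence encoding; the segment contents are
# examples consistent with the tier descriptions, not the exact suite data.
Segment = tuple[str, float]   # (shared action token, duration in seconds)
Sequence = list[Segment]

EASY_FORWARD: Sequence = [("W", 20.0)]                           # 1 segment, 20 s
MEDIUM_FORTH_AND_BACK: Sequence = [("W", 20.0), ("S", 20.0)]     # 2 segments, 40 s
HARD_PATROL: Sequence = [("W", 20.0), ("R", 20.0), ("W", 20.0)]  # 3 segments, 60 s

def duration(seq: Sequence) -> float:
    """Total length of a sequence in seconds."""
    return sum(d for _, d in seq)

assert duration(EASY_FORWARD) == 20.0 and duration(HARD_PATROL) == 60.0
```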

Pairing the 15 sequences with the 100 test images gives 1,500 candidate cases; a VLM-based scene-aware filtering step keeps only the physically plausible action assignments for each image, yielding approximately 500 standardized evaluation cases.

Fig. 4: Context-aware action selection. A VLM analyzes the initial image to identify physical constraints and selects plausible action sequences from the predefined library.
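A minimal sketch of this filtering loop is given below; `vlm_is_plausible` is a hypothetical stand-in for the actual VLM query, whose prompt and backbone are not specified on this page:

```python
# Sketch of scene-aware case construction. `vlm_is_plausible` is a stand-in
# for the real VLM query; the prompt and model choice are assumptions.
def vlm_is_plausible(image_path: str, action_name: str) -> bool:
    """Ask a VLM whether the action is physically plausible from this image,
    e.g. reject 'Forward' when the camera faces a nearby wall."""
    raise NotImplementedError  # placeholder for an actual VLM call

def build_cases(images: list[str], actions: list[str]) -> list[tuple[str, str]]:
    """100 images x 15 actions = 1,500 candidates, filtered to ~500 cases."""
    return [(img, act)
            for img in images
            for act in actions
            if vlm_is_plausible(img, act)]
```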

Quantitative Results

WorldMark evaluates generated videos along three complementary dimensions with eight metrics (a sketch of the Control Alignment computation follows the list):

Visual Quality

  • Aesthetic Quality — human-perceived aesthetic appeal via LAION aesthetic predictor
  • Imaging Quality — low-level distortion detection via MUSIQ

Control Alignment

  • Translation Error — 3D camera pose via DROID-SLAM reconstruction
  • Rotation Error — geodesic angular deviation of heading

World Consistency

  • Reprojection Error — 3D spatial coherence via DROID-SLAM with dense bundle adjustment (DBA)
  • State Consistency — spatiotemporal object stability
  • Content Consistency — hallucination detection
  • Style Consistency — global visual uniformity
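
As one illustration, the two Control Alignment errors can be computed from camera poses estimated by a SLAM system against the commanded trajectory. The sketch below assumes poses that are already scale-aligned and expressed in a common reference frame; the released toolkit may differ in normalization details:

```python
# Minimal sketch of the two Control Alignment metrics. Assumes estimated and
# commanded poses are already scale-aligned and in a common reference frame.
import numpy as np

def translation_error(t_est: np.ndarray, t_gt: np.ndarray) -> float:
    """Mean Euclidean deviation between estimated and commanded positions.
    t_est, t_gt: (N, 3) arrays of camera positions along the trajectory."""
    return float(np.linalg.norm(t_est - t_gt, axis=-1).mean())

def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angular deviation between two 3x3 rotation matrices:
    theta = arccos((trace(R_est^T @ R_gt) - 1) / 2), returned in degrees."""
    cos_theta = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
```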

First-Person Real

| Metric | YUME 1.5 | MatrixGame 2.0 | HY-World | HY-Game | Oasis | Genie 3 |
|---|---|---|---|---|---|---|
| Visual Quality | | | | | | |
| Aesthetic Quality ↑ | 56.94 | 49.40 | 54.79 | 46.59 | 29.31 | 45.58 |
| Imaging Quality ↑ | 74.36 | 68.11 | 69.37 | 49.31 | 28.08 | 64.14 |
| Control Alignment | | | | | | |
| Translation Error ↓ | 0.199 | 0.222 | 0.191 | 0.159 | 0.376 | 0.498 |
| Rotation Error ↓ | 2.107 | 1.324 | 2.079 | 6.019 | 4.892 | 4.247 |
| World Consistency | | | | | | |
| Reprojection Error ↓ | 0.549 | 0.688 | 0.702 | 0.447 | 1.938 | 0.441 |
| State Consistency ↑ | 5.344 | 4.151 | 5.913 | 4.073 | 2.585 | 6.416 |
| Content Consistency ↑ | 3.820 | 7.415 | 6.352 | 5.814 | 3.748 | 6.914 |
| Style Consistency ↑ | 7.119 | 3.181 | 5.142 | 3.726 | 1.797 | 8.158 |

First-Person Stylized

| Metric | YUME 1.5 | MatrixGame 2.0 | HY-World | HY-Game | Oasis | Genie 3 |
|---|---|---|---|---|---|---|
| Visual Quality | | | | | | |
| Aesthetic Quality ↑ | 57.03 | 47.74 | 58.50 | 44.02 | 30.84 | 46.84 |
| Imaging Quality ↑ | 69.15 | 64.24 | 64.78 | 40.91 | 28.44 | 53.27 |
| Control Alignment | | | | | | |
| Translation Error ↓ | 0.223 | 0.182 | 0.244 | 0.116 | 0.350 | 0.261 |
| Rotation Error ↓ | 2.732 | 1.561 | 4.316 | 0.932 | 3.808 | 2.835 |
| World Consistency | | | | | | |
| Reprojection Error ↓ | 0.638 | 0.672 | 0.638 | 0.640 | 1.877 | 0.256 |
| State Consistency ↑ | 5.891 | 2.873 | 6.408 | 4.782 | 3.523 | 6.835 |
| Content Consistency ↑ | 5.362 | 6.457 | 7.159 | 5.196 | 3.114 | 7.306 |
| Style Consistency ↑ | 4.216 | 4.934 | 6.817 | 4.051 | 2.435 | 7.523 |

Third-Person

Only three models natively support third-person perspective: MatrixGame 2.0, HY-World, and Genie 3.

| Metric | MatrixGame 2.0 (Real) | HY-World (Real) | Genie 3 (Real) | MatrixGame 2.0 (Stylized) | HY-World (Stylized) | Genie 3 (Stylized) |
|---|---|---|---|---|---|---|
| Visual Quality | | | | | | |
| Aesthetic Quality ↑ | 52.78 | 57.69 | 51.04 | 51.60 | 60.57 | 53.76 |
| Imaging Quality ↑ | 67.26 | 70.76 | 60.20 | 65.24 | 66.45 | 63.98 |
| Control Alignment | | | | | | |
| Translation Error ↓ | 0.284 | 0.206 | 0.212 | 0.230 | 0.220 | 0.129 |
| Rotation Error ↓ | 27.606 | 2.137 | 14.905 | 9.211 | 5.285 | 8.823 |
| World Consistency | | | | | | |
| Reprojection Error ↓ | 0.814 | 0.640 | 0.584 | 0.744 | 0.713 | 1.148 |
| State Consistency ↑ | 5.136 | 6.628 | 7.082 | 3.625 | 5.274 | 7.565 |
| Content Consistency ↑ | 3.405 | 5.707 | 7.424 | 2.083 | 5.147 | 7.109 |
| Style Consistency ↑ | 1.659 | 4.491 | 8.247 | 2.942 | 7.236 | 8.541 |

Key Takeaways

Visual quality and world consistency are largely uncorrelated. YUME 1.5 produces the most appealing frames yet its worlds lack global coherence, while Genie 3 maintains the most consistent worlds with only moderate frame-level fidelity.

Strong control alignment does not imply overall quality. HY-Game follows commands precisely but at the cost of visual fidelity, whereas Genie 3 has higher trajectory error yet preserves a globally coherent world.

Third-person generation exposes severe weaknesses. MatrixGame 2.0's rotation error grows by roughly 20× from first- to third-person (1.324 → 27.606 on real scenes), highlighting the difficulty of maintaining camera control around a visible character.

Domain-specific training does not transfer. Oasis (Open-Oasis), trained on Minecraft, fails across all metrics on real-world and stylized scenes.

BibTeX

@article{worldmark2026,
  title={WorldMark: A Unified Benchmark Suite for Interactive Video World Models},
  author={Xu, Xiaojie and Lin, Zhengyuan and He, Kang and Feng, Yukang and Mao, Xiaofeng and Yin, Yuanyang and Zhang, Kaipeng and Ge, Yongtao},
  journal={arXiv preprint},
  year={2026}
}