SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents

Yu Yang^*,1,2,3 Yue Liao^*,3 Jianbiao Mei^*,1,2 Baisen Wang^*,4 Xuemeng Yang² Licheng Wen² Jiangning Zhang^1,5
Xiangtai Li⁶ Liang Lv⁷ Hanlin Chen³ Botian Shi² Yong Liu^1,† Shuicheng Yan³ Gim Hee Lee³

¹Zhejiang University ²Shanghai AI Laboratory ³National University of Singapore
⁴Chinese Academy of Sciences ⁵Tencent Youtu Lab ⁶Nanyang Technological University ⁷Wuhan University

^*Equal contribution ^†Corresponding author

Abstract

Long-Horizon Action-Conditioned Video Generation: Challenges and Solution. (a) General TI2V is single-shot and open-loop, often causing incomplete actions and hallucinated motions. (b) We propose a closed-loop think-act-reflect framework for iterative planning, generation, and verification. (c) We introduce the ActVideoGen-Dataset and Benchmark for task-specific experiments. (d) Our closed-loop design enables self-evolution, continually improving video generation quality.

Long-horizon action-conditioned video generation aims to synthesize temporally coherent videos that follow complex action instructions over extended horizons. Existing single-shot video generation models typically operate in an open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. To address this, we propose SPIRAL, a closed-loop framework that performs Sequential Planning and Iterative Reflection for Action-conditioned Long-horizon video generation. Specifically, a PlanAgent decomposes a high-level goal into sub-actions that condition video generation, while a CriticAgent evaluates intermediate video segments and provides corrective feedback for iterative refinement. This closed-loop design also supports self-evolution: planning and verification signals are used for GRPO-based post-training to improve the video generator's consistency and action quality over extended horizons. Moreover, we introduce ActVideoGen-Dataset and ActVideoGen-Bench for training and evaluation. Experiments across multiple TI2V backbones with self-evolving post-training show consistent gains on ActVideoGen-Bench and VBench, demonstrating the effectiveness of SPIRAL.

Method Overview

SPIRAL Overview. (a) Closed-Loop Framework: PlanAgent decomposes abstract goals into atomic plans for action-conditioned video generation, while CriticAgent evaluates the generated videos and triggers dual-level (inner/outer) refinement feedback. (b) Self-Evolving via GRPO: guided by PlanAgent, VideoGenerator produces rollouts and is optimized using CriticAgent rewards.
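
To make the reward signal in (b) concrete, the sketch below shows the group-relative advantage that GRPO is generally built on: each rollout's reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration, not the SPIRAL training code. The function name, the `eps` constant, the use of the population standard deviation, and the example scores are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style).

    `eps` guards against division by zero when all rewards are equal;
    its value is an assumption, not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical CriticAgent scores for four rollouts of the same sub-action.
critic_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(critic_rewards)

for reward, adv in zip(critic_rewards, advantages):
    print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

Rollouts scored above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When every rollout in a group gets the same score, all advantages are zero and the group contributes no learning signal.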

PlanAgent

Decomposes a high-level goal and visual context into ordered, object-centric action plans with explicit pre-conditions and post-conditions for each generation step.
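
One way to picture such a plan is as an ordered list of records, each carrying its own pre-conditions and post-conditions. The sketch below is an illustration only: the `ActionStep` fields, the `unmet_preconditions` helper, and the condition strings are hypothetical and do not come from the SPIRAL implementation. The step names reuse the Boil Water example from the gallery.

```python
from dataclasses import dataclass, field


@dataclass
class ActionStep:
    """One object-centric step of a plan (hypothetical schema)."""
    index: int
    target_object: str
    action: str
    preconditions: list[str] = field(default_factory=list)
    postconditions: list[str] = field(default_factory=list)


def unmet_preconditions(plan, initial_state):
    """Walk the plan in order and return (step index, missing condition) pairs."""
    state = set(initial_state)
    problems = []
    for step in plan:
        for cond in step.preconditions:
            if cond not in state:
                problems.append((step.index, cond))
        # Each completed step makes its post-conditions true for later steps.
        state.update(step.postconditions)
    return problems


# Condition names below are invented for illustration.
plan = [
    ActionStep(1, "kettle", "Fill Water", ["kettle_lid_open"], ["kettle_filled"]),
    ActionStep(2, "kettle", "Close Lid", ["kettle_filled"], ["kettle_lid_closed"]),
    ActionStep(3, "kettle", "Place on Base", ["kettle_lid_closed"], ["kettle_on_base"]),
    ActionStep(4, "kettle", "Switch on Power", ["kettle_on_base"], ["kettle_heating"]),
]

print(unmet_preconditions(plan, {"kettle_lid_open"}))  # full plan is consistent
print(unmet_preconditions(plan[1:], set()))            # skipping step 1 breaks step 2
```

Chaining post-conditions into the next step's pre-conditions gives a simple consistency check: a plan that skips a step shows up as an unmet pre-condition before any video is generated.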

VideoGenerator

Synthesizes each video segment from the current sub-action and accumulated context, enabling long-horizon generation through step-wise controllable execution.

CriticAgent

Evaluates action-video alignment, detects local failures or global drift, and returns feedback that triggers refinement, regeneration, or replanning.
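
The mapping from a critic evaluation to one of these outcomes can be sketched as a small decision function. Everything in the example is assumed for illustration: the `Verdict` and `CriticReport` types, the 0.8 acceptance threshold, the retry limit, and the rule that a first failure triggers regeneration while later failures trigger instruction refinement. The description above names only the three outcomes, not the policy that selects among them.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REGENERATE = "regenerate"  # resample the segment with the same instruction
    REFINE = "refine"          # rewrite the sub-action instruction, then retry
    REPLAN = "replan"          # hand control back to the planner


@dataclass
class CriticReport:
    alignment: float    # assumed action-video alignment score in [0, 1]
    global_drift: bool  # True if the video no longer matches the overall goal
    note: str = ""


def decide(report, attempts, accept_at=0.8, max_attempts=3):
    """Map a critic report to the next action (hypothetical policy)."""
    if report.global_drift:
        return Verdict.REPLAN
    if report.alignment >= accept_at:
        return Verdict.ACCEPT
    if attempts >= max_attempts:
        # Local retries are exhausted, so escalate to replanning.
        return Verdict.REPLAN
    return Verdict.REGENERATE if attempts == 0 else Verdict.REFINE


print(decide(CriticReport(0.9, False), attempts=0))
print(decide(CriticReport(0.3, False, "lid still open"), attempts=1))
print(decide(CriticReport(0.3, True, "scene changed"), attempts=0))
```

Checking global drift first reflects the idea that a segment can look locally correct while the sequence as a whole has gone off course, in which case local retries would not help.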

Full Pipeline Demo

A side-by-side view of the agentic execution process and the final long-horizon video.

Full Pipeline Process

PlanAgent, VideoGenerator, and CriticAgent collaborate through planning, generation, verification, and refinement.

Long-Horizon Video Result

The final generated video composed from verified step-wise segments.

End-to-end pipeline of SPIRAL. Given a user goal, PlanAgent decomposes the task into step-wise actions, VideoGenerator synthesizes each segment, and CriticAgent verifies alignment before final long-horizon composition.

Feedback Refinement Demo

A side-by-side view of CriticAgent-triggered local refinement and the corrected video result.

Local Refinement Trigger

CriticAgent detects an execution issue and triggers local refinement to regenerate and optimize the video segment.

Refined Video Result

The corrected result after local refinement regenerates the problematic step.

Closed-loop feedback refinement. When CriticAgent detects a local failure, SPIRAL refines the action instruction, regenerates a corrected segment, and continues the procedure without propagating the error.
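
The control flow of this refinement loop can be sketched with stand-in components. This is a toy illustration under stated assumptions, not the SPIRAL code: `run_closed_loop`, the three stub functions, the retry limit, and the string "segments" are all hypothetical. The step names come from the Wash Clothes example in the gallery, and the stub critic is hard-coded to reject the first attempt at "Close Lid".

```python
def run_closed_loop(steps, generate, critique, refine, max_retries=2):
    """Generate one verified segment per step, retrying locally on failure."""
    context, log = [], []
    for step in steps:
        instruction = step
        for attempt in range(max_retries + 1):
            segment = generate(instruction, context)
            ok, feedback = critique(step, segment)
            log.append((step, attempt, ok))
            if ok:
                # Only verified segments are carried forward as context.
                context.append(segment)
                break
            instruction = refine(instruction, feedback)
        else:
            # Local retries exhausted; a full system would replan here.
            raise RuntimeError(f"step {step!r} failed after {max_retries + 1} attempts")
    return context, log


# Stand-in components (hypothetical; real ones would be models).
def generate(instruction, context):
    return f"segment[{instruction}]"


def critique(step, segment):
    if step == "Close Lid" and "until" not in segment:
        return False, "lid is still open at the end of the clip"
    return True, ""


def refine(instruction, feedback):
    return f"{instruction} until it clicks shut"


steps = ["Place Clothes", "Add Detergent", "Close Lid", "Turn on Power"]
segments, log = run_closed_loop(steps, generate, critique, refine)
for entry in log:
    print(entry)
```

Because a rejected segment is never appended to `context`, the next step is conditioned only on verified output, which is the sense in which errors are not propagated.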

Results Gallery

Egocentric Behaviour Generation

Wash Clothes

Step 1: Place Clothes→Step 2: Add Detergent→Step 3: Close Lid→Step 4: Turn on Power

Wash Pot

Step 1: Add Detergent→Step 2: Scrub Pad→Step 3: Rinse Water→Step 4: Wipe Dry

Boil Water

Step 1: Fill Water→Step 2: Close Lid→Step 3: Place on Base→Step 4: Switch on Power

Exocentric Human Kinematics

Weight Lifting

Step 1: Squat & Grip→Step 2: Lift to Chest→Step 3: Push Overhead

Long Jump

Step 1: Accelerate→Step 2: Jump Takeoff & Flight→Step 3: Landing

Basketball Dunk

Closed-Loop SPIRAL (Ours)

Step 1: Spray Cleaner→Step 2: Wipe with a Cloth

Ultra Long-Horizon

Tomato Preparation & Storage

Open-Loop Baseline

Step 1: Open the Refrigerator→Step 2: Take Out Tomatoes→Step 3: Wash the Tomatoes (Physical Violation)→Step 4: Cut the Tomatoes (Physical Violation)→Step 5: Seal in a Bag (Incomplete Action)

Closed-Loop SPIRAL (Ours)

Step 1: Open the Refrigerator→Step 2: Take Out Tomatoes→Step 3: Close the Right Door→Step 4: Close the Left Door→Step 5: Place on the Cutting Board→Step 6: Wash the Tomatoes→Step 7: Cut the Tomatoes→Step 8: Seal in a Bag

Make Tomato & Cucumber Salad

Open-Loop Baseline

Step 1: Slice Tomatoes & Cucumbers (Missing Actions)→Step 2: Place Tomatoes & Cucumbers in Bowl (Incomplete Action)→Step 3: Get and Pour Salad Dressing (Physical Violation)

Closed-Loop SPIRAL (Ours)

Step 1: Slice Cucumbers→Step 2: Place Cucumbers in Plate→Step 3: Slice Tomatoes→Step 4: Get Salad Bowl→Step 5: Place Tomatoes in Bowl→Step 6: Place Cucumbers in Bowl→Step 7: Get Salad Dressing→Step 8: Pour the Dressing→Step 9: Toss with Spoon

Citation

@article{yang2026spiral,
  title   = {SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents},
  author  = {Yang, Yu and Liao, Yue and Mei, Jianbiao and Wang, Baisen and Yang, Xuemeng and Wen, Licheng and Zhang, Jiangning and Li, Xiangtai and Lv, Liang and Chen, Hanlin and Shi, Botian and Liu, Yong and Yan, Shuicheng and Lee, Gim Hee},
  journal = {arXiv preprint arXiv:2603.08403},
  year    = {2026}
}