CineScene
Implicit 3D as Effective Scene Representation for Cinematic Video Generation

1 The University of Hong Kong | 2 Kling Team, Kuaishou Technology | 3 Tsinghua University | 4 Zhejiang University | 5 Microsoft

Our Method

Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way.

Compared with other implicit conditioning methods (e.g., loss-guided approaches), ours:

(1) Decouples the static background from the dynamic foreground, enabling vivid motion;
(2) Aligns better with the diffusion paradigm via token concatenation (see the sketch below).
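
To make the token-concatenation idea concrete, below is a minimal PyTorch sketch of in-context conditioning. The class name ContextConcatDiT, the use of nn.TransformerEncoderLayer, and all shapes are our own illustrative assumptions, not the paper's implementation; the point is only that scene tokens join the noisy video tokens in the attention sequence, instead of steering generation through an auxiliary loss.

import torch
import torch.nn as nn

class ContextConcatDiT(nn.Module):
    """Illustrative diffusion backbone conditioned by token concatenation."""

    def __init__(self, dim: int = 512, depth: int = 4, heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, video_tokens: torch.Tensor,
                scene_tokens: torch.Tensor) -> torch.Tensor:
        # Scene-context tokens are prepended to the noisy video tokens along
        # the sequence axis, so self-attention mixes the two streams directly.
        n_video = video_tokens.shape[1]
        x = torch.cat([scene_tokens, video_tokens], dim=1)
        for block in self.blocks:
            x = block(x)
        # Only the video positions are read out for the denoising prediction;
        # the scene tokens serve purely as in-context conditioning.
        return x[:, -n_video:]

# Usage with dummy shapes: 8 scene tokens conditioning 16 video tokens.
model = ContextConcatDiT()
out = model(torch.randn(2, 16, 512), torch.randn(2, 8, 512))
assert out.shape == (2, 16, 512)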
Method Architecture

We first extract 3D-aware scene features with VGGT and construct an implicit scene representation by fusing content features with their corresponding camera-viewpoint features.
We then inject the implicit scene representation, together with the scene context images, into the diffusion model via context conditioning, alongside the camera trajectory and text prompt.
We also introduce a simple but effective random-shuffling strategy for the scene images during training, which further strengthens the alignment between the scene images and their implicit 3D encoding.
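
The following PyTorch sketch illustrates these two ingredients under stated assumptions: a module that fuses content features with camera-viewpoint features into scene tokens, and the training-time random shuffling of scene-image order. The backbone stand-in, the fusion MLP, and every shape are hypothetical; VGGT's real interface differs.

import torch
import torch.nn as nn

class ImplicitSceneEncoder(nn.Module):
    """Fuses per-image content features with camera-viewpoint features."""

    def __init__(self, backbone: nn.Module, feat_dim: int, out_dim: int):
        super().__init__()
        self.backbone = backbone  # stand-in for a frozen VGGT encoder
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, scene_images: torch.Tensor) -> torch.Tensor:
        # Hypothetical backbone output: content tokens and matching
        # camera-viewpoint tokens, each of shape (B, N_images, T, feat_dim).
        content, viewpoint = self.backbone(scene_images)
        fused = self.fuse(torch.cat([content, viewpoint], dim=-1))
        return fused.flatten(1, 2)  # scene tokens: (B, N_images * T, out_dim)

def shuffle_scene_images(scene_images: torch.Tensor) -> torch.Tensor:
    """Training-time augmentation: permute the order of the N scene images,
    of shape (B, N, C, H, W), so the model cannot rely on input ordering and
    must bind each image to its implicit 3D encoding."""
    perm = torch.randperm(scene_images.shape[1], device=scene_images.device)
    return scene_images[:, perm]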

Dataset Overview

We render the Scene-Decoupled Video Dataset in Unreal Engine 5 (UE5), producing 46K video-scene image pairs captured in 35 high-quality 3D environments with diverse camera trajectories.
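
For concreteness, a hypothetical schema for one dataset record is sketched below, based on the paired videos, panoramas, and trajectories described in the abstract; the field names are ours and may not match the released format.

from dataclasses import dataclass

@dataclass
class SceneDecoupledSample:
    """One hypothetical record of the UE5-rendered dataset."""
    video_with_subject: str   # clip containing dynamic subjects
    video_scene_only: str     # paired clip of the same static scene
    scene_panorama: str       # panoramic image of the underlying environment
    camera_trajectory: str    # per-frame camera poses for the clip
    text_prompt: str          # caption describing subjects and scene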


Comparison: Dynamic Scene

Prompt: A person is standing in front of a brick wall.
[Video comparison: Trajectory | Scene (GT) | FramePack | CaM | Gen3C | CineScene (Ours)]

Prompt: A person is walking through a modern, well-lit room with a yellow wall and a column.
[Video comparison: Trajectory | Scene (GT) | FramePack | CaM | Gen3C | CineScene (Ours)]

Comparison: Static Scene

Prompt: A train moving along a set of tracks, passing by a large, rusty metal gate.
[Video comparison: Trajectory | Scene (GT) | FramePack | CaM | Gen3C | CineScene (Ours)]

Comparison: Camera Control

Prompt: A gradual transition from a view of a fire extinguisher mounted on a wooden door to a signage.
[Video comparison: Trajectory | Scene (GT) | Traj-Attn | ReCamMaster | CineScene (Ours)]

Prompt: A vibrant, animated cityscape with a focus on a series of buildings connected by a network of cables.
[Video comparison: Trajectory | Scene (GT) | Traj-Attn | ReCamMaster | CineScene (Ours)]

Out-of-Domain Results

Prompt: The video showcases a grand, ornate church interior with a long, polished wooden table at the front.
[Video grid: Trajectory | Scene (GT) | + a historian with detailed, wrinkled skin | + a woman wearing a delicate lace veil | + a priest in an intricately embroidered robe]

Prompt: The video showcases a cozy and well-decorated living room with a warm and inviting atmosphere.
[Video grid: Trajectory | Scene (GT) | + a small, fluffy Pomeranian dog | + a man with a well-groomed beard | + a beautiful white cockatoo]

Note: Since this Out-of-Domain dataset (360-DiT) is designed for panoramic content and lacks scene ground truth for translational motion, we specifically demonstrate results for panning camera trajectories in this section.

Introduction

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need to construct physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subjects while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages an implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model via additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, and their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and generalizing across diverse environments.

Frequently Asked Questions

Q: Why do you use the term "cinematic"?
A: We use the term "cinematic" for the "blocking" scenario in the filmmaking process, which refers to how directors compose characters and cameras within a scene [A]. This corresponds to our focus on dynamic subjects, camera movements, and scenes.
[A] Wachirabunjong, M. S., and Raksasad, U. Analyzing the Significance of Blocking and Its History. Thammasat University, Journalism and Mass Communication, 2022.

Q: What are the required inputs? Are panoramas needed?
A: User inputs are multiple scene images captured around a point, ensuring practical feasibility. Panoramas are not required as inputs; they only serve as the data source from which the multiple scene images are drawn.

Q: How does CineScene generate dynamic subjects?
A: (1) Our method leverages the pretrained text-to-video diffusion model's inherent ability to generate dynamic subjects from text prompts. (2) We show that our in-context conditioning mechanism for injecting implicit 3D features effectively preserves the base model's motion dynamics, compared to the loss-guided method. (3) As discussed in the "Limitations" section of our paper, the base model still struggles to generate large human motion, a common challenge for current text-to-video generation models; hence we use more static prompts (e.g., "standing still") in our demo. The demo does, however, include dynamics such as "walking" prompts and animal motion.

Q: How does your dataset differ from the dataset of ReCamMaster [B]?
A: (1) Decoupled scenes: we provide a 360° static panoramic image for each video for environment conditioning, whereas [B] lacks decoupled scenes. (2) Wider camera range: we provide 75° view changes vs. 5-60° in [B]. (3) Paired data: we provide videos both with and without dynamic subjects for each location, while [B] only offers unpaired videos.
[B] Bai, J., Xia, M., Fu, X., et al. ReCamMaster: Camera-Controlled Generative Rendering from a Single Video. arXiv preprint arXiv:2503.11647, 2025.

Citation

@misc{huang2026cinesceneimplicit3deffective,
      title={CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation}, 
      author={Kaiyi Huang and Yukun Huang and Yu Li and Jianhong Bai and Xintao Wang and Zinan Lin and Xuefei Ning and Jiwen Yu and Pengfei Wan and Yu Wang and Xihui Liu},
      year={2026},
      eprint={2602.06959},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.06959}, 
}