Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer

Abstract

This work presents a training-free, text-driven approach to 4D scene editing, i.e., modifying dynamic 3D scenes over time. By leveraging a multimodal diffusion transformer, the method produces temporally consistent edits from user instructions. Unlike conventional image editing methods, it extends controllable generation into the spatiotemporal domain, offering both flexibility and precision in dynamic scene manipulation.

Publication
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026