This work addresses 4D scene editing with a training-free, text-driven approach for modifying dynamic scenes. Leveraging a multimodal diffusion transformer, the method produces temporally consistent edits from user instructions. Unlike conventional image editing methods, which operate on individual frames, it extends controllable generation into the spatiotemporal domain, offering both flexibility and precision in dynamic scene manipulation.