In this paper, the authors address a key limitation of diffusion-based methods: editing existing objects in a video while preserving their appearance over time. They propose introducing temporal dependency into text-driven diffusion models so that edited objects are generated consistently across frames. To this end, they develop an inter-frame propagation mechanism that uses layered representations to carry appearance information from one frame to the next, and they build StableVideo, a text-driven video editing framework based on this mechanism that achieves consistency-aware video editing. Extensive experiments demonstrate the effectiveness of the approach and show that it outperforms existing methods.
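To make the layered-representation idea concrete, here is a minimal conceptual sketch (not the authors' code): it assumes each frame samples a shared foreground "atlas" via a per-frame UV map, so an edit applied once to that shared layer reappears identically in every reconstructed frame. All names (`atlas`, `uv_maps`, `edit_atlas`, `render_frame`) and the toy random UV maps are hypothetical placeholders for the learned layered decomposition and diffusion edit used in the paper.

```python
import numpy as np

H, W = 64, 64      # frame size
AH, AW = 128, 128  # atlas (shared layer) size

# A shared foreground atlas that every frame samples from.
atlas = np.random.rand(AH, AW, 3).astype(np.float32)

# Per-frame UV maps: each pixel stores integer (row, col) coordinates into the
# atlas. In practice these would come from a learned layered video decomposition;
# here they are simple random shifts for illustration.
rng = np.random.default_rng(0)
uv_maps = [
    np.stack(
        np.meshgrid(
            np.arange(H) + rng.integers(0, AH - H),
            np.arange(W) + rng.integers(0, AW - W),
            indexing="ij",
        ),
        axis=-1,
    )
    for _ in range(4)
]

def render_frame(atlas: np.ndarray, uv: np.ndarray) -> np.ndarray:
    """Reconstruct a frame by sampling the atlas at each pixel's UV coordinate."""
    return atlas[uv[..., 0], uv[..., 1]]

def edit_atlas(atlas: np.ndarray) -> np.ndarray:
    """Stand-in for a text-driven diffusion edit applied once to the shared layer."""
    edited = atlas.copy()
    edited[..., 0] = np.clip(edited[..., 0] * 1.5, 0.0, 1.0)  # e.g. recolor the object
    return edited

# Editing the shared atlas once and re-rendering every frame keeps the edited
# appearance identical across time, instead of re-editing each frame separately.
edited_atlas = edit_atlas(atlas)
edited_frames = [render_frame(edited_atlas, uv) for uv in uv_maps]
print(len(edited_frames), edited_frames[0].shape)  # 4 frames, each (64, 64, 3)
```

The sketch only illustrates why a shared layered representation yields temporal consistency; the paper's actual mechanism additionally propagates the diffusion-edited appearance between key frames before aggregating it into the layered representation.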
https://rese1f.github.io/StableVideo/