The burgeoning field of text-to-video (T2V) generation has renewed interest in controllable video editing. Although editing methods built on pre-trained T2V models are efficient, they still face two major challenges. First, the inherent limitations of T2V models lead to content inconsistencies and motion discontinuities between frames. Second, over-editing disrupts regions that are intended to remain unaltered. To address these challenges, we explore a robust video editing paradigm based on score distillation. Specifically, we propose an Adaptive Sliding Score Distillation (ASSD) strategy, which not only enhances the stability of T2V supervision but also incorporates both global and local video guidance to mitigate the impact of generation errors. Additionally, we modify the self-attention layers during editing to further preserve the key features of the original video. Extensive experiments demonstrate that these strategies effectively address both challenges and achieve superior editing performance compared with existing state-of-the-art methods.
Our pipeline includes an upper reference branch and a lower editing branch. We use ASSD for T2V models and DDS for T2I models to jointly derive the gradients for updating the latent code of the video. Additionally, we modify the computation in the spatial self-attention layers of the editing branch to incorporate information from the reference branch.
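The joint update described above can be illustrated with a minimal NumPy sketch. It assumes a DDS-style gradient, taken here as the difference between the editing-branch and reference-branch noise predictions, and a simple weighted sum of the T2V and T2I gradients. The function names, the weights `w_t2v` and `w_t2i`, the step size `lr`, and the random stand-ins for noise predictions are all hypothetical and are not taken from the paper.

```python
import numpy as np

def dds_style_gradient(eps_edit, eps_ref):
    """Delta-style gradient: editing-branch prediction minus reference-branch prediction."""
    return eps_edit - eps_ref

def joint_update(z, grad_t2v, grad_t2i, w_t2v=1.0, w_t2i=1.0, lr=0.1):
    """One descent step on the video latent using a weighted sum of both gradients.

    The weights and learning rate are illustrative assumptions.
    """
    return z - lr * (w_t2v * grad_t2v + w_t2i * grad_t2i)

# Toy video latent with shape (frames, channels, height, width).
rng = np.random.default_rng(0)
z = rng.standard_normal((8, 4, 16, 16))

# Random stand-ins for T2I noise predictions; a real pipeline would query a diffusion model.
eps_ref_t2i = rng.standard_normal(z.shape)
eps_edit_t2i = eps_ref_t2i + 0.1

grad_t2i = dds_style_gradient(eps_edit_t2i, eps_ref_t2i)
grad_t2v = np.zeros_like(z)  # placeholder for the T2V (ASSD) gradient

z_new = joint_update(z, grad_t2v, grad_t2i)
```

How the two gradients are actually weighted or scheduled is not specified in this section, so the plain weighted sum should be read only as one plausible choice.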
ASSD adaptively extracts masks for regions with significant gradient changes based on the temporal variation of the update gradients. It then employs a sliding window-based smoothing operation to mitigate errors in the noise predictions from the reference branch.
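As a rough illustration of the two steps described above, the following NumPy sketch builds a spatial mask from the temporal variation of a gradient and applies a sliding-window moving average to the reference-branch noise predictions. Several choices here are assumptions rather than the paper's definitions: the variation measure (mean absolute frame-to-frame difference), the quantile threshold, the window size, and the decision to use the smoothed reference prediction only inside the mask. All function names are hypothetical.

```python
import numpy as np

def temporal_variation_mask(grad, quantile=0.8):
    """Boolean (H, W) mask of pixels whose gradient varies most across frames.

    grad has shape (frames, channels, height, width). The quantile threshold
    is an assumed stand-in for the adaptive criterion.
    """
    variation = np.abs(np.diff(grad, axis=0)).mean(axis=(0, 1))
    return variation > np.quantile(variation, quantile)

def sliding_window_smooth(eps, window=3):
    """Moving average along the frame axis, with edge padding (odd window assumed)."""
    pad = window // 2
    padded = np.pad(eps, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    return np.stack([padded[i:i + window].mean(axis=0) for i in range(eps.shape[0])])

def assd_style_gradient(eps_edit, eps_ref, quantile=0.8, window=3):
    """Gradient that uses a temporally smoothed reference prediction inside the mask."""
    raw_grad = eps_edit - eps_ref
    mask = temporal_variation_mask(raw_grad, quantile)
    eps_ref_smooth = sliding_window_smooth(eps_ref, window)
    # The (H, W) mask broadcasts over the frame and channel axes.
    eps_ref_used = np.where(mask, eps_ref_smooth, eps_ref)
    return eps_edit - eps_ref_used

# Toy example with random stand-ins for the two branches' noise predictions.
rng = np.random.default_rng(0)
eps_ref = rng.standard_normal((8, 4, 16, 16))
eps_edit = rng.standard_normal((8, 4, 16, 16))
grad = assd_style_gradient(eps_edit, eps_ref)
```

The sketch keeps the mask purely spatial for simplicity; a per-frame mask would be an equally plausible reading of the description.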
[Figure: example edits (source ➜ target): American ➜ Canadian; horse ➜ turtle; brown bear ➜ tiger; ship ➜ sailboat; red ➜ pink; man ➜ rabbit; swans ➜ flamingos; ski lift chairlifts ➜ gondola lifts; rabbit ➜ dog.]
[Figure: qualitative comparison on three edits (Ship ➜ Sailboat, Rabbit ➜ Dog, Man ➜ Rabbit). Panels for each edit: Source, Ours, Tune-A-Video, ControlVideo, FateZero, Flatten, TokenFlow.]
[Figure: predicted \( Z_0 \) of Stable Diffusion, ModelscopeT2V, and Zeroscope, each shown for the reference (Ref) branch and the editing (Edit) branch at timesteps 250 and 750.]