Zero-Shot Video Editing through Adaptive Sliding Score Distillation

Abstract

Rapid progress in text-to-video (T2V) generation has renewed interest in controllable video editing. Although editing models built on pre-trained T2V models can edit efficiently, current works still face two major challenges. First, the inherent limitations of T2V models lead to content inconsistencies and motion discontinuities between frames. Second, over-editing disrupts areas that are intended to remain unaltered. To address these challenges, we explore a robust video editing paradigm based on score distillation. Specifically, we propose an Adaptive Sliding Score Distillation (ASSD) strategy, which not only enhances the stability of T2V supervision but also incorporates both global and local video guidance to mitigate the impact of generation errors. Additionally, we modify the self-attention layers during the editing process to further preserve the key features of the original video. Extensive experiments demonstrate that these strategies address both challenges and achieve superior editing performance compared to existing state-of-the-art methods.

Overview

Overview diagram

Our pipeline includes an upper reference branch and a lower editing branch. We use ASSD for T2V models and DDS for T2I models to jointly derive the gradients for updating the latent code of the video. Additionally, we modify the computation in the spatial self-attention layers of the editing branch to incorporate information from the reference branch.
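As a rough illustration of how the two gradient sources might be combined, the sketch below computes a DDS-style gradient (the difference between the noise predicted for the editing branch with the target prompt and for the reference branch with the source prompt, using the same noise for both) and applies a weighted update to the latent code. The names `toy_eps_model`, `update_latent`, `w_t2v`, `w_t2i`, and `lr` are hypothetical, the denoiser is a stand-in rather than a real T2I model, and the weighting scheme is an assumption, not the one used in this work.

```python
import numpy as np

def toy_eps_model(z_t, emb):
    """Hypothetical stand-in for a pre-trained denoiser's noise prediction."""
    return 0.5 * z_t + emb

def dds_gradient(eps_model, z_edit, z_ref, emb_tgt, emb_src, alpha_bar_t, noise):
    """DDS-style gradient: edit-branch prediction minus reference-branch prediction.

    Both branches are noised with the same `noise` at the same timestep, so
    prediction errors shared by the two branches cancel in the difference.
    """
    zt_edit = np.sqrt(alpha_bar_t) * z_edit + np.sqrt(1.0 - alpha_bar_t) * noise
    zt_ref = np.sqrt(alpha_bar_t) * z_ref + np.sqrt(1.0 - alpha_bar_t) * noise
    return eps_model(zt_edit, emb_tgt) - eps_model(zt_ref, emb_src)

def update_latent(z_edit, grad_t2v, grad_t2i, lr=0.1, w_t2v=1.0, w_t2i=1.0):
    """One gradient step on the video latent (weights are assumed, not from the source)."""
    return z_edit - lr * (w_t2v * grad_t2v + w_t2i * grad_t2i)

rng = np.random.default_rng(0)
z_ref = rng.standard_normal((8, 4, 16, 16))   # frames, channels, height, width
z_edit = z_ref.copy()                          # editing starts from the source latent
noise = rng.standard_normal(z_ref.shape)

# Same latent and same prompt embedding: the gradient vanishes.
g_same = dds_gradient(toy_eps_model, z_edit, z_ref, 0.0, 0.0, 0.5, noise)
# Different prompt embeddings: the gradient points along the prompt change.
g_edit = dds_gradient(toy_eps_model, z_edit, z_ref, 1.0, 0.0, 0.5, noise)
z_next = update_latent(z_edit, grad_t2v=np.zeros_like(g_edit), grad_t2i=g_edit)
```

In this toy setup the gradient is zero when the two branches see the same latent and prompt, which is the property that lets a delta-style score leave unedited content alone.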

Overview diagram

ASSD adaptively extracts masks for regions with significant gradient changes based on the temporal variation of the update gradients. It then employs a sliding window-based smoothing operation to mitigate errors in the noise predictions from the reference branch.
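The two steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mask is taken as a quantile threshold on the frame-to-frame difference of the gradient, and the smoothing is a plain mean over a window of neighbouring frames. The function names and the `quantile` and `window` parameters are hypothetical.

```python
import numpy as np

def temporal_variation_mask(grad, quantile=0.8):
    """Mark locations whose gradient changes strongly between adjacent frames.

    grad: array of shape (F, C, H, W) with F >= 2 frames.
    Returns a boolean mask of shape (F, 1, H, W).
    """
    # Absolute frame-to-frame difference, averaged over channels: (F-1, 1, H, W).
    diff = np.abs(np.diff(grad, axis=0)).mean(axis=1, keepdims=True)
    # Repeat the last difference so that every frame has a variation score.
    diff = np.concatenate([diff, diff[-1:]], axis=0)
    # Assumed rule: keep the locations above the given quantile.
    return diff > np.quantile(diff, quantile)

def sliding_window_smooth(x, window=3):
    """Average each frame with its neighbours inside a sliding temporal window."""
    num_frames = x.shape[0]
    half = window // 2
    out = np.empty_like(x)
    for f in range(num_frames):
        lo, hi = max(0, f - half), min(num_frames, f + half + 1)
        out[f] = x[lo:hi].mean(axis=0)
    return out

def smooth_masked(eps_ref, grad, window=3, quantile=0.8):
    """Replace the reference noise prediction with its smoothed version inside the mask."""
    mask = temporal_variation_mask(grad, quantile)
    smoothed = sliding_window_smooth(eps_ref, window)
    return np.where(mask, smoothed, eps_ref)

rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 4, 16, 16))      # update gradients per frame
eps_ref = rng.standard_normal((8, 4, 16, 16))   # reference-branch noise predictions
eps_smoothed = smooth_masked(eps_ref, grad)
```

Outside the mask the reference prediction is left untouched, so the smoothing only affects regions where the gradient varies strongly over time.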

Video Editing Using Our ASSD

American ➜ Canadian

horse ➜ turtle

brown bear ➜ tiger

ship ➜ sailboat

red ➜ pink

man ➜ rabbit

swans ➜ flamingos

ski lift chairlifts ➜ gondola lifts

rabbit ➜ dog

Comparisons

Source

Ours

Tune-A-Video

ControlVideo

FateZero

Flatten

TokenFlow

ship ➜ sailboat

Source

Ours

Tune-A-Video

ControlVideo

FateZero

Flatten

TokenFlow

rabbit ➜ dog

Source

Ours

Tune-A-Video

ControlVideo

FateZero

Flatten

TokenFlow

man ➜ rabbit

One-Step Pred \( Z_0 \)

Figure: one-step predicted \( Z_0 \) of the reference branch and the editing branch at timestep=250 and timestep=750, shown for Stable Diffusion, ModelscopeT2V, and Zeroscope.

Score distillation uses the one-step noise prediction of the diffusion model to compute the update gradients for the original video. However, as the figure above shows, text-to-video models such as ModelscopeT2V and Zeroscope exhibit much larger errors in the one-step noise prediction at larger timesteps than Stable Diffusion does. (We visualize \( Z_0 \), which is computed from the one-step predicted noise, because it is easier to interpret than the noise itself.) These errors make it difficult to transfer score distillation methods designed for image editing directly to video editing.
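For readers unfamiliar with how \( Z_0 \) is obtained from a noise prediction, the sketch below uses the standard DDPM relation z_t = sqrt(ᾱ_t) z_0 + sqrt(1 − ᾱ_t) ε and inverts it. The linear beta schedule and the function names are assumptions for illustration; the actual models use their own schedules. The sketch also shows that an error in the predicted noise is scaled by sqrt(1 − ᾱ_t) / sqrt(ᾱ_t), which grows with the timestep and is one reason errors are more visible at timestep=750 than at timestep=250.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def predict_z0(z_t, eps_pred, alpha_bar_t):
    """One-step estimate of z0 from a noise prediction (inverse of add_noise)."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def noise_error_gain(alpha_bar_t):
    """Factor by which an error in eps_pred is scaled in the z0 estimate."""
    return np.sqrt(1.0 - alpha_bar_t) / np.sqrt(alpha_bar_t)

alphas_cumprod = make_alphas_cumprod()
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(z0.shape)

z_t = add_noise(z0, eps, alphas_cumprod[750])
z0_hat = predict_z0(z_t, eps, alphas_cumprod[750])   # exact noise recovers z0

gain_250 = noise_error_gain(alphas_cumprod[250])
gain_750 = noise_error_gain(alphas_cumprod[750])     # larger than gain_250
```

With a perfect noise prediction the estimate recovers z0 exactly; any prediction error is amplified more strongly at larger timesteps.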

Optimization Progress
