Paper Title
Text-Driven Stylization of Video Objects
Paper Authors
Paper Abstract
We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task, as the resulting video must satisfy multiple properties: (1) it has to be temporally consistent and avoid jittering or similar artifacts, (2) the resulting stylization must preserve both the global semantics of the object and its fine-grained details, and (3) it must adhere to the user-specified text prompt. To this end, our method stylizes an object in a video according to two target texts: the first target text describes the global semantics, and the second describes the local semantics. To modify the style of an object, we harness the representational power of CLIP to obtain a similarity score between (1) the local target text and a set of stylized local views, and (2) the global target text and a set of stylized global views. We use a pretrained atlas decomposition network to propagate the edits in a temporally consistent manner. We demonstrate that our method can generate, for a variety of objects and videos, style changes that are consistent over time and adhere to the specification of the target texts. We also show how varying the specificity of the target texts and augmenting the texts with a set of prefixes results in stylizations with different levels of detail. Full results are given on our project webpage: https://sloeschcke.github.io/Text-Driven-Stylization-of-Video-Objects/
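To make the CLIP-based scoring described above concrete, the sketch below shows one plausible way to measure how well a set of stylized views matches a target text. This is not the authors' code: it assumes the official `clip` package, and the stylized local/global views (which in the paper would come from the atlas-based rendering) are stood in by placeholder tensors; the two prompts are likewise hypothetical examples.

```python
# Hedged sketch: CLIP similarity between stylized views and the two target texts.
# Assumes the official CLIP package (https://github.com/openai/CLIP) and PyTorch.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_text_similarity(views: torch.Tensor, text: str) -> torch.Tensor:
    """Mean cosine similarity between a batch of view images and one text prompt.

    `views` is expected as an (N, 3, 224, 224) tensor already normalized for CLIP.
    """
    image_feats = model.encode_image(views)
    text_feats = model.encode_text(clip.tokenize([text]).to(device))
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feats.T).mean()

# Placeholder stand-ins for the stylized views; in the method these would be
# crops around the object (local) and full-frame renderings (global).
local_views = torch.randn(8, 3, 224, 224, device=device)
global_views = torch.randn(4, 3, 224, 224, device=device)

# Hypothetical example prompts for the local (fine-grained) and global semantics.
local_score = clip_text_similarity(local_views, "shiny metallic scales")
global_score = clip_text_similarity(global_views, "a robotic swan")

# A stylization network would be optimized to maximize a combination of both scores.
loss = -(local_score + global_score)
```

In practice the two scores could be weighted differently depending on how strongly the fine-grained versus global prompt should drive the stylization; the abstract does not specify the weighting, so an equal-weight sum is used here purely for illustration.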