Paper Title
Fine-tuned CLIP Models are Efficient Video Learners
Paper Authors
Paper Abstract
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which requires meticulous design effort. Furthermore, when the resulting models are learned on videos, they tend to overfit to the given task distribution and lack generalization. This raises the following question: How can image-level CLIP representations be effectively transferred to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, helps implicitly model temporal cues within ViFi-CLIP. Such fine-tuning helps the model focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a 'bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot, and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.
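The recipe described in the abstract (frame-level CLIP encoding, temporal feature pooling, and similarity matching with text embeddings) can be illustrated with a minimal sketch. This assumes the Hugging Face `transformers` CLIP implementation; the `VideoCLIPBaseline` wrapper, the prompt template, and the pooling choice are illustrative placeholders, not the authors' released code.

```python
# Minimal sketch: frame-wise CLIP encoding -> temporal average pooling -> text matching.
# Assumes the Hugging Face CLIP checkpoint below; wrapper and prompts are hypothetical.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor


class VideoCLIPBaseline(torch.nn.Module):
    """Encode each frame with the CLIP image encoder, pool over time,
    and score the pooled video feature against class-prompt text embeddings."""

    def __init__(self, name="openai/clip-vit-base-patch16"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(name)
        self.processor = CLIPProcessor.from_pretrained(name)

    def forward(self, frames, class_names):
        # frames: list of T PIL images sampled from one video clip
        image_inputs = self.processor(images=frames, return_tensors="pt")
        frame_feats = self.clip.get_image_features(**image_inputs)      # (T, D)
        frame_feats = F.normalize(frame_feats, dim=-1)
        video_feat = frame_feats.mean(dim=0, keepdim=True)              # temporal pooling -> (1, D)
        video_feat = F.normalize(video_feat, dim=-1)

        # Class prompts (template is an illustrative assumption)
        prompts = [f"a video of a person {c}" for c in class_names]
        text_inputs = self.processor(text=prompts, return_tensors="pt", padding=True)
        text_feats = F.normalize(self.clip.get_text_features(**text_inputs), dim=-1)

        # Cosine similarity between the pooled video feature and each class prompt
        logits = video_feat @ text_feats.t() * self.clip.logit_scale.exp()
        return logits.softmax(dim=-1)                                    # (1, num_classes)
```

At inference, the predicted action is simply the class prompt with the highest similarity to the pooled video feature; fine-tuning this same pathway end-to-end on video data is what the abstract refers to as ViFi-CLIP.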