Paper Title
Finetune like you pretrain: Improved finetuning of zero-shot vision models
Paper Authors
Paper Abstract
Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiseFT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shifts, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
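Below is a minimal sketch of the contrastive finetuning idea described in the abstract: class labels are cast as text prompts and the CLIP-style symmetric contrastive loss between image and prompt embeddings is optimized. It assumes a CLIP-like model exposing `encode_image`, `encode_text`, and a learned `logit_scale` (as in the open_clip library); the function name, prompt template, and surrounding training loop are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of one FLYP-style contrastive finetuning step (assumed open_clip-like API).
import torch
import torch.nn.functional as F

def contrastive_finetune_step(model, tokenizer, images, labels, class_names,
                              optimizer, device):
    """Cast downstream class labels as text prompts and minimize the symmetric
    image-text contrastive loss, mirroring the pretraining objective."""
    model.train()

    # Build class-descriptive prompts for the labels in this batch
    # (the prompt template here is a common choice, used for illustration).
    prompts = [f"a photo of a {class_names[y]}" for y in labels]
    text_tokens = tokenizer(prompts).to(device)
    images = images.to(device)

    # Encode and L2-normalize both modalities.
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(text_tokens), dim=-1)

    # Pairwise cosine-similarity logits, scaled by the learned temperature.
    logit_scale = model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Symmetric cross-entropy: matched image/prompt pairs lie on the diagonal.
    targets = torch.arange(len(labels), device=device)
    loss = 0.5 * (F.cross_entropy(logits_per_image, targets)
                  + F.cross_entropy(logits_per_text, targets))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design choice this illustrates is that finetuning reuses the same loss form as contrastive pretraining, rather than attaching a separate classification head; the official implementation and exact prompt handling are available at the repository linked in the abstract.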