Paper Title
Distilling Localization for Self-Supervised Representation Learning
Authors
Abstract
Recent progress in contrastive learning has revolutionized unsupervised representation learning. Concretely, multiple views (augmentations) of the same image are encouraged to map to similar embeddings, while views from different images are pulled apart. In this paper, by visualizing and diagnosing classification errors, we observe that current contrastive models are ineffective at localizing the foreground object, which limits their ability to extract discriminative high-level features. This is because the view generation process treats all pixels in an image uniformly. To address this problem, we propose a data-driven approach for learning invariance to backgrounds. It first estimates foreground saliency in images and then creates augmentations by copy-and-pasting the foreground onto a variety of backgrounds. Learning still follows the instance discrimination pretext task, so the representation is trained to disregard background content and focus on the foreground. We study a variety of saliency estimation methods and find that most of them improve contrastive learning. With this approach (DiLo), we achieve significant performance gains for self-supervised learning on ImageNet classification, as well as for object detection on PASCAL VOC and MSCOCO.
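The core augmentation described in the abstract — compositing a salient foreground onto a new background before running the usual instance-discrimination pipeline — can be sketched as a simple masked blend. This is an illustrative sketch, not the paper's implementation: the function name, the hard saliency threshold, and the array conventions are assumptions, and the actual method may use soft masks or more elaborate compositing.

```python
import numpy as np

def copy_paste_augment(image, saliency, background, threshold=0.5):
    """Composite the salient foreground of `image` onto `background`.

    image, background: float arrays of shape (H, W, 3) with values in [0, 1].
    saliency: float array of shape (H, W) with values in [0, 1], e.g. the
        output of an off-the-shelf saliency estimator.
    Returns the blended image: foreground pixels where saliency exceeds
    the threshold, background pixels elsewhere.
    """
    # Binarize the saliency map and broadcast it over the color channels.
    mask = (saliency > threshold).astype(np.float32)[..., None]
    return mask * image + (1.0 - mask) * background

# Toy example: a white foreground patch pasted onto a black background.
img = np.ones((4, 4, 3), dtype=np.float32)
sal = np.zeros((4, 4), dtype=np.float32)
sal[1:3, 1:3] = 1.0  # the "object" occupies the center
bg = np.zeros((4, 4, 3), dtype=np.float32)
out = copy_paste_augment(img, sal, bg)
```

The resulting composite would then be fed into the contrastive pipeline as one of the views of the original image, so that matching it to other views requires ignoring the (now arbitrary) background.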