Paper Title
Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization
Paper Authors
Paper Abstract
Open-vocabulary object detection (OVD) aims to scale up the vocabulary to detect objects of novel categories beyond the training vocabulary. Recent work resorts to the rich knowledge in pre-trained vision-language models. However, existing methods are ineffective at proposal-level vision-language alignment. Meanwhile, these models usually suffer from a confidence bias toward base categories and perform worse on novel ones. To overcome these challenges, we present MEDet, a novel and effective OVD framework with proposal mining and prediction equalization. First, we design an online proposal mining scheme that refines the inherited vision-semantic knowledge from coarse to fine, allowing proposal-level, detection-oriented feature alignment. Second, based on causal inference theory, we introduce a class-wise backdoor adjustment that reinforces predictions on novel categories, improving the overall OVD performance. Extensive experiments on the COCO and LVIS benchmarks verify the superiority of MEDet over competing approaches in detecting objects of novel categories, e.g., 32.6% AP50 on COCO and 22.4% mask mAP on LVIS.
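For context, the "backdoor adjustment" mentioned in the abstract is a standard tool from causal inference. In its generic form (not the paper's specific class-wise variant, whose details are not given in this abstract), it estimates the effect of a treatment X on an outcome Y while controlling for a confounder Z:

P(Y | do(X)) = Σ_z P(Y | X, Z = z) · P(Z = z)

Read against the OVD setting described above, this amounts to re-weighting predictions so that they are not dominated by the confidence bias toward base categories; exactly how MEDet chooses Z and applies the adjustment per class is specified in the full paper rather than in this abstract.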