Paper Title
Peer Collaborative Learning for Online Knowledge Distillation
Paper Authors
Paper Abstract
Traditional knowledge distillation uses a two-stage training strategy to transfer knowledge from a high-capacity teacher model to a compact student model, which relies heavily on the pre-trained teacher. Recent online knowledge distillation alleviates this limitation by collaborative learning, mutual learning and online ensembling, following a one-stage end-to-end training fashion. However, collaborative learning and mutual learning fail to construct an online high-capacity teacher, whilst online ensembling ignores the collaboration among branches and its logit summation impedes the further optimisation of the ensemble teacher. In this work, we propose a novel Peer Collaborative Learning method for online knowledge distillation, which integrates online ensembling and network collaboration into a unified framework. Specifically, given a target network, we construct a multi-branch network for training, in which each branch is called a peer. We perform random augmentation multiple times on the inputs to peers and assemble the feature representations output from the peers, with an additional classifier, as the peer ensemble teacher. This helps to transfer knowledge from a high-capacity teacher to the peers, and in turn further optimises the ensemble teacher. Meanwhile, we employ the temporal mean model of each peer as the peer mean teacher to collaboratively transfer knowledge among peers, which helps each peer learn richer knowledge and facilitates the optimisation of a more stable model with better generalisation. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet show that the proposed method significantly improves the generalisation of various backbone networks and outperforms state-of-the-art methods.
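The abstract describes two distillation signals: a peer ensemble teacher (an extra classifier over the concatenated peer features) and a peer mean teacher (a temporal-mean, i.e. EMA, copy of each peer). Below is a minimal PyTorch sketch of how these two signals could be combined into one training loss. It is not the authors' implementation: the class and function names (PeerNet, pcl_losses, ema_update), the toy stem, the temperature, the EMA momentum, and the noise-based stand-in for random augmentation are all illustrative assumptions.

```python
# Minimal sketch of peer-collaborative training, assuming a toy backbone.
# Names and hyperparameters here are placeholders, not the paper's code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 3.0  # distillation temperature (assumed value)

class PeerNet(nn.Module):
    """Multi-branch network: a shared stem, several peer branches with their
    own classifiers, and an extra classifier over the concatenated peer
    features that acts as the peer ensemble teacher."""
    def __init__(self, num_peers=3, feat_dim=128, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(8), nn.Flatten())
        self.peers = nn.ModuleList(
            [nn.Sequential(nn.Linear(32 * 8 * 8, feat_dim), nn.ReLU())
             for _ in range(num_peers)])
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_peers)])
        self.ensemble_head = nn.Linear(num_peers * feat_dim, num_classes)

    def forward(self, views):
        # `views`: one independently augmented copy of the batch per peer,
        # mirroring "random augmentation multiple times" in the abstract.
        feats = [peer(self.stem(v)) for peer, v in zip(self.peers, views)]
        logits = [head(f) for head, f in zip(self.heads, feats)]
        ens_logits = self.ensemble_head(torch.cat(feats, dim=1))
        return logits, ens_logits

def kd(student_logits, teacher_logits):
    """Soft-label KL distillation at temperature T; teacher is detached."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits.detach() / T, dim=1),
                    reduction='batchmean') * T * T

def pcl_losses(model, mean_model, views, targets):
    logits, ens_logits = model(views)
    with torch.no_grad():
        mt_logits, _ = mean_model(views)  # peer mean teachers (EMA weights)
    # Supervised CE for each peer and for the ensemble teacher.
    loss = F.cross_entropy(ens_logits, targets)
    loss = loss + sum(F.cross_entropy(l, targets) for l in logits)
    # Peer ensemble teacher -> each peer.
    loss = loss + sum(kd(l, ens_logits) for l in logits)
    # Collaboration among peers: peer i learns from the temporal mean
    # models of the other peers.
    n = len(logits)
    loss = loss + sum(kd(logits[i], mt_logits[j])
                      for i in range(n) for j in range(n) if j != i)
    return loss

@torch.no_grad()
def ema_update(mean_model, model, momentum=0.999):
    # Note: a full implementation would also track buffers (e.g. BN stats).
    for p_mean, p in zip(mean_model.parameters(), model.parameters()):
        p_mean.mul_(momentum).add_(p, alpha=1 - momentum)

if __name__ == "__main__":
    model = PeerNet()
    mean_model = copy.deepcopy(model)
    for p in mean_model.parameters():
        p.requires_grad_(False)
    x = torch.randn(4, 3, 32, 32)
    # Additive noise as a stand-in for proper random image augmentation.
    views = [x + 0.1 * torch.randn_like(x) for _ in range(3)]
    targets = torch.randint(0, 10, (4,))
    loss = pcl_losses(model, mean_model, views, targets)
    loss.backward()
    ema_update(mean_model, model)
    print(float(loss))
```

In such a setup, only a single branch (or its temporal mean model) would be kept at test time, so the deployed network stays as compact as the original target network while benefiting from the ensemble and mean-teacher signals during training.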