Paper Title
ResKD: Residual-Guided Knowledge Distillation
Paper Authors
Paper Abstract
Knowledge distillation, aimed at transferring the knowledge from a heavy teacher network to a lightweight student network, has emerged as a promising technique for compressing neural networks. However, due to the capacity gap between the heavy teacher and the lightweight student, there still exists a significant performance gap between them. In this paper, we see knowledge distillation in a fresh light, using the knowledge gap, or the residual, between a teacher and a student as guidance to train a much more lightweight student, called a res-student. We combine the student and the res-student into a new student, where the res-student rectifies the errors of the former student. Such a residual-guided process can be repeated until the user strikes a balance between accuracy and cost. At inference time, we propose a sample-adaptive strategy that decides which res-students are unnecessary for each sample, which saves computational cost. Experimental results show that we achieve competitive performance with 18.04$\%$, 23.14$\%$, 53.59$\%$, and 56.86$\%$ of the teachers' computational costs on the CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet datasets. Finally, we conduct a thorough theoretical and empirical analysis of our method.
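To make the residual-guided idea in the abstract concrete, below is a minimal PyTorch sketch. It is an illustration under assumptions, not the paper's actual training recipe: the loss (`res_student_loss`, an MSE regression onto the detached teacher-student logit residual) and the additive combination of logits are hypothetical choices; the paper only states that the residual between teacher and student guides the training of a res-student, whose output is combined with the student's to correct its errors.

```python
# Minimal sketch of residual-guided training for a res-student (assumptions noted above).
import torch
import torch.nn as nn


def res_student_loss(teacher_logits: torch.Tensor,
                     student_logits: torch.Tensor,
                     res_student_logits: torch.Tensor) -> torch.Tensor:
    """Regress the res-student output onto the teacher-student residual.

    The residual (knowledge gap) is detached so that only the res-student
    receives gradients; the base student is assumed to be already trained.
    """
    residual = (teacher_logits - student_logits).detach()
    return nn.functional.mse_loss(res_student_logits, residual)


if __name__ == "__main__":
    # Placeholder tensors standing in for model outputs on one mini-batch.
    batch_size, num_classes = 8, 100
    teacher_logits = torch.randn(batch_size, num_classes)      # frozen teacher
    student_logits = torch.randn(batch_size, num_classes)      # frozen base student
    res_student_logits = torch.randn(batch_size, num_classes,  # trainable res-student
                                     requires_grad=True)

    loss = res_student_loss(teacher_logits, student_logits, res_student_logits)
    loss.backward()

    # At inference, the "new student" prediction is the corrected sum,
    # so the res-student rectifies the base student's errors.
    combined_logits = student_logits + res_student_logits.detach()
```

Under this sketch, the process can be repeated by treating `combined_logits` as the new student and training a further, even lighter res-student against the remaining residual, which mirrors the iterative procedure described in the abstract.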