Paper Title
Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again
Paper Authors
Paper Abstract
Knowledge Distillation (KD) aims at transferring the knowledge of a well-performing neural network (the {\it teacher}) to a weaker one (the {\it student}). A peculiar phenomenon is that a more accurate model does not necessarily teach better, and temperature adjustment cannot alleviate the capacity mismatch either. To explain this, we decompose the efficacy of KD into three parts: {\it correct guidance}, {\it smooth regularization}, and {\it class discriminability}. The last term describes the distinctness of the {\it wrong class probabilities} that the teacher provides in KD. Complex teachers tend to be over-confident, and traditional temperature scaling limits the efficacy of {\it class discriminability}, resulting in less discriminative wrong class probabilities. Therefore, we propose {\it Asymmetric Temperature Scaling (ATS)}, which applies a higher temperature to the correct class and a lower temperature to the wrong classes. ATS enlarges the variance of the wrong class probabilities in the teacher's label and makes the student grasp the absolute affinities of the wrong classes to the target class as discriminatively as possible. Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS. A demo developed in MindSpore is available at https://gitee.com/lxcnju/ats-mindspore and will also be available at https://gitee.com/mindspore/models/tree/master/research/cv/ats.
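The mechanism described in the abstract (dividing the correct-class logit by a higher temperature and the wrong-class logits by a lower one before the softmax) can be illustrated with a minimal sketch. The function name `ats_softmax` and the default temperatures below are assumptions for illustration, not taken from the paper's implementation.

```python
import numpy as np

def ats_softmax(logits, target, tau_correct=4.0, tau_wrong=2.0):
    """Sketch of Asymmetric Temperature Scaling (ATS).

    The correct-class logit is divided by a higher temperature and the
    wrong-class logits by a lower one, which enlarges the variance among
    the wrong-class probabilities in the teacher's soft label.
    Default temperatures are illustrative only.
    """
    logits = np.asarray(logits, dtype=float)
    scaled = logits / tau_wrong            # lower temperature for wrong classes
    scaled[target] = logits[target] / tau_correct  # higher temperature for the correct class
    exp = np.exp(scaled - scaled.max())    # numerically stable softmax
    return exp / exp.sum()

# Example: logits of an over-confident teacher on a 5-class problem
teacher_logits = [9.0, 2.0, 1.5, 1.0, 0.5]
print(ats_softmax(teacher_logits, target=0))
```

Compared with a single shared temperature, this keeps the correct-class probability from dominating while leaving the wrong-class probabilities more spread out, i.e. more discriminative for the student.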