Paper Title
Representative Teacher Keys for Knowledge Distillation Model Compression Based on Attention Mechanism for Image Classification
Paper Authors
Paper Abstract
With the improvement of AI chips (e.g., GPU, TPU, and NPU) and the rapid development of the Internet of Things (IoT), powerful deep neural networks (DNNs) often consist of millions or even hundreds of millions of parameters. Such large models may not be suitable for direct deployment on low-computation, low-capacity units (e.g., edge devices). Knowledge distillation (KD) has recently been recognized as a powerful model compression method that effectively reduces the number of model parameters. The central concept of KD is to extract useful information from the feature maps of a large model (i.e., the teacher model) as a reference to successfully train a small model (i.e., the student model) whose size is much smaller than the teacher's. Although many KD methods have been proposed to exploit the information in the feature maps of the teacher model's intermediate layers, most do not consider the similarity of the feature maps between the teacher and student models, so the student model may end up learning useless information. Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK) that not only considers the similarity of feature maps but also filters out useless information to improve the performance of the target student model. In the experiments, we validate the proposed method with several backbone networks (e.g., ResNet and WideResNet) and datasets (e.g., CIFAR10, CIFAR100, SVHN, and CINIC10). The results show that the proposed RTK can effectively improve the classification accuracy of the state-of-the-art attention-based KD method.
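To make the attention-based KD setting that RTK builds on concrete, the following is a minimal PyTorch sketch of an attention-transfer-style distillation loss, in which spatial attention maps are derived from intermediate feature maps and the student is pushed toward the teacher's maps. This is only an illustrative sketch under common assumptions (channel-wise squared-activation attention, L2-normalized maps, a hypothetical weighting factor beta), not the authors' RTK implementation.

```python
# Minimal attention-based KD sketch (illustrative, not the paper's RTK method).
import torch
import torch.nn.functional as F

def attention_map(feature_map: torch.Tensor) -> torch.Tensor:
    """Collapse a (B, C, H, W) feature map into a normalized (B, H*W) attention map."""
    attn = feature_map.pow(2).mean(dim=1)   # (B, H, W): channel-wise activation energy
    attn = attn.flatten(start_dim=1)        # (B, H*W)
    return F.normalize(attn, p=2, dim=1)    # unit L2 norm per sample

def attention_distillation_loss(student_feats, teacher_feats) -> torch.Tensor:
    """Mean squared distance between student and teacher attention maps, averaged over layers."""
    losses = []
    for fs, ft in zip(student_feats, teacher_feats):
        # Spatial sizes may differ between backbones; resize the student map to match the teacher.
        if fs.shape[-2:] != ft.shape[-2:]:
            fs = F.interpolate(fs, size=ft.shape[-2:], mode="bilinear", align_corners=False)
        losses.append((attention_map(fs) - attention_map(ft)).pow(2).mean())
    return torch.stack(losses).mean()

# Usage (hypothetical): combine with the usual cross-entropy on the student logits,
#   total_loss = ce_loss + beta * attention_distillation_loss(student_feats, teacher_feats)
# where student_feats / teacher_feats are lists of intermediate feature maps.
```

RTK differs from this baseline in that it additionally measures the similarity between teacher and student feature maps and filters out unrepresentative (useless) teacher information before distillation, as described in the abstract above.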