Paper Title

Interpretation of Neural Networks is Susceptible to Universal Adversarial Perturbations

Paper Authors

Haniyeh Ehsani Oskouie, Farzan Farnia

Paper Abstract

Interpreting neural network classifiers using gradient-based saliency maps has been extensively studied in the deep learning literature. While the existing algorithms manage to achieve satisfactory performance in application to standard image recognition datasets, recent works demonstrate the vulnerability of widely-used gradient-based interpretation schemes to norm-bounded perturbations adversarially designed for every individual input sample. However, such adversarial perturbations are commonly designed using the knowledge of an input sample, and hence perform sub-optimally in application to an unknown or constantly changing data point. In this paper, we show the existence of a Universal Perturbation for Interpretation (UPI) for standard image datasets, which can alter a gradient-based feature map of neural networks over a significant fraction of test samples. To design such a UPI, we propose a gradient-based optimization method as well as a principal component analysis (PCA)-based approach to compute a UPI which can effectively alter a neural network's gradient-based interpretation on different samples. We support the proposed UPI approaches by presenting several numerical results of their successful applications to standard image datasets.
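The abstract names two routes to a UPI: a gradient-based optimization and a PCA-based construction. The PyTorch sketch below illustrates one plausible reading of the PCA route, not the authors' published implementation. The names `model`, `loader`, `eps`, and `n_batches`, the saliency-energy surrogate objective, and the l2 projection are all illustrative assumptions; the paper's actual objective measures the change in the saliency map under perturbation, and its norm constraint may differ.

```python
# Sketch (assumptions labeled): PCA-based universal perturbation for
# gradient-based interpretations. Per-sample directions that disturb the
# saliency map are stacked into a matrix, and the top principal component
# serves as the single shared perturbation.
import torch

def saliency(model, x, create_graph=False):
    """Gradient-based saliency map: gradient of the top-class logit w.r.t. the input."""
    if not x.requires_grad:
        x = x.detach().requires_grad_(True)
    logits = model(x)
    top = logits.gather(1, logits.argmax(dim=1, keepdim=True)).sum()
    (grad,) = torch.autograd.grad(top, x, create_graph=create_graph)
    return grad

def pca_upi(model, loader, eps, n_batches=8):
    """Collect per-sample saliency-perturbing directions, then return the top
    principal component scaled onto an l2 ball of radius eps (an assumed
    constraint; the paper may use a different norm)."""
    directions, shape = [], None
    for i, (x, _) in enumerate(loader):
        if i >= n_batches:
            break
        shape = x.shape[1:]
        x = x.clone().requires_grad_(True)
        s = saliency(model, x, create_graph=True)  # keep graph for 2nd-order grad
        # Surrogate objective (an assumption): the saliency map's energy.
        # Its input-gradient gives a direction that alters the map.
        obj = (s ** 2).sum()
        (g,) = torch.autograd.grad(obj, x)
        directions.append(g.detach().flatten(1))
    D = torch.cat(directions)        # (N, d) matrix of per-sample directions
    D = D - D.mean(dim=0)            # center before PCA
    _, _, Vh = torch.linalg.svd(D, full_matrices=False)
    upi = Vh[0]                      # top principal direction shared across samples
    return (eps * upi / upi.norm()).reshape(shape)
```

A gradient-based alternative in the same spirit would run projected gradient ascent on a single shared perturbation across many batches, maximizing a saliency-change objective; the exact objectives and constraints used in the paper should be taken from its text.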
