Paper Title

What to Prune and What Not to Prune at Initialization

Authors

Haroon, Maham

Abstract

Post-training dropout-based approaches achieve high sparsity and are a well-established means of addressing problems relating to computational cost and overfitting in neural network architectures. In contrast, pruning at initialization is still far behind. Pruning at initialization is more effective when it comes to scaling the computational cost of the network, and it handles overfitting just as well as post-training dropout. For these reasons, the paper presents two approaches for pruning at initialization. The goal is to achieve higher sparsity while preserving performance. 1) K-starts begins with k random p-sparse matrices at initialization. Over the first couple of epochs the network then determines the "fittest" of these p-sparse matrices in an attempt to find the "lottery ticket" p-sparse network. The approach is adapted from how evolutionary algorithms find the fittest individual. Depending on the neural network architecture, the fitness criterion can be based on the magnitude of the network weights, the magnitude of gradient accumulation over an epoch, or a combination of both. 2) The dissipating-gradients approach aims to eliminate weights that remain within a fraction of their initial value during the first couple of epochs. Removing weights in this manner, regardless of their magnitude, best preserves the performance of the network; on the other hand, the approach also takes the most epochs to reach higher sparsity. 3) A combination of dissipating gradients and k-starts consistently outperforms either method alone as well as random dropout. The benefits of the proposed pruning approaches are: 1) they do not require specific knowledge of the classification task, a fixed dropout threshold, or regularization parameters; 2) retraining of the model is neither necessary nor does it affect the performance of the p-sparse network.
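To make the two pruning criteria concrete, the minimal NumPy sketch below illustrates them on a single weight matrix. It is not the authors' implementation: the helper names (random_p_sparse_masks, k_starts_select, dissipating_gradients_mask), the fitness weighting alpha, and the movement threshold frac are illustrative assumptions inferred from the abstract alone.

import numpy as np

# Illustrative sketch of the abstract's two initialization-pruning ideas.
# All names and thresholds here are assumptions, not the paper's API.

def random_p_sparse_masks(shape, p, k, rng):
    """Generate k random binary masks, each keeping a fraction (1 - p) of weights."""
    masks = []
    n = int(np.prod(shape))
    n_keep = int(round((1.0 - p) * n))
    for _ in range(k):
        keep = rng.choice(n, size=n_keep, replace=False)
        mask = np.zeros(n, dtype=bool)
        mask[keep] = True
        masks.append(mask.reshape(shape))
    return masks

def k_starts_select(masks, weights, grad_accum, alpha=0.5):
    """K-starts: pick the 'fittest' mask, with fitness mixing weight magnitude
    and accumulated-gradient magnitude over the surviving entries."""
    def fitness(mask):
        w = np.abs(weights[mask]).sum()
        g = np.abs(grad_accum[mask]).sum()
        return alpha * w + (1.0 - alpha) * g
    return max(masks, key=fitness)

def dissipating_gradients_mask(initial_weights, current_weights, frac=0.01):
    """Dissipating gradients: drop weights that moved by less than `frac` of
    their initial magnitude during the first few epochs."""
    moved = np.abs(current_weights - initial_weights)
    return moved > frac * np.abs(initial_weights)

# Toy usage with random stand-ins for a layer's weights and gradients.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(64, 32))                 # weights at initialization
w = w0 + 0.05 * rng.normal(size=w0.shape)      # weights after a couple of epochs
g = rng.normal(size=w0.shape)                  # gradients accumulated over an epoch

masks = random_p_sparse_masks(w0.shape, p=0.8, k=10, rng=rng)
best = k_starts_select(masks, w, g)
dg = dissipating_gradients_mask(w0, w)
combined = best & dg                           # combination of both criteria
print(f"kept by k-starts: {best.mean():.2f}, "
      f"by dissipating gradients: {dg.mean():.2f}, combined: {combined.mean():.2f}")

In practice each mask would be applied per layer during the first few epochs of training, with the accumulated gradients taken from the actual backward passes rather than the random stand-ins used here.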
