Paper Title
I/O Lower Bounds for Auto-tuning of Convolutions in CNNs
Paper Authors
Paper Abstract
Convolution is the most time-consuming part of the computation in convolutional neural networks (CNNs), which have achieved great success in numerous applications. Due to complex data dependencies and the growing number of model samples, convolution suffers from high data-movement (i.e., memory access) overhead. This work provides comprehensive analysis and methodologies to minimize the communication of convolutions in CNNs. Through an in-depth analysis of recent I/O complexity theory under the red-blue pebble game model, we develop a general I/O lower bound theory for composite algorithms that consist of several different sub-computations. Based on the proposed theory, we establish data-movement lower bound results for two representative convolution algorithms in CNNs, namely the direct convolution and the Winograd algorithm. Next, guided by these I/O lower bound results, we design near I/O-optimal dataflow strategies for the two convolution algorithms by fully exploiting data reuse. Furthermore, to push the performance of the near I/O-optimal dataflow strategies even further, we propose an aggressive auto-tuning design based on I/O lower bounds that searches for an optimal parameter configuration for the direct convolution and the Winograd algorithm on GPUs, such as the number of threads and the amount of shared memory used in each thread block. Finally, experimental results for the direct convolution and the Winograd algorithm show that our dataflow strategies combined with the auto-tuning approach achieve about 3.32x speedup on average over cuDNN. In addition, compared with TVM, which represents the state of the art in auto-tuning, our I/O lower bound based auto-tuning method not only finds the optimal parameter configuration faster, but also yields higher performance than the best solution provided by TVM.
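To make the data-reuse structure the abstract refers to concrete, the following is a minimal NumPy sketch of the direct convolution. It is not the authors' GPU dataflow; the shapes and names are illustrative. Each input element is read by many output computations, and a near I/O-optimal dataflow tries to serve those repeated reads from fast memory instead of re-fetching from slow memory.

```python
import numpy as np

def direct_conv2d(x, w):
    """Direct convolution: x is (C, H, W), w is (K, C, R, S);
    returns (K, H-R+1, W-S+1). Illustrative, not the paper's kernel."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):
        for i in range(H - R + 1):
            for j in range(W - S + 1):
                # x[:, i:i+R, j:j+S] is reused across all K filters and
                # across overlapping output positions -- exactly the reuse
                # an I/O-optimal blocking must keep on-chip
                out[k, i, j] = np.sum(x[:, i:i+R, j:j+S] * w[k])
    return out

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
w = np.ones((3, 2, 3, 3))
y = direct_conv2d(x, w)
print(y.shape)  # (3, 2, 2)
```

Counting accesses in these loops makes clear why a naive schedule reloads each input tile K times, which is the kind of traffic the paper's lower bounds quantify.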
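For the second algorithm the abstract analyzes, a 1D F(2,3) instance of the standard Winograd construction can be sketched as follows. This is a textbook illustration, not the paper's tiled GPU implementation: it computes two outputs of a 3-tap correlation with 4 elementwise multiplies instead of 6, at the cost of extra transform additions and transformed-operand traffic, which is why its I/O behavior differs from direct convolution.

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices (input, filter, output).
Bt = np.array([[1, 0, -1,  0],
               [0, 1,  1,  0],
               [0, -1, 1,  0],
               [0, 1,  0, -1]], dtype=float)
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: length-4 input tile, g: length-3 filter -> 2 correlation outputs.
    The elementwise product is the only multiply stage (4 multiplies)."""
    return At @ ((G @ g) * (Bt @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 1.0, 1.0])
print(winograd_f23(d, g))  # [6. 9.]
```

The result matches the sliding-window correlation `np.correlate(d, g, 'valid')`; in a CNN the same transforms are applied per tile and per channel, and the paper's auto-tuner chooses how those tiles map onto threads and shared memory.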