Paper Title
Downlink Compression Improves TopK Sparsification
Paper Authors
Paper Abstract
Training large neural networks is time consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to its largest K components before sending it to other nodes. While some authors have investigated topK gradient sparsification in the parameter-server framework by applying topK compression in both the worker-to-server (uplink) and server-to-worker (downlink) directions, the currently accepted view is that adding extra compression degrades the convergence of the model. We demonstrate, on the contrary, that adding downlink compression can potentially improve the performance of topK sparsification: not only does it reduce the amount of communication per step, but it can also, counter-intuitively, improve the upper bound in the convergence analysis. To show this, we revisit the non-convex convergence analysis of topK stochastic gradient descent (SGD) and extend it from the unidirectional to the bidirectional setting. We also remove a restriction of the previous analysis that requires unrealistically large values of K. We experimentally evaluate bidirectional topK SGD against unidirectional topK SGD and show that models trained with bidirectional topK SGD perform as well as models trained with unidirectional topK SGD, while yielding significant communication savings for large numbers of workers.
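To make the compression operator and the two communication directions concrete, here is a minimal NumPy sketch, not taken from the paper: the function names `topk_compress` and `bidirectional_topk_step` are illustrative, and the sketch omits the error-feedback (memory) terms that practical topK SGD implementations often include.

```python
import numpy as np

def topk_compress(grad: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude components of grad; zero out the rest."""
    flat = grad.ravel()
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

def bidirectional_topk_step(worker_grads, params, lr, k_up, k_down):
    """One illustrative parameter-server step with topK compression on both links.

    Uplink:   each worker sends a topK-compressed gradient to the server.
    Downlink: the server broadcasts a topK-compressed aggregate back.
    """
    compressed = [topk_compress(g, k_up) for g in worker_grads]
    avg = np.mean(compressed, axis=0)          # server-side aggregation
    update = topk_compress(avg, k_down)        # extra downlink compression
    return params - lr * update
```

Note that the averaged uplink gradients can have up to `k_up * num_workers` nonzero entries, so the downlink step re-sparsifies the aggregate to `k_down` components before broadcasting, which is the source of the per-step communication savings described in the abstract.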