Paper Title
FP8 Formats for Deep Learning
Paper Authors
Paper Abstract
FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training quantization of language models, trained using 16-bit formats, that resisted fixed-point INT8 quantization.
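To make the two encodings concrete, below is a minimal decoding sketch (not the paper's reference code) that interprets an 8-bit pattern as E4M3 or E5M2 under the conventions stated in the abstract: a standard exponent bias with subnormals, IEEE-754-style infinities and NaNs for E5M2, and for E4M3 no infinities plus a single all-ones mantissa pattern (at the maximum exponent) reserved for NaN. The function names (`decode_fp8`, `decode_e4m3`, `decode_e5m2`) are illustrative, not from the paper.

```python
# Sketch: decode an 8-bit value under the E4M3 / E5M2 interpretations described above.
# Assumptions: bias = 2^(exp_bits-1) - 1 (7 for E4M3, 15 for E5M2), subnormals at
# exponent field 0, and the special-value conventions stated in the abstract.

def decode_fp8(byte: int, exp_bits: int, man_bits: int, ieee_specials: bool) -> float:
    """Decode one FP8 byte. (exp_bits, man_bits) is (4, 3) for E4M3 or (5, 2) for E5M2."""
    assert 0 <= byte <= 0xFF and exp_bits + man_bits == 7
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1           # 7 for E4M3, 15 for E5M2
    max_exp = (1 << exp_bits) - 1

    if exp == max_exp:
        if ieee_specials:                      # E5M2: IEEE-style infinities and NaNs
            return sign * float("inf") if man == 0 else float("nan")
        if man == (1 << man_bits) - 1:         # E4M3: only the all-ones mantissa is NaN
            return float("nan")
        # Remaining max-exponent patterns stay ordinary numbers, extending E4M3's range.
    if exp == 0:                               # subnormal: no implicit leading 1
        return sign * 2.0 ** (1 - bias) * (man / (1 << man_bits))
    return sign * 2.0 ** (exp - bias) * (1 + man / (1 << man_bits))


def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, ieee_specials=False)


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, ieee_specials=True)


if __name__ == "__main__":
    print(decode_e4m3(0b0_1111_110))   # 448.0, largest E4M3 normal
    print(decode_e4m3(0b0_1111_111))   # nan, the single E4M3 NaN mantissa pattern
    print(decode_e5m2(0b0_11110_11))   # 57344.0, largest E5M2 normal
    print(decode_e5m2(0b0_11111_00))   # inf, E5M2 keeps IEEE special values
```

The printed values illustrate the trade-off the abstract describes: E4M3 spends its top exponent code on extra normal numbers (reaching 448) at the cost of infinities, while E5M2 keeps the full IEEE special-value set and a wider dynamic range (up to 57344) with one fewer mantissa bit.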