Paper Title

Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Paper Authors

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, Furu Wei

Paper Abstract

Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. The improved performance over vanilla attention further supports our understanding from another perspective, and more importantly, shows the potential to utilize our understanding for future model design. The code is available at \url{https://aka.ms/icl}.
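
To make the "dual form" claim in the abstract concrete, here is a minimal numerical sketch, assuming a softmax-free linear-attention view: a linear layer whose weights receive outer-product gradient-descent updates from demonstration examples produces exactly the same output as the unchanged layer plus linear attention whose keys are the demonstration inputs and whose values are the corresponding error signals (the "meta-gradients"). Variable names and shapes below are illustrative assumptions, not the paper's notation.

```python
# Minimal sketch of the duality between gradient-descent updates and
# (softmax-free) linear attention, under the assumptions stated above.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_demos = 8, 4, 5

W0 = rng.normal(size=(d_out, d_in))   # initial ("zero-shot") weights
X = rng.normal(size=(n_demos, d_in))  # demonstration inputs x_i
E = rng.normal(size=(n_demos, d_out)) # error signals e_i ("meta-gradients")
q = rng.normal(size=d_in)             # query / test input

# View 1: apply the outer-product weight update, then run the layer.
# Delta_W = sum_i e_i x_i^T
delta_W = E.T @ X
out_finetuned = (W0 + delta_W) @ q

# View 2: keep W0 fixed and add linear attention over the demonstrations,
# with keys x_i and values e_i:  sum_i e_i (x_i^T q)
out_attention = W0 @ q + E.T @ (X @ q)

# The two views coincide, which is the sense in which attention can be
# read as implicitly performing gradient descent on the demonstrations.
assert np.allclose(out_finetuned, out_attention)
print("max abs difference:", np.abs(out_finetuned - out_attention).max())
```

The identity is just (W_0 + Σ_i e_i x_i^T) q = W_0 q + Σ_i e_i (x_i^T q); the paper's full analysis and its momentum-based attention variant build on this relaxed linear-attention form rather than standard softmax attention.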
