Paper Title

Learning by Distilling Context

Paper Authors

Charlie Snell, Dan Klein, Ruiqi Zhong

Paper Abstract

Language models significantly benefit from context tokens, such as prompts or scratchpads. They perform better when prompted with informative instructions, and they acquire new reasoning capabilities by generating a scratch-pad before predicting the final answers. However, they do not internalize these performance gains, which disappear when the context tokens are gone. Our work proposes to apply context distillation so that a language model can improve itself by internalizing these gains. Concretely, given a synthetic unlabeled input for the target task, we condition the model on "[instructions] + [task-input]" to predict "[scratch-pad] + [final answer]"; then we fine-tune the same model to predict its own "[final answer]" conditioned on the "[task-input]", without seeing the "[instructions]" or using the "[scratch-pad]". We show that context distillation is a general method to train language models, and it can effectively internalize 3 types of training signals. First, it can internalize abstract task instructions and explanations, so we can iteratively update the model parameters with new instructions and overwrite old ones. Second, it can internalize step-by-step reasoning for complex tasks (e.g., 8-digit addition), and such a newly acquired capability proves to be useful for other downstream tasks. Finally, it can internalize concrete training examples, and it outperforms directly learning with gradient descent by 9% on the SPIDER Text-to-SQL dataset; furthermore, combining context distillation operations can internalize more training examples than the context window size allows.
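
To make the recipe in the abstract concrete, below is a minimal sketch of one context-distillation step, assuming a Hugging Face-style causal LM. The model name ("gpt2"), the example instruction and task input, and the hard-label fine-tuning loss on the model's own sampled answer are illustrative assumptions, not the paper's exact setup; the paper's models and training objective may differ.

```python
# Minimal context-distillation sketch (assumptions: gpt2 as a stand-in model,
# toy 8-digit-addition prompt, hard-label loss on the model's own final answer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper distills much larger language models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

instructions = "Add the two numbers digit by digit, then give the final answer."
task_inputs = ["12345678 + 87654321 ="]  # synthetic unlabeled task inputs

for task_input in task_inputs:
    # 1) Teacher pass: condition on [instructions] + [task-input] and let the
    #    model generate [scratch-pad] + [final answer].
    teacher_prompt = instructions + "\n" + task_input
    prompt_ids = tokenizer(teacher_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        generated = model.generate(prompt_ids, max_new_tokens=64, do_sample=False)
    completion = tokenizer.decode(generated[0, prompt_ids.shape[1]:],
                                  skip_special_tokens=True)
    final_answer = completion.split("\n")[-1]  # assume the answer is the last line

    # 2) Student pass: fine-tune the same model to predict its own [final answer]
    #    conditioned on [task-input] alone -- no instructions, no scratch-pad.
    student_text = task_input + " " + final_answer
    input_len = tokenizer(task_input, return_tensors="pt").input_ids.shape[1]
    batch = tokenizer(student_text, return_tensors="pt")
    labels = batch.input_ids.clone()
    labels[:, :input_len] = -100  # mask the task-input tokens out of the loss

    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Masking the task-input tokens with -100 ensures the loss covers only the final-answer tokens, so the update internalizes the behavior induced by the instructions and scratch-pad rather than merely memorizing the input text.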
