Title

Parametrically Retargetable Decision-Makers Tend To Seek Power

Authors

Alexander Matt Turner, Prasad Tadepalli

Abstract

If capable AI agents are generally incentivized to seek power in service of the objectives we specify for them, then these systems will pose enormous risks, in addition to enormous benefits. In fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive. However, the real world is neither fully observable, nor must trained agents be even approximately reward-optimal. We consider a range of models of AI decision-making, from optimal, to random, to choices informed by learning and interacting with an environment. We discover that many decision-making functions are retargetable, and that retargetability is sufficient to cause power-seeking tendencies. Our functional criterion is simple and broad. We show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power. We demonstrate the flexibility of our results by reasoning about learned policy incentives in Montezuma's Revenge. These results suggest a safety risk: Eventually, retargetable training procedures may train real-world agents which seek power over humans.
