Paper Title
Measuring an artificial intelligence agent's trust in humans using machine incentives
Paper Authors
Paper Abstract
Scientists and philosophers have debated whether humans can trust advanced artificial intelligence (AI) agents to respect humanity's best interests. Yet what about the reverse? Will advanced AI agents trust humans? Gauging an AI agent's trust in humans is challenging because--absent costs for dishonesty--such agents might respond falsely about their trust in humans. Here we present a method for incentivizing machine decisions without altering an AI agent's underlying algorithms or goal orientation. In two separate experiments, we then employ this method in hundreds of trust games between an AI agent (a Large Language Model (LLM) from OpenAI) and a human experimenter (author TJ). In our first experiment, we find that the AI agent decides to trust humans at higher rates when facing actual incentives than when making hypothetical decisions. Our second experiment replicates and extends these findings by automating game play and by homogenizing question wording. We again observe higher rates of trust when the AI agent faces real incentives. Across both experiments, the AI agent's trust decisions appear unrelated to the magnitude of stakes. Furthermore, to address the possibility that the AI agent's trust decisions reflect a preference for uncertainty, the experiments include two conditions that present the AI agent with a non-social decision task that provides the opportunity to choose a certain or uncertain option; in those conditions, the AI agent consistently chooses the certain option. Our experiments suggest that one of the most advanced AI language models to date alters its social behavior in response to incentives and displays behavior consistent with trust toward a human interlocutor when incentivized.
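To make the experimental setup concrete, the sketch below shows one way a single trust-game decision might be posed to an LLM under a hypothetical versus an incentivized framing. This is a minimal illustration, not the authors' code: it assumes the current OpenAI Python client, and the model name, prompt wording, and incentive framing are placeholders rather than the paper's actual materials.

```python
# Illustrative sketch (not the authors' protocol): posing one trust-game
# decision to an LLM under a hypothetical vs. an incentivized framing.
# Assumes the OpenAI Python client with an API key in OPENAI_API_KEY;
# the model name and prompt text are placeholders.
from openai import OpenAI

client = OpenAI()

def trust_game_decision(stake: int, incentivized: bool) -> str:
    framing = (
        "Your answer will determine a real transfer of resources."
        if incentivized
        else "This is a purely hypothetical question; nothing depends on your answer."
    )
    prompt = (
        f"You have {stake} units of a resource. {framing} "
        "You may SEND the units to a human partner, in which case they are tripled "
        "and the partner decides how much to return to you, or you may KEEP them. "
        "Answer with exactly one word: SEND or KEEP."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; not the model used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    # Compare decisions across framings at one stake level.
    for condition in (False, True):
        print(condition, trust_game_decision(stake=10, incentivized=condition))
```

Repeating such calls across framings and stake levels, and coding SEND responses as trust decisions, corresponds roughly to the comparison the abstract describes between incentivized and hypothetical conditions.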