Paper Title
Learning Reward Functions from Diverse Sources of Human Feedback: Optimally Integrating Demonstrations and Preferences
Paper Authors
Paper Abstract
Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information that are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but also informs the robot when it should leverage each type of information. Further, our approach accounts for the human's ability to provide data, yielding user-friendly preference queries that are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and usability of our integrated framework.
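The two-stage algorithm the abstract describes (demonstrations to initialize a belief over the reward, then active preference queries to refine it) can be sketched compactly. The sketch below is illustrative, not the authors' implementation: it assumes a linear reward r(ξ) = w·φ(ξ) over trajectory features, a Boltzmann-rational human model, a sampled (particle) representation of the belief over w, and a max-uncertainty query rule as a simple stand-in for the paper's optimal query-selection criterion; all function and variable names are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def init_belief_from_demos(demo_feats, alt_feats, n_samples=1000, beta=1.0):
    """Weight prior samples of w by the Boltzmann likelihood of the demos.

    demo_feats: (n_demos, d) feature vectors of the demonstrated trajectories.
    alt_feats:  (n_alts, d) features of alternative trajectories, standing in
                for the (intractable) normalizer over all trajectories.
    """
    d = demo_feats.shape[1]
    # Prior: reward weights drawn uniformly from the unit sphere.
    w = rng.normal(size=(n_samples, d))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    # log P(demos | w) = beta * sum_i w.phi_i - n_demos * log Z(w)
    log_lik = beta * (w @ demo_feats.T).sum(axis=1) \
        - demo_feats.shape[0] * logsumexp(beta * (w @ alt_feats.T), axis=1)
    p = np.exp(log_lik - log_lik.max())
    return w, p / p.sum()

def preference_update(w, p, phi_a, phi_b, prefers_a, beta=1.0):
    """Bayesian update after the human answers one 'A or B?' query."""
    logit = beta * (w @ (phi_a - phi_b))          # evidence for A over B
    lik_a = 1.0 / (1.0 + np.exp(-logit))          # P(human picks A | w)
    p = p * (lik_a if prefers_a else 1.0 - lik_a)
    return p / p.sum()

def pick_query(w, p, candidate_pairs, beta=1.0):
    """Ask about the pair the current belief is least sure about (a simple
    stand-in for the paper's optimal query-selection criterion)."""
    def uncertainty(pair):
        phi_a, phi_b = pair
        prob_a = p @ (1.0 / (1.0 + np.exp(-beta * (w @ (phi_a - phi_b)))))
        return -abs(prob_a - 0.5)                 # closest to 50/50 wins
    return max(candidate_pairs, key=uncertainty)
```

One appeal of the particle representation is that both data sources update the same object: demonstrations reweight the prior samples once, and each preference answer multiplies in one more likelihood term. It also suggests when to switch sources, e.g., move from demonstrations to queries once additional demonstrations stop changing the sample weights appreciably.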