Paper Title

Provably Efficient Causal Reinforcement Learning with Confounded Observational Data

Paper Authors

Lingxiao Wang, Zhuoran Yang, Zhaoran Wang

Paper Abstract

Empowered by expressive function approximators such as neural networks, deep reinforcement learning (DRL) has achieved tremendous empirical success. However, learning expressive function approximators requires collecting a large dataset (interventional data) by interacting with the environment. Such a lack of sample efficiency prohibits the application of DRL to critical scenarios, e.g., autonomous driving and personalized medicine, since trial and error in the online setting is often unsafe or even unethical. In this paper, we study how to incorporate a dataset collected offline (observational data), which is often abundant in practice, to improve sample efficiency in the online setting. To incorporate the possibly confounded observational data, we propose the Deconfounded Optimistic Value Iteration (DOVI) algorithm, which exploits the confounded observational data in a provably efficient manner. More specifically, DOVI explicitly adjusts for the confounding bias in the observational data, where the confounders may be partially observed or unobserved. In both cases, such adjustments allow us to construct a bonus based on a notion of information gain, which accounts for the amount of information acquired from the offline dataset. In particular, we prove that the regret of DOVI is smaller than the optimal regret achievable in the pure online setting by a multiplicative factor, which decreases toward zero as the confounded observational data become more informative upon adjustment. Our algorithm and analysis serve as a step toward causal reinforcement learning.
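For the partially observed case, the adjustment the abstract alludes to is of the kind given by the standard backdoor formula from causal inference: P(s' | s, do(a)) = Σ_w P(s' | s, a, w) · P(w | s), valid when the observed confounder w blocks all backdoor paths from the action a to the next state s'. Below is a minimal, hypothetical sketch (not the paper's DOVI algorithm) of the surrounding skeleton: tabular optimistic value iteration in which offline transition counts, assumed to be already deconfounded by such an adjustment, are pooled with online counts so that informative offline data shrink the UCB-style exploration bonus. All names and parameters here (offline_trans, beta, ...) are illustrative assumptions, not from the paper.

```python
import numpy as np

# Minimal toy sketch -- NOT the paper's DOVI algorithm. Tabular optimistic
# value iteration where offline transition counts (assumed to be already
# deconfounded, e.g. via a backdoor adjustment) are pooled with online
# counts, so more informative offline data yield a smaller exploration bonus.

def optimistic_value_iteration(online_trans, offline_trans, rewards, H, beta=1.0):
    """online_trans[s, a, s'], offline_trans[s, a, s']: transition counts;
    rewards[s, a] in [0, 1]; H: horizon; beta: bonus scale (illustrative)."""
    S, A, _ = online_trans.shape
    counts = online_trans + offline_trans         # pooled (s, a, s') counts
    n = np.maximum(counts.sum(axis=-1), 1.0)      # effective visits to (s, a)
    p_hat = counts / n[..., None]                 # empirical P(s' | s, a)
    bonus = beta / np.sqrt(n)                     # UCB-style bonus
    V = np.zeros(S)                               # V_{H+1} = 0
    for h in range(H - 1, -1, -1):                # backward induction
        Q = np.clip(rewards + p_hat @ V + bonus, 0.0, H - h)  # optimistic Q_h
        V = Q.max(axis=-1)                        # greedy value V_h
    return V

# Toy usage: 3 states, 2 actions, horizon 5; random counts stand in for data.
rng = np.random.default_rng(0)
online = rng.integers(0, 5, size=(3, 2, 3)).astype(float)
offline = rng.integers(0, 50, size=(3, 2, 3)).astype(float)
r = rng.uniform(size=(3, 2))
print(optimistic_value_iteration(online, offline, r, H=5))
```

Pooling counts this way mirrors the abstract's regret claim: the larger the effective counts contributed by the adjusted offline data, the smaller the bonus, and hence the smaller the multiplicative factor relative to the purely online regret.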
