Paper Title
Environment Shaping in Reinforcement Learning using State Abstraction
Paper Authors
Paper Abstract
One of the central challenges faced by a reinforcement learning (RL) agent is to effectively learn a (near-)optimal policy in environments with large state spaces having sparse and noisy feedback signals. In real-world applications, an expert with additional domain knowledge can help speed up the learning process via \emph{shaping the environment}, i.e., making the environment more learner-friendly. A popular paradigm in the literature is \emph{potential-based reward shaping}, where the environment's reward function is augmented with additional local rewards using a potential function. However, the applicability of potential-based reward shaping is limited in settings where (i) the state space is very large, and it is challenging to compute an appropriate potential function, (ii) the feedback signals are noisy, and even with shaped rewards the agent could be trapped in local optima, and (iii) changing the rewards alone is not sufficient, and effective shaping requires changing the dynamics. We address these limitations of potential-based shaping methods and propose a novel framework of \emph{environment shaping using state abstraction}. Our key idea is to compress the environment's large state space, with its noisy signals, into an abstract space, and to use this abstraction to create smoother and more effective feedback signals for the agent. We study the theoretical underpinnings of our abstraction-based environment shaping, and show that the agent's policy learnt in the shaped environment preserves near-optimal behavior in the original environment.
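For background, the \emph{potential-based reward shaping} paradigm referred to in the abstract augments the environment's reward with a difference of potentials. In standard notation (these symbols are not defined in the abstract itself): given a reward function $R$, a potential function $\Phi$ over states, and discount factor $\gamma$, the shaped reward is
$$ R'(s, a, s') \,=\, R(s, a, s') + \gamma\,\Phi(s') - \Phi(s), $$
which is known to preserve the (near-)optimal policies of the original environment. The framework summarized above differs in that it shapes the environment through a state abstraction, so it can alter the state space and dynamics presented to the agent rather than only its rewards.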