Paper Title
Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies
Paper Authors
Paper Abstract
We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the learner must satisfy. The baseline policy can arise from demonstration data or a teacher agent and may provide useful cues for learning, but it might also be sub-optimal for the task at hand, and it is not guaranteed to satisfy the specified constraints, which might encode safety, fairness, or other application-specific requirements. In order to safely learn from baseline policies, we propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set. We analyze our algorithm theoretically and provide a finite-time convergence guarantee. In our experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art baselines, achieving 10 times fewer constraint violations and 40% higher reward on average.
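The abstract does not spell out the update rules, so the following Python sketch only illustrates the three-step alternation it describes (reward maximization, staying close to the baseline policy, projection onto the constraint set) on a toy parameter-vector problem. The reward and cost functions, the trust-region radius, the cost limit, and the use of simple Euclidean projections are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the paper's exact algorithm) of the alternation
# described in the abstract, on a toy problem where a parameter vector
# stands in for the policy. All quantities below are toy assumptions.
import numpy as np

def reward(theta):
    # Toy task reward: peaks at theta = [2, 2].
    return -np.sum((theta - 2.0) ** 2)

def reward_grad(theta):
    return -2.0 * (theta - 2.0)

def cost(theta):
    # Toy constraint cost: a linear function of the parameters.
    return float(np.sum(theta))

def cost_grad(theta):
    return np.ones_like(theta)

def project_to_ball(theta, center, radius):
    # Step 2: stay close to the baseline policy by projecting the
    # parameters onto a ball of the given radius around it.
    diff = theta - center
    dist = np.linalg.norm(diff)
    if dist <= radius:
        return theta
    return center + radius * diff / dist

def project_to_constraint(theta, limit):
    # Step 3: approximate projection onto {cost <= limit} with a
    # first-order (linearized) correction along the cost gradient.
    c = cost(theta)
    if c <= limit:
        return theta
    g = cost_grad(theta)
    return theta - (c - limit) / (np.dot(g, g) + 1e-8) * g

# Baseline policy parameters: possibly suboptimal for the task and not
# guaranteed to satisfy the constraint.
theta_baseline = np.array([0.0, 3.0])
theta = theta_baseline.copy()
lr, radius, cost_limit = 0.1, 2.0, 3.0

for step in range(200):
    # Step 1: maximize expected return (gradient ascent on the task reward).
    theta = theta + lr * reward_grad(theta)
    # Step 2: minimize distance to the baseline policy.
    theta = project_to_ball(theta, theta_baseline, radius)
    # Step 3: project onto the constraint-satisfying set.
    theta = project_to_constraint(theta, cost_limit)

print("final parameters:", theta, "reward:", reward(theta), "cost:", cost(theta))
```

The point of the sketch is only the structure of each iteration: a reward-improvement step followed by two projections, one keeping the learner near the possibly constraint-violating baseline and one enforcing the constraint; the paper's actual updates operate on policy distributions rather than raw parameter vectors.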