Paper Title


A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

Authors

Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, Haipeng Luo

Abstract

Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves a regret bound of $O(\sqrt{T})$ for the general class of weakly communicating MDPs, where $T$ is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodic assumption, and matches the lower bound in terms of $T$ except for logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.
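For context, the regret being bounded is the standard notion for the infinite-horizon average-reward setting (the definition below is the usual one for this setting rather than spelled out in the abstract): with $\rho^*$ the optimal long-run average reward of the MDP and $r_t$ the reward collected at step $t$, the regret after $T$ interactions is

$$\mathrm{Regret}(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t,$$

and the claimed $O(\sqrt{T})$ bound controls this quantity up to logarithmic factors.

To make the role of the "online concentrating approximation of the optimal average reward" concrete, the sketch below shows how an estimate $\hat{\rho} \approx \rho^*$ typically enters a tabular relative (average-reward) Q-learning update. This is a schematic illustration under that assumption, not the paper's actual EE-QL update rule; the function name, the toy environment, and the step sizes are all illustrative.

    import numpy as np

    def relative_q_update(Q, s, a, r, s_next, rho_hat, alpha):
        """One tabular relative Q-learning step.

        rho_hat is an estimate of the optimal average reward (the abstract's
        "online concentrating approximation"); alpha is the learning rate.
        Illustrative sketch only -- not the paper's exact EE-QL update.
        """
        td_target = r - rho_hat + np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q

    # Toy usage on a 2-state, 2-action MDP with synthetic dynamics.
    rng = np.random.default_rng(0)
    Q = np.zeros((2, 2))
    s, rho_hat = 0, 0.5  # rho_hat stands in for the concentrating estimate of rho*
    for t in range(1, 1001):
        a = int(np.argmax(Q[s])) if rng.random() > 0.1 else int(rng.integers(2))
        s_next = int(rng.integers(2))  # synthetic transition
        r = float(a == s)              # synthetic reward
        Q = relative_q_update(Q, s, a, r, s_next, rho_hat, alpha=1.0 / np.sqrt(t))
        s = s_next

Per the abstract, the availability of such a concentrating estimate of $\rho^*$ is the assumption EE-QL makes in place of the ergodicity required by earlier model-free $O(\sqrt{T})$ results, which is what extends the guarantee to the general class of weakly communicating MDPs.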
