Paper Title
A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret
Paper Authors
Paper Abstract
Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and its flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves a regret bound of $O(\sqrt{T})$ for the general class of weakly communicating MDPs, where $T$ is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodicity assumption, matching the lower bound in $T$ up to logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.
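To make the average-reward Q-learning setting concrete, the sketch below shows a generic tabular update that subtracts an online estimate of the optimal average reward from each observed reward, rather than discounting future values. It is only an illustration under assumed interfaces: `env.reset()`, `env.step()`, and the placeholder `rho_hat_fn` (standing in for the concentrating gain estimate that EE-QL assumes is available) are hypothetical, and the exploration shown here is plain epsilon-greedy, not the paper's exploration enhancement.

```python
import numpy as np

def average_reward_q_learning(env, num_states, num_actions, T, rho_hat_fn,
                              epsilon=0.1, alpha=0.1, seed=0):
    """Minimal tabular Q-learning sketch for the average-reward setting.

    Assumptions (not from the paper):
      - env.reset() returns an integer state; env.step(a) returns (next_state, reward).
      - rho_hat_fn(t) is a user-supplied online estimate of the optimal
        average reward, playing the role of the concentrating approximation
        that EE-QL requires.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for t in range(1, T + 1):
        # Epsilon-greedy action selection (illustrative only).
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        # Average-reward TD target: subtract the current gain estimate
        # instead of applying a discount factor.
        td_target = r - rho_hat_fn(t) + np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q
```

A constant `rho_hat_fn` (e.g., `lambda t: 0.5`) suffices to run the sketch; the quality of this estimate is what drives the regret analysis in the paper.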