Paper Title
A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret
Paper Authors
Paper Abstract
Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and its flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves a regret bound of $O(\sqrt{T})$ for the general class of weakly communicating MDPs, where $T$ is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodicity assumption, matching the lower bound in $T$ up to logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.
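To make the average-reward Q-learning setting concrete, the sketch below shows a generic tabular update that subtracts an online estimate of the optimal average reward from each observed reward, rather than discounting future values. It is only an illustration under assumed interfaces: `env.reset()`, `env.step()`, and the placeholder `rho_hat_fn` (standing in for the concentrating gain estimate that EE-QL assumes is available) are hypothetical, and the exploration shown here is plain epsilon-greedy, not the paper's exploration enhancement.

```python
import numpy as np

def average_reward_q_learning(env, num_states, num_actions, T, rho_hat_fn,
                              epsilon=0.1, alpha=0.1, seed=0):
    """Minimal tabular Q-learning sketch for the average-reward setting.

    Assumptions (not from the paper):
      - env.reset() returns an integer state; env.step(a) returns (next_state, reward).
      - rho_hat_fn(t) is a user-supplied online estimate of the optimal
        average reward, playing the role of the concentrating approximation
        that EE-QL requires.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for t in range(1, T + 1):
        # Epsilon-greedy action selection (illustrative only).
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        # Average-reward TD target: subtract the current gain estimate
        # instead of applying a discount factor.
        td_target = r - rho_hat_fn(t) + np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q
```

A constant `rho_hat_fn` (e.g., `lambda t: 0.5`) suffices to run the sketch; the quality of this estimate is what drives the regret analysis in the paper.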