Paper title
Square-root regret bounds for continuous-time episodic Markov decision processes
Paper authors
Paper abstract
We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially distributed, with rate parameters depending on the state--action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bounds. We derive an upper bound on the worst-case expected regret of the proposed algorithm and establish a worst-case lower bound; both bounds are of square-root order in the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
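The abstract's key modeling assumption — exponentially distributed inter-transition times whose rates depend on the current state--action pair — can be illustrated with a toy episode simulator. This is a minimal sketch, not the paper's algorithm; the rate table, transition kernel, and policy below are hypothetical placeholders.

```python
import random

def simulate_episode(rates, transition, policy, s0, horizon):
    """Simulate one finite-horizon episode of a continuous-time MDP.

    Sojourn times are Exponential(rates[(s, a)]), i.e. the rate
    depends on the current state-action pair; the episode ends
    once the next jump would exceed `horizon`.
    """
    t, s, trajectory = 0.0, s0, []
    while True:
        a = policy(s, t)
        sojourn = random.expovariate(rates[(s, a)])  # Exp(lambda(s, a))
        if t + sojourn > horizon:
            break  # episode ends at the horizon
        t += sojourn
        s_next = transition(s, a)  # sample the post-jump state
        trajectory.append((t, s, a, s_next))
        s = s_next
    return trajectory

# Hypothetical two-state example: state flips at each jump,
# with state-dependent jump rates and a single action 0.
rates = {(0, 0): 2.0, (1, 0): 1.0}
transition = lambda s, a: 1 - s
policy = lambda s, t: 0
traj = simulate_episode(rates, transition, policy, s0=0, horizon=5.0)
```

Jump times in `traj` are strictly increasing and bounded by the horizon, matching the episodic setting described in the abstract.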