Paper title
Square-root regret bounds for continuous-time episodic Markov decision processes
Paper authors
Paper abstract
We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially distributed, with rate parameters depending on the state--action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bounds. We derive an upper bound on the worst-case expected regret of the proposed algorithm and establish a worst-case lower bound; both bounds are of square-root order in the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
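The abstract's key modeling assumption — exponentially distributed inter-transition times whose rates depend on the current state--action pair — can be illustrated with a toy episode simulator. This is a minimal sketch, not the paper's algorithm; the rate table, transition kernel, and policy below are hypothetical placeholders.

```python
import random

def simulate_episode(rates, transition, policy, s0, horizon):
    """Simulate one finite-horizon episode of a continuous-time MDP.

    Sojourn times are Exponential(rates[(s, a)]), i.e. the rate
    depends on the current state-action pair; the episode ends
    once the next jump would exceed `horizon`.
    """
    t, s, trajectory = 0.0, s0, []
    while True:
        a = policy(s, t)
        sojourn = random.expovariate(rates[(s, a)])  # Exp(lambda(s, a))
        if t + sojourn > horizon:
            break  # episode ends at the horizon
        t += sojourn
        s_next = transition(s, a)  # sample the post-jump state
        trajectory.append((t, s, a, s_next))
        s = s_next
    return trajectory

# Hypothetical two-state example: state flips at each jump,
# with state-dependent jump rates and a single action 0.
rates = {(0, 0): 2.0, (1, 0): 1.0}
transition = lambda s, a: 1 - s
policy = lambda s, t: 0
traj = simulate_episode(rates, transition, policy, s0=0, horizon=5.0)
```

Jump times in `traj` are strictly increasing and bounded by the horizon, matching the episodic setting described in the abstract.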