Paper Title
Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP
Paper Authors
Paper Abstract
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP), given access to a generative model (simulator). When the ground-truth MDP is weakly communicating, we prove an upper bound of $\widetilde O(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability. This bound improves upon the best-known mixing-time-based approaches in [Jin & Sidford 2021], which assume that the mixing time of every deterministic policy is bounded. The core of our analysis is a proper reduction from AMDP problems to discounted MDP (DMDP) problems, which may be of independent interest since it allows DMDP algorithms to be applied to AMDPs in other settings. We complement our upper bound by proving a minimax lower bound of $\Omega(|\mathcal S| |\mathcal A| H \varepsilon^{-2} \ln \frac{1}{\delta})$ total samples, showing that a linear dependence on $H$ is necessary and that our upper bound matches the lower bound in all parameters $(|\mathcal S|, |\mathcal A|, H, \ln \frac{1}{\delta})$ up to logarithmic factors.
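A rough back-of-the-envelope sketch of how such a reduction can produce the stated rate (not the paper's exact argument): assume a generative-model DMDP solver with per state-action-pair sample complexity $\widetilde O\big((1-\gamma)^{-3}\varepsilon_{\mathrm{DMDP}}^{-2}\ln\frac{1}{\delta}\big)$ over the relevant accuracy range, and use the heuristic approximation $\rho^\pi \approx (1-\gamma)V^\pi_\gamma$ with error on the order of $(1-\gamma)H$ for the policies involved (controlling this bias-span error rigorously is the technical content of the reduction). Then an $\varepsilon_{\mathrm{DMDP}}$-optimal DMDP policy is roughly $\big((1-\gamma)\varepsilon_{\mathrm{DMDP}} + (1-\gamma)H\big)$-optimal for the AMDP, so choosing
$$
1-\gamma = \Theta\!\Big(\frac{\varepsilon}{H}\Big), \qquad \varepsilon_{\mathrm{DMDP}} = \Theta\!\Big(\frac{\varepsilon}{1-\gamma}\Big) = \Theta(H)
\;\;\Longrightarrow\;\;
\widetilde O\!\Big(\frac{\ln\frac{1}{\delta}}{(1-\gamma)^{3}\,\varepsilon_{\mathrm{DMDP}}^{2}}\Big)
= \widetilde O\!\Big(\frac{H^{3}}{\varepsilon^{3}}\cdot\frac{1}{H^{2}}\cdot\ln\frac{1}{\delta}\Big)
= \widetilde O\!\Big(\frac{H}{\varepsilon^{3}}\ln\frac{1}{\delta}\Big).
$$
Here $\gamma$, $\varepsilon_{\mathrm{DMDP}}$, $\rho^\pi$, and $V^\pi_\gamma$ are illustrative notation introduced only for this sketch (the discount factor, the accuracy requested from the DMDP solver, and the average and discounted values of a policy $\pi$); the precise constants, solver guarantees, and error analysis are as given in the paper.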