Paper Title
Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP
Paper Authors
Paper Abstract
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP), given access to a generative model (simulator). When the ground-truth MDP is weakly communicating, we prove an upper bound of $\widetilde O(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability. This bound improves upon the best-known mixing-time-based approaches in [Jin & Sidford 2021], which assume that the mixing time of every deterministic policy is bounded. The core of our analysis is a proper reduction from AMDP problems to discounted MDP (DMDP) problems, which may be of independent interest since it allows DMDP algorithms to be applied to AMDPs in other settings. We complement our upper bound by proving a minimax lower bound of $\Omega(|\mathcal S| |\mathcal A| H \varepsilon^{-2} \ln \frac{1}{\delta})$ total samples, showing that a linear dependence on $H$ is necessary and that our upper bound matches the lower bound in all parameters $(|\mathcal S|, |\mathcal A|, H, \ln \frac{1}{\delta})$ up to logarithmic factors.
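A rough back-of-the-envelope sketch of how such a reduction can produce the stated rate (not the paper's exact argument): assume a generative-model DMDP solver with per state-action-pair sample complexity $\widetilde O\big((1-\gamma)^{-3}\varepsilon_{\mathrm{DMDP}}^{-2}\ln\frac{1}{\delta}\big)$ over the relevant accuracy range, and use the heuristic approximation $\rho^\pi \approx (1-\gamma)V^\pi_\gamma$ with error on the order of $(1-\gamma)H$ for the policies involved (controlling this bias-span error rigorously is the technical content of the reduction). Then an $\varepsilon_{\mathrm{DMDP}}$-optimal DMDP policy is roughly $\big((1-\gamma)\varepsilon_{\mathrm{DMDP}} + (1-\gamma)H\big)$-optimal for the AMDP, so choosing
$$
1-\gamma = \Theta\!\Big(\frac{\varepsilon}{H}\Big), \qquad \varepsilon_{\mathrm{DMDP}} = \Theta\!\Big(\frac{\varepsilon}{1-\gamma}\Big) = \Theta(H)
\;\;\Longrightarrow\;\;
\widetilde O\!\Big(\frac{\ln\frac{1}{\delta}}{(1-\gamma)^{3}\,\varepsilon_{\mathrm{DMDP}}^{2}}\Big)
= \widetilde O\!\Big(\frac{H^{3}}{\varepsilon^{3}}\cdot\frac{1}{H^{2}}\cdot\ln\frac{1}{\delta}\Big)
= \widetilde O\!\Big(\frac{H}{\varepsilon^{3}}\ln\frac{1}{\delta}\Big).
$$
Here $\gamma$, $\varepsilon_{\mathrm{DMDP}}$, $\rho^\pi$, and $V^\pi_\gamma$ are illustrative notation introduced only for this sketch (the discount factor, the accuracy requested from the DMDP solver, and the average and discounted values of a policy $\pi$); the precise constants, solver guarantees, and error analysis are as given in the paper.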