Paper Title
Improved Sample Complexity for Incremental Autonomous Exploration in MDPs
Paper Authors
Paper Abstract
We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $\varepsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we introduce a novel model-based approach that interleaves discovering new states from $s_0$ and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as $\tilde{O}(L^5 S_{L+\varepsilon} \Gamma_{L+\varepsilon} A \varepsilon^{-2})$, where $A$ is the number of actions, $S_{L+\varepsilon}$ is the number of states that are incrementally reachable from $s_0$ in $L+\varepsilon$ steps, and $\Gamma_{L+\varepsilon}$ is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both $\varepsilon$ and $L$ at the cost of an extra $\Gamma_{L+\varepsilon}$ factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an $\varepsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem defined on the $L$-reachable states with minimum cost $c_{\min}$. Finally, we report preliminary empirical results confirming our theoretical findings.
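For intuition only, the sketch below illustrates the kind of interleaving scheme the abstract describes: alternately discovering new states reachable from $s_0$ and refining an estimated model that is used to plan goal-conditioned policies toward newly observed states. It is a toy Python demo on a small random MDP, not the paper's DisCo algorithm; the constants (`L`, `HORIZON`, `EPISODES`), the helper `plan_to_goal`, and the discovery rule are all assumptions made for illustration.

```python
# A minimal toy sketch (not the paper's DisCo implementation) of interleaving
# state discovery from s0 with model estimation and goal-conditioned planning.
# The tiny random MDP, constants, and helper names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, A, L, HORIZON, EPISODES = 12, 3, 6, 30, 2000
P = rng.dirichlet(np.ones(S) * 0.3, size=(S, A))   # toy ground-truth dynamics P[s, a, s']

counts = np.zeros((S, A, S))                        # observed transition counts
known = {0}                                         # states currently believed L-reachable from s0

def plan_to_goal(P_hat, goal, n_iter=50):
    """Goal-conditioned planning: value iteration for the expected number of
    steps to reach `goal` under the estimated model (capped at HORIZON)."""
    V = np.full(S, float(HORIZON))
    V[goal] = 0.0
    for _ in range(n_iter):
        Q = 1.0 + P_hat @ V                         # (S, A): one step + expected steps-to-go
        V = np.minimum(Q.min(axis=1), HORIZON)
        V[goal] = 0.0
    return Q.argmin(axis=1), V

for _ in range(EPISODES):
    # Model estimate, with a uniform guess for unvisited (s, a) pairs.
    n_sa = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / S)

    # Frontier: states observed as successors of known states but not yet marked known.
    frontier = {int(s2) for s in known for a in range(A)
                for s2 in np.flatnonzero(counts[s, a]) if s2 not in known}
    goal = rng.choice(sorted(frontier)) if frontier else rng.integers(S)

    policy, V = plan_to_goal(P_hat, goal)
    s, reached = 0, False
    for _ in range(HORIZON):                        # roll out the goal-conditioned policy from s0
        a = policy[s] if s in known else rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        counts[s, a, s_next] += 1
        s = s_next
        reached = reached or (s == goal)
    # Mark the goal as discovered once it was reached and the estimated model
    # predicts it is reachable from s0 within roughly L steps in expectation.
    if reached and V[0] <= L:
        known.add(int(goal))

print("states marked as incrementally L-reachable from s0:", sorted(known))
```

The loop mirrors the abstract's two interleaved ingredients: the rollouts both expand the frontier of discovered states and feed the counts that sharpen the model estimate used for the next round of goal-conditioned planning. How candidate goals are selected, when a state is certified as $L$-reachable, and how samples are allocated are exactly the design choices the paper's analysis addresses; the heuristics above stand in for them only to make the structure concrete.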