论文标题
随机路径问题的统一算法
A Unified Algorithm for Stochastic Path Problems
论文作者
论文摘要
我们研究随机路径(SP)问题中的增强学习。这些问题的目标是最大化预期的奖励总和,直到代理到达最终状态为止。我们通过分析一种简单的乐观算法来提供此一般问题中的第一个遗憾保证。我们的遗憾与所有非阳性奖励的随机最短路径(SSP)的特殊情况相匹配。对于SSP,我们为奖励$ b_ \ star $未知的情况提供了一个适应过程。我们表明没有适应的价格,我们的遗憾与已知的$ b_ \ star $相匹配。我们还为随机时间最长路径(SLP)的特殊情况提供了量表适应程序,其中所有奖励都是不负的。但是,与SSP不同,我们通过下限表明适应性不可避免。
We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees in this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards. For SSP, we present an adaptation procedure for the case when the scale of rewards $B_\star$ is unknown. We show that there is no price for adaptation, and our regret bound matches that with a known $B_\star$. We also provide a scale adaptation procedure for the special case of stochastic longest paths (SLP) where all rewards are non-negative. However, unlike in SSP, we show through a lower bound that there is an unavoidable price for adaptation.