Paper Title

State Action Separable Reinforcement Learning

Paper Authors

Ziyao Zhang, Liang Ma, Kin K. Leung, Konstantinos Poularakis, Mudhakar Srivatsa

Paper Abstract

Reinforcement Learning (RL) based methods have seen their paramount successes in solving serial decision-making and control problems in recent years. For conventional RL formulations, Markov Decision Process (MDP) and state-action-value function are the basis for the problem modeling and policy evaluation. However, several challenging issues still remain. Among most cited issues, the enormity of state/action space is an important factor that causes inefficiency in accurately approximating the state-action-value function. We observe that although actions directly define the agents' behaviors, for many problems the next state after a state transition matters more than the action taken, in determining the return of such a state transition. In this regard, we propose a new learning paradigm, State Action Separable Reinforcement Learning (sasRL), wherein the action space is decoupled from the value function learning process for higher efficiency. Then, a light-weight transition model is learned to assist the agent to determine the action that triggers the associated state transition. In addition, our convergence analysis reveals that under certain conditions, the convergence time of sasRL is $O(T^{1/k})$, where $T$ is the convergence time for updating the value function in the MDP-based formulation and $k$ is a weighting factor. Experiments on several gaming scenarios show that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to $75\%$.
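To make the described decoupling concrete, below is a minimal, hypothetical tabular sketch of the idea, not the authors' implementation: a state-value table is updated with no action dimension, while a separate light-weight lookup plays the role of the transition model that maps a desired state transition back to the action triggering it. The names (`V`, `inverse_model`, `update_from_transition`, `act`) and the tabular/discrete setting are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical sketch of the sasRL idea: value learning over states only,
# plus a separate transition (inverse) model that recovers actions.
# The tabular setting and all names here are illustrative assumptions.

n_states = 16
gamma, alpha = 0.99, 0.1

V = np.zeros(n_states)        # state-value table: no action dimension at all
inverse_model = {}            # (s, s_next) -> action, filled in from experience


def update_from_transition(s, a, r, s_next):
    """One update from an observed transition (s, a, r, s')."""
    # Value learning ignores the action: the next state determines the return.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    # The light-weight transition model remembers which action produced s -> s'.
    inverse_model[(s, s_next)] = a


def act(s, candidate_next_states):
    """Pick the reachable next state with the highest value, then look up the action."""
    s_target = max(candidate_next_states, key=lambda sp: V[sp])
    return inverse_model.get((s, s_target))  # None if this transition is still unseen


# Example usage: observe one transition, update, then act from the same state.
update_from_transition(s=0, a=1, r=1.0, s_next=3)
print(act(0, candidate_next_states=[3, 5]))   # prints 1 once (0, 3) has been seen
```

This sketch only illustrates why separating states from actions can shrink the learned value function: `V` scales with the number of states rather than state-action pairs, and the action is recovered afterwards from the transition model.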
