Paper Title

SOAC: The Soft Option Actor-Critic Architecture

Paper Authors

Chenghao Li, Xiaoteng Ma, Chongjie Zhang, Jun Yang, Li Xia, Qianchuan Zhao

Paper Abstract

The option framework has shown great promise by automatically extracting temporally-extended sub-tasks from a long-horizon task. Methods have been proposed for concurrently learning low-level intra-option policies and a high-level option selection policy. However, existing methods typically suffer from two major challenges: ineffective exploration and unstable updates. In this paper, we present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges. Our approach introduces an information-theoretic intrinsic reward to encourage the identification of diverse and effective options. Meanwhile, we utilize a probabilistic inference model to simplify the optimization problem to fitting optimal trajectories. Experimental results demonstrate that our approach significantly outperforms prior on-policy and off-policy methods on a range of MuJoCo benchmark tasks while still providing benefits for transfer learning. In these tasks, our approach learns a diverse set of options, each of which occupies a coherent region of the state-action space.
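The abstract does not spell out the form of the information-theoretic intrinsic reward. As a rough sketch of the general idea only (not necessarily the paper's exact formulation), the snippet below computes a reward of the form log q(o | s, a) - log p(o), where q is a learned option discriminator and p(o) is assumed uniform, so an option is rewarded when it is identifiable from its own state-action pairs. The names OptionDiscriminator and intrinsic_reward, the network sizes, and the uniform prior are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OptionDiscriminator(nn.Module):
    """Predicts which option o generated a given (state, action) pair.

    Hypothetical module: the abstract only states that an information-theoretic
    intrinsic reward encourages diverse options; this parameterization is an
    illustrative assumption.
    """

    def __init__(self, state_dim: int, action_dim: int, num_options: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_options),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Returns log q(o | s, a) for every option o.
        logits = self.net(torch.cat([state, action], dim=-1))
        return F.log_softmax(logits, dim=-1)


def intrinsic_reward(disc: OptionDiscriminator,
                     state: torch.Tensor,
                     action: torch.Tensor,
                     option: torch.Tensor,
                     num_options: int) -> torch.Tensor:
    """r_int = log q(o | s, a) - log p(o), with p(o) uniform over options.

    High when the executed option is easy to identify from (s, a), i.e. when
    options occupy distinct, coherent regions of the state-action space.
    """
    log_q = disc(state, action)                                   # [batch, num_options]
    log_q_o = log_q.gather(-1, option.unsqueeze(-1)).squeeze(-1)  # [batch]
    log_p_o = -torch.log(torch.tensor(float(num_options)))        # uniform prior (assumed)
    return (log_q_o - log_p_o).detach()                           # used as a reward signal


if __name__ == "__main__":
    disc = OptionDiscriminator(state_dim=8, action_dim=2, num_options=4)
    s = torch.randn(32, 8)
    a = torch.randn(32, 2)
    o = torch.randint(0, 4, (32,))
    print(intrinsic_reward(disc, s, a, o, num_options=4).shape)  # torch.Size([32])
```

In a setup like this, the intrinsic term would typically be added to the environment reward when training the intra-option policies, which matches the abstract's goal of identifying diverse options whose state-action spaces are internally coherent.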
