Paper Title
Temporally Extended Successor Representations
Paper Authors
Paper Abstract
We present a temporally extended variation of the successor representation, which we term t-SR. t-SR captures the expected state transition dynamics of temporally extended actions by constructing successor representations over primitive action repeats. This form of temporal abstraction does not learn a top-down hierarchy of pertinent task structures, but rather a bottom-up composition of coupled actions and action repetitions. This lessens the number of decisions required for control without learning a hierarchical policy. As such, t-SR directly accounts for the time horizon of temporally extended action sequences without the need for predefined or domain-specific options. We show that in environments with dynamic reward structures, t-SR is able to leverage both the flexibility of the successor representation and the abstraction afforded by temporally extended actions. Thus, in a series of sparsely rewarded gridworld environments, t-SR adapts learnt policies to the optimum far faster than comparable value-based, model-free reinforcement learning methods. We also show that the manner in which t-SR learns to solve these tasks means the learnt policy is sampled consistently less often than non-temporally extended policies.
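To make the idea concrete, the sketch below illustrates one way a tabular successor representation could be learnt over temporally extended actions, i.e. primitive actions coupled with repeat counts, as the abstract describes. This is not the authors' implementation: the repeat counts, learning rates, and the environment interface `env.step(a) -> (next_state, reward, done)` are assumptions made purely for illustration.

```python
import numpy as np

# Minimal tabular sketch (not the authors' code) of a successor representation
# learnt over temporally extended actions: each extended action is a primitive
# action coupled with a repeat count. The environment interface and the repeat
# lengths below are assumptions for illustration only.

n_states = 25                               # e.g. a 5x5 gridworld
primitive_actions = [0, 1, 2, 3]            # up, down, left, right
repeats = [1, 2, 4]                         # assumed repeat lengths
ext_actions = [(a, k) for a in primitive_actions for k in repeats]

gamma, alpha_sr, alpha_w = 0.95, 0.1, 0.1

# M[s, i] ~ expected discounted occupancy of every state after taking
# extended action i in state s; w ~ per-state reward estimates.
M = np.zeros((n_states, len(ext_actions), n_states))
w = np.zeros(n_states)

def run_extended_action(env, s, ext_idx):
    """Repeat one primitive action k times, accumulating the discounted
    occupancy of the states visited and updating the reward weights."""
    a, k = ext_actions[ext_idx]
    occupancy = np.zeros(n_states)
    discount, done = 1.0, False
    for _ in range(k):
        occupancy[s] += discount            # credit the state acted from
        s, r, done = env.step(a)
        w[s] += alpha_w * (r - w[s])        # one-step reward estimate
        discount *= gamma
        if done:
            break
    return s, occupancy, discount, done

def sr_update(s, ext_idx, s_next, occupancy, discount, done):
    """TD update of the SR over extended actions; the bootstrap is discounted
    by gamma^k, the horizon of the extended action just executed."""
    greedy_next = np.argmax(M[s_next] @ w)
    target = occupancy + (0.0 if done else discount * M[s_next, greedy_next])
    M[s, ext_idx] += alpha_sr * (target - M[s, ext_idx])

# Because Q(s, i) = M[s, i] @ w, a change in the reward structure only
# requires relearning w, while the extended-action dynamics captured
# in M carry over, which is the flexibility the abstract refers to.
```

The final comment points at the usual appeal of successor representations: when rewards move, only the reward weights need to be relearnt, and here the transition knowledge that persists is defined over whole action-repeat sequences rather than single steps.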