Paper Title

Dichotomy of Control: Separating What You Can Control from What You Cannot

Paper Authors

Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum

Paper Abstract

Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), where the future outcome (i.e., return) associated with an observed action sequence is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as decision transformer (DT), these methods tend to perform poorly in highly stochastic environments, where an occasional high return can arise from randomness in the environment rather than the actions themselves. Such situations can lead to a learned policy that is inconsistent with its conditioning inputs; i.e., using the policy to act in the environment, when conditioning on a specific desired return, leads to a distribution of real returns that is wildly different than desired. In this work, we propose the dichotomy of control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity). We achieve this separation by conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment. Theoretically, we show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior. Empirically, we show that DoC is able to achieve significantly better performance than DT on environments that have highly stochastic rewards and transitions.
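As a rough sketch of the mechanism described in the abstract (not the paper's exact formulation; the symbols $\pi$, $q$, $z$, and the conditioning sets are assumptions for illustration), the future-conditioned objective with a mutual-information constraint can be pictured as

$$\max_{\pi,\, q}\ \mathbb{E}_{\tau \sim \mathcal{D},\ z \sim q(z \mid \tau)}\Big[\textstyle\sum_t \log \pi(a_t \mid s_{0:t}, a_{0:t-1}, z)\Big]\quad \text{s.t.}\quad I\big(z;\, r_t, s_{t+1} \mid s_{0:t}, a_{0:t}\big) = 0 \ \ \forall t,$$

where $z$ is a latent representation of the observed future, $q$ infers it from the trajectory $\tau$, and the constraint strips from $z$ any information about reward and transition randomness that lies outside the policy's control.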
