Paper Title

A Mixture of Surprises for Unsupervised Reinforcement Learning

Authors

Andrew Zhao, Matthieu Gaetan Lin, Yangguang Li, Yong-Jin Liu, Gao Huang

Abstract

Unsupervised reinforcement learning aims at learning a generalist policy in a reward-free manner for fast adaptation to downstream tasks. Most of the existing methods propose to provide an intrinsic reward based on surprise. Maximizing or minimizing surprise drives the agent to either explore or gain control over its environment. However, both strategies rely on a strong assumption: the entropy of the environment's dynamics is either high or low. This assumption may not always hold in real-world scenarios, where the entropy of the environment's dynamics may be unknown. Hence, choosing between the two objectives is a dilemma. We propose a novel yet simple mixture of policies to address this concern, allowing us to optimize an objective that simultaneously maximizes and minimizes the surprise. Concretely, we train one mixture component whose objective is to maximize the surprise and another whose objective is to minimize the surprise. Hence, our method does not make assumptions about the entropy of the environment's dynamics. We call our method a $\textbf{M}\text{ixture }\textbf{O}\text{f }\textbf{S}\text{urprise}\textbf{S}$ (MOSS) for unsupervised reinforcement learning. Experimental results show that our simple method achieves state-of-the-art performance on the URLB benchmark, outperforming previous pure surprise maximization-based objectives. Our code is available at: https://github.com/LeapLabTHU/MOSS.
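To make the abstract's idea concrete, here is a minimal sketch of a mixture of surprise objectives: one mixture component is rewarded for surprise and the other for its negation, so no assumption about the entropy of the environment's dynamics is needed. This is not the authors' implementation (see the repository linked above); the `surprise_estimate` helper, the count-based novelty proxy, and the per-episode mode sampling are illustrative assumptions only.

```python
import numpy as np

def surprise_estimate(next_obs, visit_counts):
    """Toy surprise proxy: rarely visited states are more 'surprising'.

    A count-based novelty bonus stands in for the entropy/particle-based
    estimators typically used in unsupervised RL. Hypothetical helper,
    not part of the MOSS codebase.
    """
    key = tuple(np.round(next_obs, 1))
    visit_counts[key] = visit_counts.get(key, 0) + 1
    return 1.0 / np.sqrt(visit_counts[key])

def intrinsic_reward(next_obs, mode, visit_counts):
    """One mixture component maximizes surprise (+1), the other minimizes it (-1)."""
    sign = 1.0 if mode == "maximize" else -1.0
    return sign * surprise_estimate(next_obs, visit_counts)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visit_counts = {}
    for episode in range(4):
        # Sample which mixture component controls this episode.
        mode = rng.choice(["maximize", "minimize"])
        total = 0.0
        for _ in range(100):
            next_obs = rng.normal(size=2)  # stand-in for an environment transition
            total += intrinsic_reward(next_obs, mode, visit_counts)
        print(f"episode {episode}: mode={mode}, intrinsic return={total:.2f}")
```

In this sketch the two objectives share one surprise estimate and differ only in the sign of the intrinsic reward, which mirrors the abstract's description of training one surprise-maximizing and one surprise-minimizing component within a single mixture policy.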
