D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Paper Title

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Authors

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine

Abstract

The offline reinforcement learning (RL) setting (also known as full batch RL), where a policy is learned from a static dataset, is compelling as progress enables RL methods to take advantage of large, previously-collected datasets, much like how the rise of large datasets has fueled results in supervised learning. However, existing online RL benchmarks are not tailored towards the offline setting and existing offline RL benchmarks are restricted to data generated by partially-trained agents, making progress in offline RL difficult to measure. In this work, we introduce benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL. With a focus on dataset collection, examples of such properties include: datasets generated via hand-designed controllers and human demonstrators, multitask datasets where an agent performs different tasks in the same environment, and datasets collected with mixtures of policies. By moving beyond simple benchmark tasks and data collected by partially-trained RL agents, we reveal important and unappreciated deficiencies of existing algorithms. To facilitate research, we have released our benchmark tasks and datasets with a comprehensive evaluation of existing algorithms, an evaluation protocol, and open-source examples. This serves as a common starting point for the community to identify shortcomings in existing offline RL methods and a collaborative route for progress in this emerging area.
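
To make the offline setting in the abstract concrete, here is a minimal sketch of learning a policy purely from a static dataset. It does not use the D4RL API or any D4RL task. The five-state chain environment, the function names (`step`, `collect_dataset`, `offline_q_learning`), and all constants are hypothetical and chosen only for illustration. The dataset is logged by a mixture of two behaviour policies, loosely mirroring the "mixtures of policies" property the abstract mentions, and the learner only reads the logged transitions and never queries the environment.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the terminal goal (toy assumption)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor


def step(state, action):
    """Toy chain dynamics, used ONLY to generate the static dataset."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def collect_dataset(num_episodes=200, max_steps=20, seed=0):
    """Log transitions from a mixture of two behaviour policies."""
    rng = random.Random(seed)
    data = []
    for episode in range(num_episodes):
        # Even episodes: uniformly random policy; odd episodes: mostly-right policy.
        p_right = 0.5 if episode % 2 == 0 else 0.8
        state = 0
        for _ in range(max_steps):
            action = 1 if rng.random() < p_right else 0
            next_state, reward, done = step(state, action)
            data.append((state, action, reward, next_state, done))
            state = next_state
            if done:
                break
    return data


def offline_q_learning(dataset, sweeps=200):
    """Tabular Q-learning that only replays the fixed dataset."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            target = r if done else r + GAMMA * max(q[s2])
            # The toy dynamics are deterministic, so a learning rate of 1 is enough.
            q[s][a] = target
    return q


dataset = collect_dataset()              # static data, collected once
q_values = offline_q_learning(dataset)   # no further environment interaction
greedy_policy = [max(ACTIONS, key=lambda a: q_values[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

In this toy case the logged data covers the state-action pairs that matter, so the greedy policy can be recovered from the data alone. The benchmark's premise is that realistic datasets are narrower or more biased than this, which is where the deficiencies of existing algorithms noted in the abstract show up.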
