Paper Title
Never Stop Learning: The Effectiveness of Fine-Tuning in Robotic Reinforcement Learning
Paper Authors
Paper Abstract
One of the great promises of robot learning systems is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments. Despite this potential, most robot learning systems today are deployed as fixed policies that are not adapted after deployment. Can we efficiently adapt previously learned behaviors to new environments, objects, and percepts in the real world? In this paper, we present a method and empirical evidence towards a robot learning framework that facilitates continuous adaptation. In particular, we demonstrate how to adapt vision-based robotic manipulation policies to new variations by fine-tuning via off-policy reinforcement learning, including changes in background, object shape and appearance, lighting conditions, and robot morphology. Further, this adaptation uses less than 0.2% of the data necessary to learn the task from scratch. We find that our approach of adapting pre-trained policies leads to substantial performance gains over the course of fine-tuning, and that pre-training via RL is essential: with such small amounts of data, training from scratch and adapting from supervised ImageNet features are both unsuccessful. We also find that these positive results hold in a limited continual learning setting, in which we repeatedly fine-tune a single lineage of policies using data from a succession of new tasks. Our empirical conclusions are consistently supported by experiments on simulated manipulation tasks, and by 52 unique fine-tuning experiments on a real robotic grasping system pre-trained on 580,000 grasps.
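To make the fine-tuning recipe concrete, below is a minimal Python (PyTorch) sketch of off-policy fine-tuning from a pre-trained checkpoint. It assumes details the abstract does not specify: the pre-trained policy is a Q-function over (observation, action) pairs, new-task transitions sit in a small replay buffer, and adaptation is plain TD learning with a target network. All names here (QNetwork, buffer.sample, pretrained.pt) are hypothetical, not the paper's actual implementation.

import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Tiny stand-in for a vision-based Q-function (flat features + action)."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def fine_tune(q: QNetwork, buffer, steps: int = 1000, gamma: float = 0.9,
              lr: float = 1e-4, tau: float = 0.005) -> QNetwork:
    """Off-policy TD fine-tuning starting from pre-trained weights."""
    target_q = copy.deepcopy(q)  # slow-moving target network
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    for _ in range(steps):
        # Hypothetical buffer API: batches of (s, a, r, s', a', done).
        obs, act, rew, next_obs, next_act, done = buffer.sample(batch_size=64)
        with torch.no_grad():
            # Next action is taken from the buffer here; QT-Opt-style methods
            # would instead maximize over actions (e.g., with CEM).
            td_target = rew + gamma * (1.0 - done) * target_q(next_obs, next_act)
        loss = nn.functional.mse_loss(q(obs, act), td_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Polyak-average the target network toward the online network.
        with torch.no_grad():
            for p, tp in zip(q.parameters(), target_q.parameters()):
                tp.lerp_(p, tau)
    return q

# Usage: the key step is initializing from pre-trained weights rather than
# from scratch, so only a small amount of new-task data is needed.
# q = QNetwork(obs_dim=64, act_dim=4)
# q.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint
# q = fine_tune(q, new_task_buffer)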