Title
Robot Object Retrieval with Contextual Natural Language Queries
Authors
Abstract
Natural language object retrieval is a highly useful yet challenging task for robots in human-centric environments. Previous work has primarily focused on commands specifying the desired object's type such as "scissors" and/or visual attributes such as "red," thus limiting the robot to only known object classes. We develop a model to retrieve objects based on descriptions of their usage. The model takes in a language command containing a verb, for example "Hand me something to cut," and RGB images of candidate objects, and selects the object that best satisfies the task specified by the verb. Our model directly predicts an object's appearance from the object's use specified by a verb phrase. We do not need to explicitly specify an object's class label. Our approach allows us to predict high-level concepts like an object's utility based on the language query. Based on contextual information present in the language commands, our model can generalize to unseen object classes and unknown nouns in the commands. Our model correctly selects objects out of sets of five candidates to fulfill natural language commands, and achieves an average accuracy of 62.3% on a held-out test set of unseen ImageNet object classes and 53.0% on unseen object classes and unknown nouns. Our model also achieves an average accuracy of 54.7% on unseen YCB object classes, which have a different image distribution from ImageNet objects. We demonstrate our model on a KUKA LBR iiwa robot arm, enabling the robot to retrieve objects based on natural language descriptions of their usage. We also present a new dataset of 655 verb-object pairs denoting object usage over 50 verbs and 216 object classes.
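The selection step the abstract describes (score each candidate image against the command, pick the best) can be illustrated with a minimal sketch. This is not the paper's architecture: the embeddings below are toy hand-set vectors, and `select_object` is a hypothetical helper standing in for learned language and vision encoders followed by a similarity-based ranking over the five candidates.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def select_object(command_embedding, candidate_embeddings):
    """Return the index of the candidate image embedding that best
    matches the appearance predicted from the verb command."""
    scores = [cosine(command_embedding, c) for c in candidate_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: a command like "hand me something to cut" (as a made-up
# embedding) should select the scissors-like candidate.
cut_command = [0.9, 0.1, 0.0]
candidates = [
    [0.1, 0.8, 0.2],    # e.g. a mug
    [0.88, 0.05, 0.1],  # e.g. scissors
    [0.0, 0.3, 0.9],    # e.g. a ball
]
print(select_object(cut_command, candidates))  # → 1
```

In the paper, generalization to unseen classes comes from the fact that the score depends only on the command and the image, never on an explicit class label for the candidate.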