Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement

Robotics: Science and Systems XIX | Pub Date: 2023-04-27 | DOI: 10.15607/RSS.2023.XIX.030

N. Gkanatsios, Ayush Jain, Zhou Xian, Yunchu Zhang, C. Atkeson, Katerina Fragkiadaki


Citations: 13

Abstract

Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions, and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then relocate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as on benchmarks of compositional instructions that we introduce. We show that our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets, are publicly available on our website: https://ebmplanner.github.io.
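The goal-generation step described in the abstract (gradient descent on a sum of energy functions, one per predicate) can be illustrated with a toy sketch. This is not the authors' implementation: the hinge-style energies `left_of` and `above`, the helper `plan_goal`, the object names, and the finite-difference gradients are all assumptions made for illustration, standing in for the learned energy functions and their gradients.

```python
# Toy sketch: compose one energy function per predicate and minimize their sum.
# Every name below is hypothetical; the paper's energies are learned, not hand-written.

def left_of(a, b, margin=0.2):
    """Energy is zero once object a sits at least `margin` to the left of b."""
    def energy(pos):
        gap = pos[a][0] - pos[b][0] + margin
        return max(0.0, gap) ** 2
    return energy

def above(a, b, margin=0.2):
    """Energy is zero once a's y-coordinate exceeds b's by at least `margin`."""
    def energy(pos):
        gap = pos[b][1] - pos[a][1] + margin
        return max(0.0, gap) ** 2
    return energy

def total_energy(predicates, pos):
    """Sum of the per-predicate energies for one scene configuration."""
    return sum(energy(pos) for energy in predicates)

def plan_goal(predicates, start, movable, lr=0.1, steps=500, eps=1e-5):
    """Gradient descent on the summed energy, moving only the `movable` objects.

    Gradients are estimated with central finite differences (an assumption
    made to keep the sketch dependency-free).
    """
    pos = {name: list(xy) for name, xy in start.items()}
    for _ in range(steps):
        grads = {}
        for name in movable:
            grad = []
            for axis in range(2):
                original = pos[name][axis]
                pos[name][axis] = original + eps
                upper = total_energy(predicates, pos)
                pos[name][axis] = original - eps
                lower = total_energy(predicates, pos)
                pos[name][axis] = original
                grad.append((upper - lower) / (2 * eps))
            grads[name] = grad
        for name in movable:
            for axis in range(2):
                pos[name][axis] -= lr * grads[name][axis]
    return pos

# "Put the mug to the left of the bowl and above the plate."
start = {"mug": (0.5, 0.0), "bowl": (0.0, 0.0), "plate": (0.0, 0.5)}
predicates = [left_of("mug", "bowl"), above("mug", "plate")]
goal = plan_goal(predicates, start, movable=["mug"])
```

Because the instruction is represented as a plain sum, a longer instruction only adds terms to `predicates`; nothing else in the optimization loop changes, which is the compositional property the abstract emphasizes.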
