To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Ke Su;Xingxing Zhang;Siyang Zhang;Jun Zhu;Bo Zhang
{"title":"To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training","authors":"Ke Su;Xingxing Zhang;Siyang Zhang;Jun Zhu;Bo Zhang","doi":"10.1109/TIP.2024.3459800","DOIUrl":null,"url":null,"abstract":"Recently, there exists an increased research interest in embodied artificial intelligence (EAI), which involves an agent learning to perform a specific task when dynamically interacting with the surrounding 3D environment. There into, a new challenge is that many unseen objects may appear due to the increased number of object categories in 3D scenes. It makes developing models with strong zero-shot generalization ability to new objects necessary. Existing work tries to achieve this goal by providing embodied agents with massive high-quality human annotations closely related to the task to be learned, while it is too costly in practice. Inspired by recent advances in pre-trained models in 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training that can encode common sense as general prior knowledge. To further improve its performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where the task-specific knowledge is learned from iterative message passing. Our method can improve a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% w.r.t. answer accuracy on MP3D-EQA dataset that consists of many real-world scenes with a large number of new objects during testing), and achieve the new state-of-the-art performance.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5370-5381"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10684038/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Recently, there has been increasing research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with its surrounding 3D environment. A new challenge in this setting is that many unseen objects may appear, owing to the increased number of object categories in 3D scenes, which makes it necessary to develop models with strong zero-shot generalization to new objects. Existing work pursues this goal by providing embodied agents with massive, high-quality human annotations closely tied to the task to be learned, but this is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which can encode common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where task-specific knowledge is learned from iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% in answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes with a large number of new objects during testing) and achieves new state-of-the-art performance.
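The abstract does not spell out MSGM's architecture, so the sketch below is only a minimal PyTorch illustration of the idea as stated: node features of a scene graph (e.g., pre-trained object representations) are randomly masked, refined by a few rounds of iterative message passing, and trained to reconstruct the originals in a self-supervised manner. The class name, mean-neighbor aggregation, masking ratio, and MSE objective are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class MaskedSceneGraphModel(nn.Module):
    """Sketch of masked scene graph modeling (MSGM): mask scene-graph node
    features, refine them with iterative message passing, and reconstruct
    the originals. All hyperparameters here are illustrative assumptions."""

    def __init__(self, feat_dim: int = 256, num_steps: int = 3, mask_ratio: float = 0.15):
        super().__init__()
        self.num_steps = num_steps
        self.mask_ratio = mask_ratio
        # Learned placeholder vector substituted for masked node features.
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        # Shared message/update network applied at every propagation step.
        self.update = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor):
        # node_feats: (N, D) pre-trained object features; adj: (N, N) float 0/1 adjacency.
        num_nodes = node_feats.size(0)
        mask = torch.rand(num_nodes, device=node_feats.device) < self.mask_ratio
        mask[torch.randint(num_nodes, (1,), device=mask.device)] = True  # mask at least one node
        h = node_feats.clone()
        h[mask] = self.mask_token
        # Iterative message passing: each node aggregates the mean of its neighbors.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        for _ in range(self.num_steps):
            msgs = (adj @ h) / deg
            h = h + self.update(torch.cat([h, msgs], dim=-1))  # residual update
        # Self-supervised objective: recover the original features of masked nodes.
        loss = nn.functional.mse_loss(h[mask], node_feats[mask])
        return h, loss


# Hypothetical usage on a scene graph with 8 detected objects.
feats = torch.randn(8, 256)
adj = (torch.rand(8, 8) < 0.3).float()
rectified, loss = MaskedSceneGraphModel()(feats, adj)
loss.backward()
```

The residual update is one plausible design choice for "rectifying" a pre-trained representation: it keeps the refined features close to the general pre-trained prior while the message-passing steps inject task-specific relational knowledge from the scene graph.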