Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness

IF 4.6 | CAS Tier 2 (Computer Science) | JCR Q2 (Robotics) | IEEE Robotics and Automation Letters | Pub Date: 2024-10-17 | DOI: 10.1109/LRA.2024.3483042
Fang Gao; Jingfeng Tang; Jiabao Wang; Shaodong Li; Jun Yu
{"title":"通过知识感知增强视觉语言导航的场景理解能力","authors":"Fang Gao;Jingfeng Tang;Jiabao Wang;Shaodong Li;Jun Yu","doi":"10.1109/LRA.2024.3483042","DOIUrl":null,"url":null,"abstract":"Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents struggle to make accurate decisions when faced with similar candidate views during navigation, relying solely on the overall features of these views. This challenge primarily arises from the lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2, including relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model to augment the agent's understanding of the environment by leveraging room layout knowledge. The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU respectively provide room layout knowledge for instructions and vision-history features, thereby enhancing the agent's navigation abilities. To more effectively integrate knowledge information with instruction features, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"9 12","pages":"10874-10881"},"PeriodicalIF":4.6000,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness\",\"authors\":\"Fang Gao;Jingfeng Tang;Jiabao Wang;Shaodong Li;Jun Yu\",\"doi\":\"10.1109/LRA.2024.3483042\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents struggle to make accurate decisions when faced with similar candidate views during navigation, relying solely on the overall features of these views. This challenge primarily arises from the lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2, including relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model to augment the agent's understanding of the environment by leveraging room layout knowledge. 
The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU respectively provide room layout knowledge for instructions and vision-history features, thereby enhancing the agent's navigation abilities. To more effectively integrate knowledge information with instruction features, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.\",\"PeriodicalId\":13241,\"journal\":{\"name\":\"IEEE Robotics and Automation Letters\",\"volume\":\"9 12\",\"pages\":\"10874-10881\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2024-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Robotics and Automation Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10720886/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ROBOTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10720886/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0

Abstract

Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents struggle to make accurate decisions when faced with similar candidate views during navigation, relying solely on the overall features of these views. This challenge primarily arises from the lack of common-sense knowledge about room layouts.

Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2, including relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent.

We propose a Knowledge-Enhanced Scene Understanding (KESU) model to augment the agent's understanding of the environment by leveraging room layout knowledge. The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU respectively provide room layout knowledge for instructions and vision-history features, thereby enhancing the agent's navigation abilities. To more effectively integrate knowledge information with instruction features, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.
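The abstract does not detail how BLIP-2 is prompted to produce the four kinds of room layout knowledge. As a minimal illustrative sketch, assuming the publicly released BLIP-2 checkpoint on Hugging Face (the paper's exact checkpoint and prompts are not stated here), one could query the model once per knowledge type for each candidate view and compose the answers into a natural-language description. The prompt wording, model ID, and the `describe_view` helper below are assumptions, not the authors' pipeline.

```python
# Illustrative sketch only; pip install torch transformers pillow
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

MODEL_ID = "Salesforce/blip2-flan-t5-xl"  # assumed checkpoint, not confirmed by the paper
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(MODEL_ID)
model = Blip2ForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

# Hypothetical prompts, one per knowledge type named in the abstract.
PROMPTS = {
    "room_type": "Question: What type of room is this? Answer:",
    "objects": "Question: What objects are in this room? Answer:",
    "relations": "Question: How are the objects positioned relative to each other? Answer:",
    "attributes": "Question: What colors are the main objects? Answer:",
}

def describe_view(image: Image.Image) -> dict:
    """Query BLIP-2 once per knowledge type for a single candidate view."""
    facts = {}
    for name, prompt in PROMPTS.items():
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
        output_ids = model.generate(**inputs, max_new_tokens=40)
        facts[name] = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    return facts

facts = describe_view(Image.open("candidate_view.jpg"))  # placeholder image path
layout_description = (
    f"This is a {facts['room_type']} containing {facts['objects']}. "
    f"{facts['relations']} {facts['attributes']}"
)
```

Running such queries offline for every viewpoint would yield the kind of comprehensive natural-language layout descriptions the abstract refers to.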
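Similarly, the abstract names Dynamic Residual Fusion (DRF) for integrating knowledge with instruction features but does not define its formulation. One plausible reading, sketched below in PyTorch, lets instruction tokens attend to the encoded knowledge and adds the attended signal back through a learned token-wise gate; the class name, gate design, and head count are illustrative assumptions rather than the paper's method. The KHF module's fusion of knowledge with vision-history features could follow the same cross-attention pattern.

```python
import torch
import torch.nn as nn

class DynamicResidualFusion(nn.Module):
    """Illustrative sketch of a gated residual fusion of instruction
    and knowledge features; the paper's actual DRF may differ."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, instr: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # instr:     (B, L_i, D) instruction token features
        # knowledge: (B, L_k, D) encoded room-layout knowledge
        attended, _ = self.cross_attn(instr, knowledge, knowledge)
        # Token-wise gate in [0, 1] decides how much knowledge to inject.
        g = self.gate(torch.cat([instr, attended], dim=-1))
        return self.norm(instr + g * attended)  # gated residual connection

# Example usage with hypothetical feature shapes:
fusion = DynamicResidualFusion(dim=768)
instr_feats = torch.randn(2, 60, 768)      # batch of tokenized instructions
knowledge_feats = torch.randn(2, 40, 768)  # batch of encoded knowledge sentences
fused = fusion(instr_feats, knowledge_feats)  # (2, 60, 768)
```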
Source journal
IEEE Robotics and Automation Letters
Subject area: Computer Science (Computer Science Applications)
CiteScore: 9.60
Self-citation rate: 15.40%
Articles published: 1428
Journal description: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.
Latest articles in this journal
- Correction to: "Design Models and Performance Analysis for a Novel Shape Memory Alloy-Actuated Wearable Hand Exoskeleton for Rehabilitation"
- NavTr: Object-Goal Navigation With Learnable Transformer Queries
- A Diffusion-Based Data Generator for Training Object Recognition Models in Ultra-Range Distance
- Position Prediction for Space Teleoperation With SAO-CNN-BiGRU-Attention Algorithm
- MR-ULINS: A Tightly-Coupled UWB-LiDAR-Inertial Estimator With Multi-Epoch Outlier Rejection