{"title":"Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness","authors":"Fang Gao;Jingfeng Tang;Jiabao Wang;Shaodong Li;Jun Yu","doi":"10.1109/LRA.2024.3483042","DOIUrl":null,"url":null,"abstract":"Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents struggle to make accurate decisions when faced with similar candidate views during navigation, relying solely on the overall features of these views. This challenge primarily arises from the lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2, including relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model to augment the agent's understanding of the environment by leveraging room layout knowledge. The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU respectively provide room layout knowledge for instructions and vision-history features, thereby enhancing the agent's navigation abilities. To more effectively integrate knowledge information with instruction features, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"9 12","pages":"10874-10881"},"PeriodicalIF":4.6000,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10720886/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0
Abstract
Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents struggle to make accurate decisions when faced with similar candidate views during navigation, relying solely on the overall features of these views. This challenge primarily arises from the lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2, including relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model to augment the agent's understanding of the environment by leveraging room layout knowledge. The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU respectively provide room layout knowledge for instructions and vision-history features, thereby enhancing the agent's navigation abilities. To more effectively integrate knowledge information with instruction features, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.
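The abstract describes Dynamic Residual Fusion (DRF) only at a high level, as a way to integrate knowledge features with instruction features inside the IA module. Below is a minimal, speculative PyTorch sketch of what such a gated residual fusion between instruction and knowledge features might look like; the class name, shapes, cross-attention layout, and gating formulation are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a dynamic (gated) residual fusion between instruction
# features and room-layout knowledge features. All names, shapes, and the
# gating formulation are assumptions for illustration, not the paper's method.
import torch
import torch.nn as nn


class DynamicResidualFusion(nn.Module):
    """Attend from instruction tokens to knowledge tokens, then add the
    attended knowledge back through an input-dependent residual gate."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, instr_feats: torch.Tensor, know_feats: torch.Tensor) -> torch.Tensor:
        # instr_feats: (B, L_instr, d_model) token-level instruction features
        # know_feats:  (B, L_know, d_model) encoded room-layout knowledge
        attended, _ = self.cross_attn(instr_feats, know_feats, know_feats)
        # Per-token gate decides how much knowledge to mix into the residual path.
        g = self.gate(torch.cat([instr_feats, attended], dim=-1))
        return self.norm(instr_feats + g * attended)


# Example usage with random tensors standing in for encoder outputs.
fusion = DynamicResidualFusion()
instr = torch.randn(2, 60, 768)      # 2 instructions, 60 tokens each
knowledge = torch.randn(2, 40, 768)  # 40 knowledge tokens per environment
out = fusion(instr, knowledge)       # (2, 60, 768)
```

The gated residual form keeps the original instruction features intact when the knowledge is uninformative (gate near zero), which is one plausible reading of why a "dynamic" residual connection would help fuse the two modalities.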
Journal Description:
The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.