Enhancing multimodal-input object goal navigation by leveraging large language models for inferring room–object relationship knowledge

Advanced Engineering Informatics (IF 8; Zone 1, Engineering Technology; JCR Q1, Computer Science, Artificial Intelligence) Pub Date: 2025-01-28 DOI: 10.1016/j.aei.2025.103135

Leyuan Sun, Asako Kanezaki, Guillaume Caron, Yusuke Yoshiyasu

Advanced Engineering Informatics, vol. 65, Article 103135. Journal Article. Publisher page: https://www.sciencedirect.com/science/article/pii/S147403462500028X

Citations: 0

Abstract

Object-goal navigation is an embodied AI task in which an agent must navigate to a specified object in an unfamiliar indoor environment. This task is crucial for engineering activities such as training agents in 3D simulated environments and deploying the resulting models on real mobile robots. Extensive research has produced various navigation methods, including end-to-end reinforcement learning and modular map-based approaches. However, enabling an agent to perceive and understand its environment, and to navigate towards a target object as efficiently as a human, remains a considerable challenge. In this study, we introduce a data-driven, modular map-based approach trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from a Large Language Model (LLM), with the aim of improving the efficiency of object-goal navigation. This approach enables the agent to seek the target object in rooms where it is commonly found (e.g., a bed in a bedroom, a couch in a living room), according to LLM-based common-sense knowledge. Additionally, we employ a multi-channel Swin-Unet architecture for multi-task learning, integrating multimodal sensory inputs to extract meaningful features for spatial comprehension and navigation. Results from the Habitat simulator show that our framework surpasses the baseline by an average of 10.6% in the Success weighted by Path Length (SPL) efficiency metric. Real-world demonstrations confirm that our method can effectively navigate across multiple rooms in the object-goal navigation task. For further details and real-world demonstrations, please visit our project webpage (https://sunleyuan.github.io/ObjectNav).
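
For readers unfamiliar with the SPL metric named in the abstract, the following is a minimal sketch of how it is commonly computed in the embodied-navigation literature: each successful episode scores the ratio of the shortest-path length to the longer of the agent's path and the shortest path, failed episodes score zero, and the scores are averaged. The function name `spl`, its parameters, and the sample numbers are illustrative assumptions and are not taken from the paper or from the Habitat code base.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        agent_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if not (len(successes) == len(shortest_lengths) == len(agent_lengths)):
        raise ValueError("all inputs must have one entry per episode")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # A successful episode scores shortest / max(taken, shortest),
            # so a path longer than the optimum is penalised proportionally.
            total += shortest / max(taken, shortest)
        # A failed episode contributes 0 regardless of path length.
    return total / len(successes)


if __name__ == "__main__":
    # Made-up episodes: one optimal success, one success with a path
    # twice the optimum, and one failure.
    score = spl([True, True, False], [5.0, 5.0, 4.0], [5.0, 10.0, 8.0])
    print(f"SPL = {score:.3f}")
```

Because failures score zero and detours shrink the per-episode ratio, SPL rewards both reaching the target and doing so along a short path, which is why the abstract describes it as an efficiency metric.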

Source journal

Advanced Engineering Informatics (Engineering Technology - Engineering: Multidisciplinary)

CiteScore: 12.40

Self-citation rate: 18.20%

Articles published: 292

Review time: 45 days

About the journal: Advanced Engineering Informatics is an international journal that solicits research papers with an emphasis on 'knowledge' and 'engineering applications'. The journal seeks original papers that report progress in applying methods of engineering informatics. These papers should have engineering relevance and help provide a scientific base for more reliable, spontaneous, and creative engineering decision-making. Additionally, papers should demonstrate the science of supporting knowledge-intensive engineering tasks and validate the generality, power, and scalability of new methods through rigorous evaluation, preferably both qualitative and quantitative. Abstracting and indexing services for Advanced Engineering Informatics include Science Citation Index Expanded, Scopus, and INSPEC.