Learning and grounding visual multimodal adaptive graph for visual navigation

Impact Factor: 14.7 | Zone 1, Computer Science | JCR Q1, Computer Science, Artificial Intelligence | Information Fusion | Pub Date: 2025-02-11 | DOI: 10.1016/j.inffus.2025.103009

Kang Zhou, Jianping Wang, Weitao Xu, Linqi Song, Zaiqiao Ye, Chi Guo, Cong Li

Information Fusion, vol. 118, Article 103009 (journal article). https://www.sciencedirect.com/science/article/pii/S156625352500082X

Citations: 0

Abstract

Visual navigation requires an agent to perceive the environment reasonably and navigate effectively to a given target. For this task, we present a Multimodal Adaptive Graph (MAG) for learning and grounding visual clues based on object relationships. MAG consists of key navigation elements: relative position relationships between objects, previous navigation actions, past training experience, and target objects. This enables the agent to gather multimodal information accurately and find the target faster. Technically, our framework performs continuous modeling with a pre-trained vision–language grounding model to align the multimodal graph and text information with visual perception. For the output, we introduce constraints on the graph's value estimation (GVE) functions to supervise the agent in predicting optimal actions, which can help it escape from deadlocks. With the MAG, the agent can perceive the environment effectively and select optimal actions. We train our framework with human demonstrations and collision signals. Results demonstrate that our approach improves SPL (Success weighted by Path Length) by 10.1% and success rate by 25.4% relative to the baseline method in the AI2THOR environment. Our code will be publicly released to the scientific community.
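The abstract reports results in success rate and SPL but does not spell out how they are computed. A minimal sketch follows, assuming the commonly used definition SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took. The `Episode` record, the function names, and the sample values are hypothetical and for illustration only; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    """One evaluation episode (hypothetical record type, not from the paper)."""
    success: bool         # S_i: did the agent reach the target?
    shortest_path: float  # l_i: shortest-path length from start to target
    agent_path: float     # p_i: length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        denom = max(e.agent_path, e.shortest_path)
        # Guard against zero-length episodes (agent starts at the target).
        total += e.shortest_path / denom if denom > 0 else 1.0
    return total / len(episodes)


# Illustrative values only; these are NOT results from the paper.
episodes = [
    Episode(success=True, shortest_path=4.0, agent_path=4.0),   # optimal path
    Episode(success=True, shortest_path=2.0, agent_path=4.0),   # twice as long
    Episode(success=False, shortest_path=3.0, agent_path=6.0),  # failure
    Episode(success=True, shortest_path=3.0, agent_path=6.0),   # twice as long
]

print(f"success rate = {success_rate(episodes):.2f}")  # 0.75
print(f"SPL          = {spl(episodes):.2f}")           # 0.50
```

Because SPL scales each success by path efficiency, it can never exceed the success rate. An agent that reaches the target by a long detour therefore scores lower on SPL than one that takes a near-shortest route.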

Source journal

Information Fusion (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 33.20
Self-citation rate: 4.30%
Articles published: 161
Review time: 7.9 months

About the journal: Information Fusion serves as a central platform for showcasing advances in multi-sensor, multi-source, and multi-process information fusion, and for fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.