Learning and grounding visual multimodal adaptive graph for visual navigation

Kang Zhou, Jianping Wang, Weitao Xu, Linqi Song, Zaiqiao Ye, Chi Guo, Cong Li

Information Fusion, Volume 118, Article 103009 (published 2025-02-11). DOI: 10.1016/j.inffus.2025.103009
Citations: 0
Abstract
Visual navigation requires an agent to reasonably perceive the environment and effectively navigate to a given target. For this task, we present a Multimodal Adaptive Graph (MAG) that learns and grounds visual clues based on object relationships. MAG consists of key navigation elements: relative object positions, previous navigation actions, past training experience, and target objects. This enables the agent to gather multimodal information accurately and find the target faster. Technically, our framework continuously models a pre-trained vision–language grounding model to align the multimodal graph and text information with visual perception. For the output, we introduce constraints on the graph’s value estimation (GVE) functions to supervise the agent in predicting optimal actions, which helps it escape deadlocks. With the MAG, the agent can effectively perceive the environment and take optimal actions. We train our framework with human demonstrations and collision signals. Results demonstrate that our approach improves SPL (Success weighted by Path Length) by 10.1% and success rate by 25.4% relative to the baseline method in the AI2THOR environment. Our code will be publicly released to the scientific community.
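For context on the reported numbers, SPL follows the standard embodied-navigation definition (Anderson et al., 2018): per episode, success weighted by the ratio of the shortest-path length to the length the agent actually traveled. The sketch below is not from the paper; the function names and toy episode values are illustrative assumptions, intended only to show how SPL and success rate are computed over a batch of evaluation episodes.

```python
import numpy as np

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length:
    SPL = mean_i( S_i * l_i / max(p_i, l_i) ), where S_i is the binary
    success flag, l_i the shortest-path length from start to target,
    and p_i the length of the path the agent actually took."""
    s = np.asarray(successes, dtype=float)
    l = np.asarray(shortest_lengths, dtype=float)
    p = np.asarray(path_lengths, dtype=float)
    return float(np.mean(s * l / np.maximum(p, l)))

def success_rate(successes):
    """Fraction of episodes in which the agent reached the target."""
    return float(np.mean(np.asarray(successes, dtype=float)))

# Three hypothetical evaluation episodes (values are made up for illustration).
print(spl([1, 1, 0], [2.0, 3.5, 4.0], [2.5, 3.5, 6.0]))  # 0.60
print(success_rate([1, 1, 0]))                            # 0.67
```

Because SPL discounts successful episodes by how far the agent strayed from the shortest path, the joint gains in SPL and success rate reported above indicate both more reliable and more efficient navigation than the baseline.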
Journal overview:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, and fosters collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.