Title: Adaptive Cross-Modal Experts Network with Uncertainty-Driven Fusion for Vision–Language Navigation
Authors: Jie Wu, Chunlei Wu, Xiuxuan Shen, Leiquan Wang
Journal: Knowledge-Based Systems, vol. 307, Article 112735
Publication date: 2024-11-19
Publication type: Journal Article
DOI: 10.1016/j.knosys.2024.112735
URL: https://www.sciencedirect.com/science/article/pii/S0950705124013698
Impact factor: 7.2 (JCR Q1, Computer Science, Artificial Intelligence)
Citation count: 0
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate autonomously in real-world environments, following language instructions to reach specified destinations and accurately locate relevant targets. Although significant progress has been made in recent years, two major limitations remain: (1) existing methods lack flexibility and diversity in processing multimodal information and cannot dynamically adjust to different input features; and (2) fixed fusion strategies cannot adapt to varying data quality in open environments, and they neither fully exploit multi-scale features nor adequately handle complex nonlinear relationships. This paper proposes an adaptive cross-modal experts network (ACME) with uncertainty-driven fusion to address these issues. The adaptive cross-modal experts module dynamically selects the most suitable expert network based on the input features, which makes information processing more diverse and flexible. The uncertainty-driven fusion module balances coarse-grained and fine-grained information by computing a confidence for each and adjusting the fusion weights accordingly. Comprehensive experiments on the R2R, SOON, and REVERIE datasets demonstrate that ACME significantly outperforms existing VLN approaches.
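The abstract does not give the formulas behind the two modules, so the following is only a minimal sketch of the general ideas it names, not the authors' implementation. It assumes top-1 softmax gating for expert selection and a confidence defined as one minus the normalized entropy of a softmax distribution; the function names (`select_expert`, `confidence`, `fuse`), the `tanh` expert layers, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def select_expert(features, gate_weights, expert_weights):
    """Route the input to one expert (assumed top-1 gating, hypothetical).

    features:       (d,) input feature vector
    gate_weights:   (d, num_experts) gating projection
    expert_weights: (num_experts, d, d) one linear layer per expert
    """
    gate_probs = softmax(features @ gate_weights)
    idx = int(np.argmax(gate_probs))  # expert with the highest gate score
    output = np.tanh(features @ expert_weights[idx])
    return output, idx, gate_probs


def confidence(logits):
    """Assumed confidence measure: 1 - normalized entropy, in [0, 1]."""
    p = softmax(logits)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return 1.0 - entropy / np.log(len(logits))


def fuse(coarse_logits, fine_logits):
    """Weight the coarse and fine predictions by their relative confidence."""
    c = np.array([confidence(coarse_logits), confidence(fine_logits)])
    total = c.sum()
    # Fall back to equal weights when neither branch is confident.
    w = c / total if total > 1e-8 else np.full(2, 0.5)
    return w[0] * coarse_logits + w[1] * fine_logits, w


# Example usage with random (illustrative) parameters.
rng = np.random.default_rng(0)
feats = rng.normal(size=4)
out, idx, probs = select_expert(
    feats, rng.normal(size=(4, 3)), rng.normal(size=(3, 4, 4))
)
fused, weights = fuse(np.array([5.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]))
```

In this sketch, a branch whose prediction is close to uniform receives almost no weight, so the fused output follows the more confident branch. Other uncertainty estimates (for example, evidential or variance-based ones) could be substituted in `confidence` without changing the fusion step.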
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.