{"title":"Multimodal dynamic fusion framework: Multilevel feature fusion guided by prompts","authors":"Lei Pan, Huan-Qing Wu","doi":"10.1111/exsy.13668","DOIUrl":null,"url":null,"abstract":"<p>With the progressive augmentation of parameters in multimodal models, to optimize computational efficiency, some studies have adopted the approach of fine-tuning the unimodal pre-training model to achieve multimodal fusion tasks. However, these methods tend to rely solely on simplistic or singular fusion strategies, thereby neglecting more flexible fusion approaches. Moreover, existing methods prioritize the integration of modality features containing highly semantic information, often overlooking the influence of fusing low-level features on the outcomes. Therefore, this study introduces an innovative approach named multilevel feature fusion guided by prompts (MFF-GP), a multimodal dynamic fusion framework. It guides the dynamic neural network by prompt vectors to dynamically select the suitable fusion network for each hierarchical feature of the unimodal pre-training model. This method improves the interactions between multiple modalities and promotes a more efficient fusion of features across them. Extensive experiments on the UPMC Food 101, SNLI-VE and MM-IMDB datasets demonstrate that with only a few trainable parameters, MFF-GP achieves significant accuracy improvements compared to a newly designed PMF based on fine-tuning—specifically, an accuracy improvement of 2.15% on the UPMC Food 101 dataset and 0.82% on the SNLI-VE dataset. Further study of the results reveals that increasing the diversity of interactions between distinct modalities is critical and delivers significant performance improvements. Furthermore, for certain multimodal tasks, focusing on the low-level features is beneficial for modality integration. Our implementation is available at: https://github.com/whq2024/MFF-GP.</p>","PeriodicalId":51053,"journal":{"name":"Expert Systems","volume":"41 11","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/exsy.13668","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
As the parameter counts of multimodal models continue to grow, some studies have pursued computational efficiency by fine-tuning unimodal pre-trained models for multimodal fusion tasks. However, these methods tend to rely on a single, simplistic fusion strategy, neglecting more flexible alternatives. Moreover, existing methods prioritize fusing modality features that carry high-level semantic information, often overlooking the influence of low-level features on the outcome. This study therefore introduces multilevel feature fusion guided by prompts (MFF-GP), a multimodal dynamic fusion framework. Prompt vectors guide a dynamic neural network to select a suitable fusion network for each hierarchical feature of the unimodal pre-trained models. This improves the interactions between modalities and promotes more effective fusion of their features. Extensive experiments on the UPMC Food 101, SNLI-VE, and MM-IMDB datasets demonstrate that, with only a few trainable parameters, MFF-GP achieves significant accuracy improvements over the recently proposed fine-tuning-based PMF: 2.15% higher accuracy on UPMC Food 101 and 0.82% higher on SNLI-VE. Further analysis reveals that increasing the diversity of interactions between distinct modalities is critical and yields significant performance gains. Moreover, for certain multimodal tasks, attending to low-level features benefits modality integration. Our implementation is available at: https://github.com/whq2024/MFF-GP.
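To make the core mechanism concrete, the following is a minimal PyTorch sketch of prompt-guided dynamic fusion as we read it from the abstract: a trainable prompt vector scores a set of candidate fusion networks at each encoder layer, and their outputs are mixed by the resulting weights. All module names, dimensions, and the soft-selection scheme (softmax routing, two candidate fusions) are our assumptions for illustration, not the authors' implementation; see the linked repository for the real one.

```python
# Hypothetical sketch of MFF-GP-style prompt-guided fusion; names and
# hyperparameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """One candidate fusion network: concatenate modalities, then project."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, vis, txt):
        return self.proj(torch.cat([vis, txt], dim=-1))


class SumFusion(nn.Module):
    """Another candidate: project each modality separately, then add."""
    def __init__(self, dim):
        super().__init__()
        self.v = nn.Linear(dim, dim)
        self.t = nn.Linear(dim, dim)

    def forward(self, vis, txt):
        return self.v(vis) + self.t(txt)


class PromptGuidedLayerFusion(nn.Module):
    """Per-layer router: a trainable prompt vector scores the candidate
    fusion networks; their outputs are mixed by the softmax weights."""
    def __init__(self, dim, prompt_dim=64):
        super().__init__()
        self.fusions = nn.ModuleList([ConcatFusion(dim), SumFusion(dim)])
        self.prompt = nn.Parameter(torch.randn(prompt_dim))  # trainable prompt
        self.router = nn.Linear(prompt_dim, len(self.fusions))

    def forward(self, vis, txt):
        weights = torch.softmax(self.router(self.prompt), dim=-1)  # (K,)
        outs = torch.stack([f(vis, txt) for f in self.fusions], dim=0)  # (K, B, D)
        return torch.einsum("k,k...->...", weights, outs)  # weighted mix


# Usage: one router per hierarchical layer of frozen unimodal encoders.
dim, num_layers = 768, 12
layers = nn.ModuleList(PromptGuidedLayerFusion(dim) for _ in range(num_layers))
vis_feats = [torch.randn(4, dim) for _ in range(num_layers)]  # dummy features
txt_feats = [torch.randn(4, dim) for _ in range(num_layers)]
fused = [layer(v, t) for layer, v, t in zip(layers, vis_feats, txt_feats)]
print(fused[0].shape)  # torch.Size([4, 768])
```

Under this reading, only the prompts, routers, and fusion heads are trained while the unimodal backbones stay frozen, which is consistent with the abstract's claim of "only a few trainable parameters".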
Journal overview
Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper.
In addition to traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.