Title: Cross-modal evidential fusion network for social media classification
Authors: Chen Yu, Zhiguo Wang
Journal: Computer Speech and Language, volume 92, Article 101784 (impact factor 3.1; JCR Q2, Computer Science, Artificial Intelligence)
DOI: 10.1016/j.csl.2025.101784
Published: 2025-02-14
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000099
Citation count: 0
Abstract
Human activity on social networks reflects public attitudes towards various events and is important for economic development and social progress. Driven by the development of deep multimodal techniques, many studies have focused on multimodal tasks in social media. However, existing multimodal methods treat reliable and unreliable modalities equally, which harms the efficiency of multimodal classification on social media. A reliable method for multimodal fusion is therefore required. This study presents a novel cross-modal evidential fusion network (CEFN), based on subjective logic theory, that incorporates uncertainty estimates into the multimodal fusion process. CEFN models uncertainty directly and learns more reliable representations by treating the outputs of the encoders as subjective opinions. To reduce semantic uncertainty caused by random noise, a momentum model is employed for each unimodal encoder, and the unimodal encoders align with the pseudo-views generated by these momentum models to mitigate the effects of noise. In addition, CEFN introduces a conflict loss function to facilitate representation learning from image-text pairs that contain opposing views. This loss captures the uncertainty arising from cross-modal conflicts and improves the feature extraction capability of each encoder. Experimental results on three real-world social media datasets show that CEFN outperforms related multimodal networks.
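To make the idea of treating encoder outputs as subjective opinions more concrete, the following is a minimal Python sketch of the textbook subjective-logic mapping from per-class evidence to belief and uncertainty masses, followed by a reduced Dempster-style combination of two opinions. It is an illustration under stated assumptions, not CEFN's actual formulation: the abstract does not give the paper's equations, and the function names (`opinion_from_evidence`, `conflict`, `fuse`) and the example evidence values are hypothetical.

```python
from typing import List, Tuple

# An opinion is (belief mass per class, uncertainty mass); the masses sum to 1.
Opinion = Tuple[List[float], float]


def opinion_from_evidence(evidence: List[float]) -> Opinion:
    """Map non-negative per-class evidence to a subjective opinion.

    Textbook mapping (assumed here, not taken from the paper):
    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S
    belief = [e / strength for e in evidence]
    return belief, num_classes / strength


def conflict(op_a: Opinion, op_b: Opinion) -> float:
    """Total belief mass the two opinions place on different classes."""
    belief_a, _ = op_a
    belief_b, _ = op_b
    return sum(
        belief_a[i] * belief_b[j]
        for i in range(len(belief_a))
        for j in range(len(belief_b))
        if i != j
    )


def fuse(op_a: Opinion, op_b: Opinion) -> Opinion:
    """Combine two opinions with a reduced Dempster's rule (an assumption)."""
    belief_a, unc_a = op_a
    belief_b, unc_b = op_b
    scale = 1.0 - conflict(op_a, op_b)
    if scale <= 0.0:
        raise ValueError("opinions are in total conflict")
    belief = [
        (a * b + a * unc_b + b * unc_a) / scale
        for a, b in zip(belief_a, belief_b)
    ]
    return belief, (unc_a * unc_b) / scale


if __name__ == "__main__":
    # Hypothetical evidence: a confident text encoder, a weak image encoder.
    text_opinion = opinion_from_evidence([9.0, 1.0, 0.0])
    image_opinion = opinion_from_evidence([0.0, 0.0, 1.0])
    print("conflict:", conflict(text_opinion, image_opinion))
    print("fused opinion:", fuse(text_opinion, image_opinion))
```

In this sketch a modality with little evidence carries a large uncertainty mass and therefore contributes less to the fused belief, which is one way that reliable and unreliable modalities can be weighted differently. The `conflict` term only illustrates how disagreement between an image-text pair could be quantified; it should not be read as the paper's conflict loss.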
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.