CRISP: A cross-modal integration framework based on the surprisingly popular algorithm for multimodal named entity recognition

IF 5.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Neurocomputing Pub Date : 2024-10-28 DOI:10.1016/j.neucom.2024.128792

Haitao Liu , Xianwei Xin , Jihua Song , Weiming Peng

{"title":"CRISP: A cross-modal integration framework based on the surprisingly popular algorithm for multimodal named entity recognition","authors":"Haitao Liu , Xianwei Xin , Jihua Song , Weiming Peng","doi":"10.1016/j.neucom.2024.128792","DOIUrl":null,"url":null,"abstract":"<div><div>The multimodal named entity recognition task on social media involves recognizing named entities with textual and visual information, which is of great significance for information processing. Nevertheless, many existing models still face the following challenges. First, in the process of cross-modal interaction, the attention mechanism sometimes focuses on trivial parts in the images that are not relevant to entities, which not only neglects valuable information but also inevitably introduces visual noise. Second, the gate mechanism is widely used for filtering out visual information to reduce the influence of noise on text understanding. However, the gate mechanism neglects capturing fine-grained semantic relevance between modalities, which easily affects the filtration process. To address these issues, we propose a cross-modal integration framework based on the surprisingly popular algorithm, aiming at enhancing the integration of effective visual guidance and reducing the interference of irrelevant visual noise. Specifically, we design a dual-branch interaction module that includes the attention mechanism and the surprisingly popular algorithm, allowing the model to focus on valuable but overlooked parts in the images. Furthermore, we compute the matching degree between modalities at the multi-granularity level, using the Choquet integral to establish a more reasonable basis for filtering out visual noise. We have conducted extensive experiments on public datasets, and the experimental results demonstrate the advantages of our model.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128792"},"PeriodicalIF":5.5000,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224015637","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

The multimodal named entity recognition task on social media involves recognizing named entities with textual and visual information, which is of great significance for information processing. Nevertheless, many existing models still face the following challenges. First, in the process of cross-modal interaction, the attention mechanism sometimes focuses on trivial parts in the images that are not relevant to entities, which not only neglects valuable information but also inevitably introduces visual noise. Second, the gate mechanism is widely used for filtering out visual information to reduce the influence of noise on text understanding. However, the gate mechanism neglects capturing fine-grained semantic relevance between modalities, which easily affects the filtration process. To address these issues, we propose a cross-modal integration framework based on the surprisingly popular algorithm, aiming at enhancing the integration of effective visual guidance and reducing the interference of irrelevant visual noise. Specifically, we design a dual-branch interaction module that includes the attention mechanism and the surprisingly popular algorithm, allowing the model to focus on valuable but overlooked parts in the images. Furthermore, we compute the matching degree between modalities at the multi-granularity level, using the Choquet integral to establish a more reasonable basis for filtering out visual noise. We have conducted extensive experiments on public datasets, and the experimental results demonstrate the advantages of our model.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

CRISP：基于多模态命名实体识别惊人流行算法的跨模态整合框架

社交媒体上的多模态命名实体识别任务涉及识别带有文本和视觉信息的命名实体，这对信息处理具有重要意义。然而，许多现有模型仍面临以下挑战。首先，在跨模态交互过程中，注意力机制有时会关注图像中与实体无关的琐碎部分，这不仅会忽略有价值的信息，还不可避免地会引入视觉噪声。其次，门机制被广泛用于过滤视觉信息，以减少噪声对文本理解的影响。然而，门机制忽略了捕捉模态之间细粒度的语义相关性，这很容易影响过滤过程。为了解决这些问题，我们提出了一种基于惊人算法的跨模态整合框架，旨在加强有效视觉引导的整合，减少无关视觉噪声的干扰。具体来说，我们设计了一个双分支交互模块，其中包括注意力机制和令人惊讶的流行算法，使模型能够关注图像中有价值但被忽视的部分。此外，我们在多粒度水平上计算模态之间的匹配度，利用乔奎特积分为过滤视觉噪声建立更合理的基础。我们在公共数据集上进行了大量实验，实验结果证明了我们模型的优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Neurocomputing 工程技术-计算机：人工智能

CiteScore

13.10

自引率

10.00%

发文量

1382

审稿时长

70 days

期刊介绍： Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.