Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning

IF 1.0 | CAS Region 4 (Computer Science) | JCR Q4 (Engineering, Electrical & Electronic) | Journal of Electronic Imaging | Pub Date: 2024-08-01 | DOI: 10.1117/1.jei.33.4.043042
Qianyao Shi, Wanru Xu, Zhenjiang Miao
Citations: 0

Abstract

Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning
Nowadays, we are surrounded by various types of data from different modalities, such as text, images, audio, and video. This multimodal data provides rich information, but it also brings a new challenge: how do we effectively utilize it for accurate classification? This is the main problem faced by multimodal classification, an important task that aims to classify data drawn from different modalities. However, because data from different modalities differ in characteristics and structure, effectively fusing and exploiting them for classification is challenging. To address this issue, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC) to better integrate information from different modalities. On the one hand, existing multimodal fusion methods ignore intra- and inter-modality relationships, leaving information within the modalities unexploited and yielding unsatisfactory classification performance. To address this insufficient interaction of modality information in existing algorithms, we use a cross-attention contextual transformer to capture the contextual relationships within and among modalities and improve the representational power of the model. On the other hand, because the quality of information differs among modalities, some modalities may carry misleading or ambiguous information. Treating every modality equally can introduce modality perceptual noise, which reduces multimodal classification performance. Therefore, we use modality-collaborative learning to filter misleading information, alleviate quality differences among modalities, align modality information with high-quality and effective modalities, enhance unimodal information, and obtain a more ideal multimodal fusion representation, improving the model's discriminative ability.
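As a rough illustration of the cross-attention idea described in the abstract, the sketch below lets each modality's features attend to the other's via scaled dot-product attention. All shapes, variable names, and the mean-pooling fusion at the end are hypothetical assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # Scaled dot-product attention where one modality queries the other.
    # queries: (n_q, d), keys_values: (n_kv, d) -> output: (n_q, d)
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (n_q, n_kv)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ keys_values

rng = np.random.default_rng(0)
text_feats = rng.standard_normal((4, 8))    # e.g., 4 text tokens, dim 8
image_feats = rng.standard_normal((6, 8))   # e.g., 6 image regions, dim 8

# Each modality attends to the other; the attended features are then
# pooled and concatenated into a single fused vector (illustrative only).
text_ctx = cross_attention(text_feats, image_feats, d_k=8)
image_ctx = cross_attention(image_feats, text_feats, d_k=8)
fused = np.concatenate([text_ctx.mean(axis=0), image_ctx.mean(axis=0)])
print(fused.shape)  # (16,)
```

In a full transformer block this would be wrapped with learned query/key/value projections, residual connections, and layer normalization; the sketch keeps only the attention computation itself.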
Comparative experiments on two benchmark image-text classification datasets, CrisisMMD and UPMC Food-101, show that our proposed model outperforms other classification methods, including state-of-the-art (SOTA) multimodal classification methods. Ablation experiments verify the effectiveness of the cross-attention module, the multimodal contextual attention network, and modality-collaborative learning. In addition, hyper-parameter validation experiments show that different fusion calculation methods lead to different results, and we identify the most effective feature-tensor calculation method. Qualitative experiments show that, compared with the original model, our proposed model identifies the expected results in the vast majority of cases. The code is available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. The CrisisMMD dataset is available at https://dataverse.mpisws.org/dataverse/icwsm18, and the UPMC Food-101 dataset is available at https://visiir.isir.upmc.fr/.
Source journal
Journal of Electronic Imaging (Engineering Technology: Imaging Science & Photographic Technology)
CiteScore: 1.70
Self-citation rate: 27.30%
Articles published: 341
Review time: 4.0 months
About the journal: The Journal of Electronic Imaging publishes peer-reviewed papers in all technology areas that make up the field of electronic imaging and are normally considered in the design, engineering, and applications of electronic imaging systems.