Exploring refined dual visual features cross-combination for image captioning

IF 6.3 · Zone 1 (Computer Science) · Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Neural Networks · Pub Date: 2024-09-07 · DOI: 10.1016/j.neunet.2024.106710

Junbo Hu, Zhixin Li, Qiang Su, Zhenjun Tang, Huifang Ma

Neural Networks, volume 180, Article 106710. Journal article; not open access. https://www.sciencedirect.com/science/article/pii/S0893608024006348

Citations: 0

Abstract

Transformer-based encoders have become commonplace in image captioning for encoding region features and grid features. Thanks to their multi-head self-attention mechanism, these encoders can better capture contextual information and the relationships between different regions of an image. However, stacking Transformer blocks requires self-attention over the visual features, whose cost grows quadratically with the number of features; this not only computes many redundant features but also significantly increases computational overhead. This paper presents a novel Distilled Cross-Combination Transformer (DCCT) network. Technically, we first introduce a distillation cascade fusion encoder (DCFE), in which a probabilistic sparse self-attention layer filters out redundant and distracting features that affect attention focus, with the aim of obtaining more refined visual features and improving encoding efficiency. Next, we develop a parallel cross-fusion attention module (PCFA) that exploits the complementarity and correlation between grid and region features to better fuse the encoded dual visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT method achieves performance competitive with current state-of-the-art approaches.
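The abstract does not spell out how the probabilistic sparse self-attention layer is computed, so the following is only a rough illustration of the general idea, not the paper's DCFE. It assumes a selection rule in the style of ProbSparse attention from the Informer model: each query is scored by a max-minus-mean measure over a random sample of keys, only the `top_u` highest-scoring queries receive full softmax attention, and the remaining queries fall back to the mean of the values. The function name `prob_sparse_attention` and the parameters `top_u`, `n_sample`, and `seed` are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prob_sparse_attention(Q, K, V, top_u, n_sample, seed=0):
    """Illustrative sketch (assumed, not the paper's layer): full attention
    for the top_u most 'active' queries, mean of V for all other queries."""
    rng = random.Random(seed)
    n_k, d, d_v = len(K), len(Q[0]), len(V[0])
    scale = 1.0 / math.sqrt(d)

    # 1) Score every query on a random subset of keys, which is cheaper
    #    than evaluating all query-key pairs.
    sample_idx = rng.sample(range(n_k), min(n_sample, n_k))
    sparsity = []
    for q in Q:
        s = [dot(q, K[j]) * scale for j in sample_idx]
        sparsity.append(max(s) - sum(s) / len(s))  # max-minus-mean measure

    # 2) Keep the top_u queries with the largest measure.
    ranked = sorted(range(len(Q)), key=lambda i: sparsity[i], reverse=True)
    active = set(ranked[:top_u])

    # 3) Active queries attend to all keys; the rest reuse mean(V).
    mean_v = [sum(v[c] for v in V) / len(V) for c in range(d_v)]
    out = []
    for i, q in enumerate(Q):
        if i in active:
            w = softmax([dot(q, k) * scale for k in K])
            out.append([sum(w[j] * V[j][c] for j in range(n_k))
                        for c in range(d_v)])
        else:
            out.append(list(mean_v))
    return out, sorted(active)


# Toy usage: 16 random "visual features" of dimension 8 (made-up data).
rng = random.Random(1)
feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
out, active = prob_sparse_attention(feats, feats, feats, top_u=4, n_sample=6)
print(len(out), len(out[0]), len(active))
```

In this sketch only 4 of the 16 queries pay for a full pass over the keys, which is the kind of saving the abstract attributes to filtering redundant features; the actual selection rule and fallback used in DCCT may differ.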




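Source journal

Neural Networks (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 13.90
Self-citation rate: 7.70%
Articles published: 425
Review time: 67 days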

Journal description: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. The journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, it aims to encourage the development of biologically inspired artificial intelligence.