LLAFN-Generator: Learnable linear-attention with fast-normalization for large-scale image captioning

Computer Vision and Image Understanding, Volume 248, Article 104088 · IF 3.5 · Region 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Pub Date: 2024-11-01 · Epub Date: 2024-07-18 · DOI: 10.1016/j.cviu.2024.104088

Xiaobao Yang, Xi Tian, Junsheng Wu, Xiaochun Yang, Sugang Ma, Xinman Qi, Zhiqiang Hou

Citations: 0

Abstract

Although the Transformer is widely used in computer vision, the quadratic complexity of its self-attention hinders large-scale image captioning. In this paper, we therefore propose Learnable Linear-Attention with Fast-Normalization for large-scale image captioning (dubbed LLAFN-Generator). First, it introduces a Learnable Linear-Attention (LLA) module to learn weight scores for large-scale images; the module is implemented simply with two linear layers and greatly reduces computational complexity. Meanwhile, the Fast-Normalization (FN) method replaces the original Softmax function within the Learnable Linear-Attention to improve computational speed. Additionally, a feature enhancement module is used to compensate for shallow, fine-grained information and thereby enhance the model's feature representation. Finally, extensive experiments on the MS COCO dataset show that, for models of the same size, computational complexity is reduced by 30% and the number of parameters by 20%, while the BLEU_1 and CIDEr metrics increase by 1.2% and 3.6%, respectively.
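
The abstract gives no formulas, so the following is only a rough sketch of one possible reading of "attention through two linear layers with a Softmax-free normalization". It is not the authors' implementation. The function names (`fast_normalize`, `linear_attention`), the ReLU-then-divide normalization, the slot count `s`, and the weight shapes are all assumptions made for illustration.

```python
def fast_normalize(scores, eps=1e-6):
    """Scale scores to sum to roughly 1 without computing exponentials.

    Assumption: clip negatives to zero (ReLU), then divide by the sum.
    This stands in for the Fast-Normalization step named in the abstract.
    """
    clipped = [max(s, 0.0) for s in scores]
    total = sum(clipped) + eps  # eps avoids division by zero
    return [c / total for c in clipped]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear_attention(features, w_score, w_value):
    """Hypothetical attention built from two linear layers.

    features: n x d token features
    w_score:  d x s weights of the first linear layer (s is a small, fixed size)
    w_value:  s x d weights of the second linear layer
    The cost grows as O(n * d * s), linear in the number of tokens n,
    whereas standard self-attention compares all token pairs, O(n^2 * d).
    """
    scores = matmul(features, w_score)                 # n x s
    weights = [fast_normalize(row) for row in scores]  # normalize each row
    return matmul(weights, w_value)                    # n x d


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    first_layer = [[1.0, 0.0], [0.0, 1.0]]
    second_layer = [[2.0, 0.0], [0.0, 4.0]]
    for row in linear_attention(tokens, first_layer, second_layer):
        print(row)
```

Because no token-by-token score matrix is ever formed, doubling the number of image tokens roughly doubles the work in this sketch rather than quadrupling it, which is the kind of saving the abstract attributes to linear attention.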

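Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days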

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems