CAST:使用代用标记进行自关注聚类,实现高效变压器

IF 3.9 3区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pattern Recognition Letters Pub Date : 2024-09-06 DOI:10.1016/j.patrec.2024.08.024
Adjorn van Engelenhoven, Nicola Strisciuglio, Estefanía Talavera
{"title":"CAST:使用代用标记进行自关注聚类,实现高效变压器","authors":"Adjorn van Engelenhoven,&nbsp;Nicola Strisciuglio,&nbsp;Estefanía Talavera","doi":"10.1016/j.patrec.2024.08.024","DOIUrl":null,"url":null,"abstract":"<div><p>The Transformer architecture has shown to be a powerful tool for a wide range of tasks. It is based on the self-attention mechanism, which is an inherently computationally expensive operation with quadratic computational complexity: memory usage and compute time increase quadratically with the length of the input sequences, thus limiting the application of Transformers. In this work, we propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST), to optimize the attention computation and achieve efficient transformers. CAST utilizes learnable surrogate tokens to construct a cluster affinity matrix, used to cluster the input sequence and generate novel cluster summaries. The self-attention from within each cluster is then combined with the cluster summaries of other clusters, enabling information flow across the entire input sequence. CAST improves efficiency by reducing the complexity from <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> to <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mi>α</mi><mi>N</mi><mo>)</mo></mrow></mrow></math></span> where <span><math><mi>N</mi></math></span> is the sequence length, and <span><math><mi>α</mi></math></span> is constant according to the number of clusters and samples per cluster. We show that CAST performs better than or comparable to the baseline Transformers on long-range sequence modeling tasks, while also achieving higher results on time and memory efficiency than other efficient transformers.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 30-36"},"PeriodicalIF":3.9000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167865524002563/pdfft?md5=41d75a76c8436c27473bdc1f0c0144be&pid=1-s2.0-S0167865524002563-main.pdf","citationCount":"0","resultStr":"{\"title\":\"CAST: Clustering self-Attention using Surrogate Tokens for efficient transformers\",\"authors\":\"Adjorn van Engelenhoven,&nbsp;Nicola Strisciuglio,&nbsp;Estefanía Talavera\",\"doi\":\"10.1016/j.patrec.2024.08.024\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>The Transformer architecture has shown to be a powerful tool for a wide range of tasks. It is based on the self-attention mechanism, which is an inherently computationally expensive operation with quadratic computational complexity: memory usage and compute time increase quadratically with the length of the input sequences, thus limiting the application of Transformers. In this work, we propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST), to optimize the attention computation and achieve efficient transformers. CAST utilizes learnable surrogate tokens to construct a cluster affinity matrix, used to cluster the input sequence and generate novel cluster summaries. The self-attention from within each cluster is then combined with the cluster summaries of other clusters, enabling information flow across the entire input sequence. CAST improves efficiency by reducing the complexity from <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> to <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mi>α</mi><mi>N</mi><mo>)</mo></mrow></mrow></math></span> where <span><math><mi>N</mi></math></span> is the sequence length, and <span><math><mi>α</mi></math></span> is constant according to the number of clusters and samples per cluster. We show that CAST performs better than or comparable to the baseline Transformers on long-range sequence modeling tasks, while also achieving higher results on time and memory efficiency than other efficient transformers.</p></div>\",\"PeriodicalId\":54638,\"journal\":{\"name\":\"Pattern Recognition Letters\",\"volume\":\"186 \",\"pages\":\"Pages 30-36\"},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2024-09-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0167865524002563/pdfft?md5=41d75a76c8436c27473bdc1f0c0144be&pid=1-s2.0-S0167865524002563-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167865524002563\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865524002563","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

Transformer 架构已证明是执行各种任务的强大工具。它基于自我注意机制,而自我注意机制本身是一种计算昂贵的操作,其计算复杂度为二次方:内存使用量和计算时间随输入序列的长度呈二次方增长,从而限制了变形金刚的应用。在这项工作中,我们提出了一种使用代理标记的新型聚类自我注意力机制(CAST),以优化注意力计算并实现高效的变换器。CAST 利用可学习的代理标记来构建聚类亲和矩阵,用于对输入序列进行聚类并生成新的聚类摘要。然后,每个集群内的自我注意力与其他集群的集群摘要相结合,从而实现整个输入序列的信息流。CAST 提高了效率,将复杂度从 O(N2) 降低到 O(αN),其中 N 是序列长度,α 是根据簇数和每个簇的样本数确定的常数。我们的研究表明,CAST 在长程序列建模任务中的表现优于或媲美基线变换器,同时在时间和内存效率上也高于其他高效变换器。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
CAST: Clustering self-Attention using Surrogate Tokens for efficient transformers

The Transformer architecture has shown to be a powerful tool for a wide range of tasks. It is based on the self-attention mechanism, which is an inherently computationally expensive operation with quadratic computational complexity: memory usage and compute time increase quadratically with the length of the input sequences, thus limiting the application of Transformers. In this work, we propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST), to optimize the attention computation and achieve efficient transformers. CAST utilizes learnable surrogate tokens to construct a cluster affinity matrix, used to cluster the input sequence and generate novel cluster summaries. The self-attention from within each cluster is then combined with the cluster summaries of other clusters, enabling information flow across the entire input sequence. CAST improves efficiency by reducing the complexity from O(N2) to O(αN) where N is the sequence length, and α is constant according to the number of clusters and samples per cluster. We show that CAST performs better than or comparable to the baseline Transformers on long-range sequence modeling tasks, while also achieving higher results on time and memory efficiency than other efficient transformers.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Pattern Recognition Letters
Pattern Recognition Letters 工程技术-计算机:人工智能
CiteScore
12.40
自引率
5.90%
发文量
287
审稿时长
9.1 months
期刊介绍: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.
期刊最新文献
Personalized Federated Learning on long-tailed data via knowledge distillation and generated features Adaptive feature alignment for adversarial training Discrete diffusion models with Refined Language-Image Pre-trained representations for remote sensing image captioning A unified framework to stereotyped behavior detection for screening Autism Spectrum Disorder Explainable hypergraphs for gait based Parkinson classification
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1