Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences

IF 3.6 · CAS Tier 2 (Computer Science) · JCR Q2 (Computer Science, Hardware & Architecture) · IEEE Transactions on Computers · Pub Date: 2024-04-16 · DOI: 10.1109/TC.2024.3389507
Hulin Wang; Donglin Yang; Yaqi Xia; Zheng Zhang; Qigang Wang; Jianping Fan; Xiaobo Zhou; Dazhao Cheng
{"title":"Raptor-T:用于长序列和变长序列的融合且内存效率高的稀疏变换器","authors":"Hulin Wang;Donglin Yang;Yaqi Xia;Zheng Zhang;Qigang Wang;Jianping Fan;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TC.2024.3389507","DOIUrl":null,"url":null,"abstract":"Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the \n<inline-formula><tex-math>$O(n^{2})$</tex-math></inline-formula>\n complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1852-1865"},"PeriodicalIF":3.6000,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences\",\"authors\":\"Hulin Wang;Donglin Yang;Yaqi Xia;Zheng Zhang;Qigang Wang;Jianping Fan;Xiaobo Zhou;Dazhao Cheng\",\"doi\":\"10.1109/TC.2024.3389507\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the \\n<inline-formula><tex-math>$O(n^{2})$</tex-math></inline-formula>\\n complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. 
Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.\",\"PeriodicalId\":13087,\"journal\":{\"name\":\"IEEE Transactions on Computers\",\"volume\":\"73 7\",\"pages\":\"1852-1865\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2024-04-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Computers\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10500743/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10500743/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the $O(n^{2})$ complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.
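
To make the complexity claim concrete, the sketch below implements windowed (local) sparse attention in NumPy: each query attends only to a fixed neighborhood of keys, so the number of attention scores drops from $n^{2}$ to roughly $n(2w+1)$. This is a minimal illustration of the general sparse-attention idea the abstract refers to, not Raptor-T's fused GPU kernels; `windowed_attention` and the window size are illustrative choices.

```python
# Minimal sketch of windowed (local) sparse attention for a single head.
# Each query sees only +/- `window` positions, so work scales as O(n * w)
# rather than O(n^2). Illustrative only, not Raptor-T's kernel.
import numpy as np

def windowed_attention(q, k, v, window=64):
    """q, k, v: (n, d) arrays for one attention head."""
    n, d = q.shape
    out = np.empty_like(q)
    for i in range(n):                               # n queries ...
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = (q[i] @ k[lo:hi].T) / np.sqrt(d)    # ... x O(window) keys each
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out

n, d = 4096, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = windowed_attention(q, k, v)    # ~n * (2*64 + 1) scores instead of n * n
print(out.shape)                     # (4096, 64)
```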
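The "fused and memory-efficient Multi-Head Attention" mentioned above is typically built on online-softmax tiling: keys and values are processed block by block and partial results are renormalized on the fly, so the full n-by-n score matrix is never materialized (the scheme popularized by FlashAttention). The single-head NumPy sketch below shows that tiling for intuition only; it is not Raptor-T's kernel, and `tiled_attention` and `block` are illustrative names.

```python
# Online-softmax tiling: dense attention computed tile by tile without ever
# building the (n, n) score matrix. Illustrative sketch, not Raptor-T's kernel.
import numpy as np

def tiled_attention(q, k, v, block=256):
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((n, 1), -np.inf)        # running row-wise max of scores
    l = np.zeros((n, 1))                # running softmax denominator
    acc = np.zeros_like(q)              # running weighted sum of values
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale          # (n, block) scores for this tile only
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        p = np.exp(s - m_new)
        rescale = np.exp(m - m_new)     # correct previously accumulated results
        l = l * rescale + p.sum(axis=1, keepdims=True)
        acc = acc * rescale + p @ vb
        m = m_new
    return acc / l                      # final softmax normalization

n, d = 1024, 64
rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
s_full = q @ k.T / np.sqrt(d)
ref = np.exp(s_full - s_full.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), ref))   # True: same result, less memory
```

Peak temporary memory here is O(n * block) for the tile of scores plus O(n * d) for the accumulators, rather than O(n^2) for the full score matrix.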
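Minimizing padding for variable-length inputs usually means packing the real tokens of every sequence into one contiguous buffer and tracking sequence boundaries with cumulative lengths, so no compute or memory is spent on pad tokens. The NumPy sketch below shows that layout under the assumption of a zero-padded input batch; `pack_varlen` and `cu_seqlens` are illustrative names, since the abstract does not spell out Raptor-T's exact data layout.

```python
# Padding-free ("packed") layout for a variable-length batch. Illustrative only.
import numpy as np

def pack_varlen(padded, lengths):
    """padded: (batch, max_len, d) with pad rows; lengths: real token count per sequence."""
    packed = np.concatenate([padded[b, :n] for b, n in enumerate(lengths)], axis=0)
    cu_seqlens = np.concatenate(([0], np.cumsum(lengths)))   # sequence boundaries in `packed`
    return packed, cu_seqlens

batch, max_len, d = 3, 8, 4
lengths = [8, 3, 5]
rng = np.random.default_rng(2)
padded = np.zeros((batch, max_len, d))
for b, n in enumerate(lengths):
    padded[b, :n] = rng.standard_normal((n, d))

packed, cu_seqlens = pack_varlen(padded, lengths)
print(packed.shape)       # (16, 4): only real tokens are stored and computed on
print(cu_seqlens)         # [ 0  8 11 16]: sequence b spans packed[cu_seqlens[b]:cu_seqlens[b+1]]
```

A variable-length attention kernel can then iterate over the `cu_seqlens` boundaries and process each sequence slice independently, which is one way to keep GPU work balanced without padded rows.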
Source Journal
IEEE Transactions on Computers
Category: Engineering & Technology (Engineering: Electrical & Electronic)
CiteScore: 6.60
Self-citation rate: 5.40%
Articles published: 199
Review time: 6.0 months
Journal description: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.