Hardware-Software Co-Design of an In-Memory Transformer Network Accelerator

Frontiers in Electronics · Published: 2022-04-11 · DOI: 10.3389/felec.2022.847069 · Impact Factor 1.9, Q3 (Engineering, Electrical & Electronic)
Ann Franchesca Laguna, Mohammed Mehdi Sharifi, A. Kazemi, Xunzhao Yin, M. Niemier, Sharon Hu, Jae-sun Seo
{"title":"Hardware-Software Co-Design of an In-Memory Transformer Network Accelerator","authors":"Ann Franchesca Laguna, Mohammed Mehdi Sharifi, A. Kazemi, Xunzhao Yin, M. Niemier, Sharon Hu, Jae-sun Seo","doi":"10.3389/felec.2022.847069","DOIUrl":null,"url":null,"abstract":"Transformer networks have outperformed recurrent and convolutional neural networks in terms of accuracy in various sequential tasks. However, memory and compute bottlenecks prevent transformer networks from scaling to long sequences due to their high execution time and energy consumption. Different neural attention mechanisms have been proposed to lower computational load but still suffer from the memory bandwidth bottleneck. In-memory processing can help alleviate memory bottlenecks by reducing the transfer overhead between the memory and compute units, thus allowing transformer networks to scale to longer sequences. We propose an in-memory transformer network accelerator (iMTransformer) that uses a combination of crossbars and content-addressable memories to accelerate transformer networks. We accelerate transformer networks by (1) computing in-memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, and (3) exploiting the available parallelism in the attention mechanism computation. To reduce energy consumption, the following techniques are introduced: (1) a configurable attention selector is used to choose different sparse attention patterns, (2) a content-addressable memory aided locality sensitive hashing helps to filter the number of sequence elements by their importance, and (3) FeFET-based crossbars are used to store projection weights while CMOS-based crossbars are used as an attentional cache to store attention scores for later reuse. Using a CMOS-FeFET hybrid iMTransformer introduced a significant energy improvement compared to the CMOS-only iMTransformer. The CMOS-FeFET hybrid iMTransformer achieved an 8.96× delay improvement and 12.57× energy improvement for the Vanilla transformers compared to the GPU baseline at a sequence length of 512. Implementing BERT using CMOS-FeFET hybrid iMTransformer achieves 13.71× delay improvement and 8.95× delay improvement compared to the GPU baseline at sequence length of 512. The hybrid iMTransformer also achieves a throughput of 2.23 K samples/sec and 124.8 samples/s/W using the MLPerf benchmark using BERT-large and SQuAD 1.1 dataset, an 11× speedup and 7.92× energy improvement compared to the GPU baseline.","PeriodicalId":73081,"journal":{"name":"Frontiers in electronics","volume":" ","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in electronics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/felec.2022.847069","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 4

Abstract

Transformer networks have outperformed recurrent and convolutional neural networks in accuracy on a variety of sequential tasks. However, memory and compute bottlenecks, which manifest as high execution time and energy consumption, prevent transformer networks from scaling to long sequences. Different neural attention mechanisms have been proposed to lower the computational load, but they still suffer from the memory bandwidth bottleneck. In-memory processing can help alleviate memory bottlenecks by reducing the transfer overhead between memory and compute units, allowing transformer networks to scale to longer sequences. We propose an in-memory transformer network accelerator (iMTransformer) that uses a combination of crossbars and content-addressable memories (CAMs) to accelerate transformer networks. We accelerate transformer networks by (1) computing in memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, and (3) exploiting the parallelism available in the attention mechanism computation. To reduce energy consumption, we introduce the following techniques: (1) a configurable attention selector chooses among different sparse attention patterns, (2) a CAM-aided locality-sensitive hashing scheme filters sequence elements by their importance, and (3) FeFET-based crossbars store the projection weights while CMOS-based crossbars serve as an attentional cache that stores attention scores for later reuse. The CMOS-FeFET hybrid iMTransformer yields a significant energy improvement over the CMOS-only iMTransformer. For the vanilla transformer, the hybrid iMTransformer achieves an 8.96× delay improvement and a 12.57× energy improvement over the GPU baseline at a sequence length of 512. Implementing BERT on the CMOS-FeFET hybrid iMTransformer achieves 13.71× and 8.95× delay improvements over the GPU baseline at a sequence length of 512. The hybrid iMTransformer also achieves a throughput of 2.23 K samples/s and an efficiency of 124.8 samples/s/W on the MLPerf benchmark with BERT-Large and the SQuAD 1.1 dataset, an 11× speedup and a 7.92× energy improvement over the GPU baseline.
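To make the in-memory projection step concrete, the sketch below models an idealized analog crossbar matrix-vector product in NumPy: weights are quantized and stored as a differential pair of conductances, the input is applied as row voltages, and column currents accumulate the dot products in place. The differential encoding, 4-bit quantization, and function names are illustrative assumptions for this sketch, not parameters reported in the paper.

```python
import numpy as np

def crossbar_matvec(weights, x, bits=4, g_max=1.0):
    """Idealized in-memory matrix-vector multiply on an analog crossbar.

    `weights` has shape (n_rows, n_cols): rows are driven by the input
    voltages `x`, and each column wire sums currents (Kirchhoff's current
    law), producing one dot product per column without moving the weights.
    Signed weights are stored as a differential pair of conductances.
    All parameters here are illustrative, not values from the paper.
    """
    levels = 2 ** bits - 1
    w_max = np.abs(weights).max()
    if w_max == 0:
        return np.zeros(weights.shape[1])
    q = np.round(weights / w_max * levels) / levels    # quantized weights in [-1, 1]
    g_pos = np.clip(q, 0, None) * g_max                # conductances for positive parts
    g_neg = np.clip(-q, 0, None) * g_max               # conductances for negative parts
    i_out = g_pos.T @ x - g_neg.T @ x                  # differential column currents
    return i_out / g_max * w_max                       # rescale to the weight domain
```

Likewise, the CAM-aided locality-sensitive hashing described above can be approximated in software by hashing queries and keys with random hyperplanes and letting each query attend only to keys that fall into the same bucket. This is a generic LSH-filtered attention sketch rather than the paper's CAM implementation; the hash family, bucket count, and fallback behavior are assumptions made for illustration.

```python
import numpy as np

def lsh_buckets(x, planes):
    """SimHash-style LSH: the sign pattern of each row of x against random
    hyperplanes, packed into an integer bucket id."""
    bits = (x @ planes) > 0
    return bits.astype(int) @ (1 << np.arange(planes.shape[1]))

def filtered_attention(Q, K, V, n_hashes=4, seed=0):
    """Attention in which each query only attends to keys in its LSH bucket.

    A software stand-in for filtering sequence elements by importance;
    queries whose bucket matches no key fall back to full attention so the
    softmax stays well defined.
    """
    d = Q.shape[-1]
    planes = np.random.default_rng(seed).standard_normal((d, n_hashes))
    q_b, k_b = lsh_buckets(Q, planes), lsh_buckets(K, planes)
    dense = Q @ K.T / np.sqrt(d)                       # full attention scores
    mask = q_b[:, None] == k_b[None, :]                # keep same-bucket pairs only
    keep = mask | ~mask.any(axis=1, keepdims=True)     # fallback for empty rows
    scores = np.where(keep, dense, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V
```

As a usage note, swapping the dense score computation of a standard attention layer for `filtered_attention` reduces the number of key-value pairs each query touches, which is the effect the CAM-aided filter targets in hardware.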