Hardware-Software Co-Design of an In-Memory Transformer Network Accelerator

Frontiers in Electronics · IF 1.9 · Q3 (Engineering, Electrical & Electronic) · Published: 2022-04-11 · DOI: 10.3389/felec.2022.847069
Ann Franchesca Laguna, Mohammed Mehdi Sharifi, A. Kazemi, Xunzhao Yin, M. Niemier, Sharon Hu, Jae-sun Seo
{"title":"内存变压器网络加速器的软硬件协同设计","authors":"Ann Franchesca Laguna, Mohammed Mehdi Sharifi, A. Kazemi, Xunzhao Yin, M. Niemier, Sharon Hu, Jae-sun Seo","doi":"10.3389/felec.2022.847069","DOIUrl":null,"url":null,"abstract":"Transformer networks have outperformed recurrent and convolutional neural networks in terms of accuracy in various sequential tasks. However, memory and compute bottlenecks prevent transformer networks from scaling to long sequences due to their high execution time and energy consumption. Different neural attention mechanisms have been proposed to lower computational load but still suffer from the memory bandwidth bottleneck. In-memory processing can help alleviate memory bottlenecks by reducing the transfer overhead between the memory and compute units, thus allowing transformer networks to scale to longer sequences. We propose an in-memory transformer network accelerator (iMTransformer) that uses a combination of crossbars and content-addressable memories to accelerate transformer networks. We accelerate transformer networks by (1) computing in-memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, and (3) exploiting the available parallelism in the attention mechanism computation. To reduce energy consumption, the following techniques are introduced: (1) a configurable attention selector is used to choose different sparse attention patterns, (2) a content-addressable memory aided locality sensitive hashing helps to filter the number of sequence elements by their importance, and (3) FeFET-based crossbars are used to store projection weights while CMOS-based crossbars are used as an attentional cache to store attention scores for later reuse. Using a CMOS-FeFET hybrid iMTransformer introduced a significant energy improvement compared to the CMOS-only iMTransformer. The CMOS-FeFET hybrid iMTransformer achieved an 8.96× delay improvement and 12.57× energy improvement for the Vanilla transformers compared to the GPU baseline at a sequence length of 512. Implementing BERT using CMOS-FeFET hybrid iMTransformer achieves 13.71× delay improvement and 8.95× delay improvement compared to the GPU baseline at sequence length of 512. The hybrid iMTransformer also achieves a throughput of 2.23 K samples/sec and 124.8 samples/s/W using the MLPerf benchmark using BERT-large and SQuAD 1.1 dataset, an 11× speedup and 7.92× energy improvement compared to the GPU baseline.","PeriodicalId":73081,"journal":{"name":"Frontiers in electronics","volume":null,"pages":null},"PeriodicalIF":1.9000,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Hardware-Software Co-Design of an In-Memory Transformer Network Accelerator\",\"authors\":\"Ann Franchesca Laguna, Mohammed Mehdi Sharifi, A. Kazemi, Xunzhao Yin, M. Niemier, Sharon Hu, Jae-sun Seo\",\"doi\":\"10.3389/felec.2022.847069\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer networks have outperformed recurrent and convolutional neural networks in terms of accuracy in various sequential tasks. However, memory and compute bottlenecks prevent transformer networks from scaling to long sequences due to their high execution time and energy consumption. Different neural attention mechanisms have been proposed to lower computational load but still suffer from the memory bandwidth bottleneck. 
In-memory processing can help alleviate memory bottlenecks by reducing the transfer overhead between the memory and compute units, thus allowing transformer networks to scale to longer sequences. We propose an in-memory transformer network accelerator (iMTransformer) that uses a combination of crossbars and content-addressable memories to accelerate transformer networks. We accelerate transformer networks by (1) computing in-memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, and (3) exploiting the available parallelism in the attention mechanism computation. To reduce energy consumption, the following techniques are introduced: (1) a configurable attention selector is used to choose different sparse attention patterns, (2) a content-addressable memory aided locality sensitive hashing helps to filter the number of sequence elements by their importance, and (3) FeFET-based crossbars are used to store projection weights while CMOS-based crossbars are used as an attentional cache to store attention scores for later reuse. Using a CMOS-FeFET hybrid iMTransformer introduced a significant energy improvement compared to the CMOS-only iMTransformer. The CMOS-FeFET hybrid iMTransformer achieved an 8.96× delay improvement and 12.57× energy improvement for the Vanilla transformers compared to the GPU baseline at a sequence length of 512. Implementing BERT using CMOS-FeFET hybrid iMTransformer achieves 13.71× delay improvement and 8.95× delay improvement compared to the GPU baseline at sequence length of 512. The hybrid iMTransformer also achieves a throughput of 2.23 K samples/sec and 124.8 samples/s/W using the MLPerf benchmark using BERT-large and SQuAD 1.1 dataset, an 11× speedup and 7.92× energy improvement compared to the GPU baseline.\",\"PeriodicalId\":73081,\"journal\":{\"name\":\"Frontiers in electronics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2022-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers in electronics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3389/felec.2022.847069\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in electronics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/felec.2022.847069","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 4

Abstract

Transformer networks have outperformed recurrent and convolutional neural networks in accuracy on a variety of sequential tasks. However, memory and compute bottlenecks prevent transformer networks from scaling to long sequences because of their high execution time and energy consumption. Various neural attention mechanisms have been proposed to lower the computational load, but they still suffer from the memory bandwidth bottleneck. In-memory processing can help alleviate memory bottlenecks by reducing the transfer overhead between memory and compute units, allowing transformer networks to scale to longer sequences. We propose an in-memory transformer network accelerator (iMTransformer) that uses a combination of crossbars and content-addressable memories (CAMs) to accelerate transformer networks. We accelerate transformer networks by (1) computing in memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, and (3) exploiting the parallelism available in the attention mechanism computation. To reduce energy consumption, the following techniques are introduced: (1) a configurable attention selector chooses among different sparse attention patterns, (2) CAM-aided locality-sensitive hashing filters sequence elements by their importance, and (3) FeFET-based crossbars store the projection weights while CMOS-based crossbars serve as an attentional cache that stores attention scores for later reuse. The CMOS-FeFET hybrid iMTransformer yields a significant energy improvement over the CMOS-only iMTransformer. At a sequence length of 512, the hybrid iMTransformer achieves an 8.96× delay improvement and a 12.57× energy improvement for the vanilla transformer compared to the GPU baseline, and running BERT on it achieves a 13.71× delay improvement and an 8.95× energy improvement over the same baseline. On the MLPerf benchmark with BERT-large and the SQuAD 1.1 dataset, the hybrid iMTransformer also achieves a throughput of 2.23 K samples/s at 124.8 samples/s/W, an 11× speedup and a 7.92× energy improvement compared to the GPU baseline.
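The delay and energy gains described above hinge on two properties of attention that are visible even in software: the score matrix grows quadratically with sequence length, and the projected keys and values can be cached and reused rather than recomputed. The sketch below is a minimal NumPy model of scaled dot-product attention with such a cache; it only illustrates the computation that the accelerator maps onto crossbars (each projection is a matrix-vector multiply) and is not the paper's hardware or code. The function name, shapes, and sizes are assumptions chosen for illustration.

```python
# Minimal software model of scaled dot-product attention with a reusable
# key/value cache (an analogue of the paper's "attentional cache" idea).
import numpy as np

def attention_with_cache(x, w_q, w_k, w_v, cache=None):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q = x @ w_q                                # query projection (a crossbar MVM in hardware)
    if cache is None:                          # project K/V once; reuse on later calls
        cache = (x @ w_k, x @ w_v)
    k, v = cache
    scores = (q @ k.T) / np.sqrt(k.shape[-1])  # (seq_len, seq_len): the O(N^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, cache

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 512, 256, 64
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out, kv_cache = attention_with_cache(x, w_q, w_k, w_v)            # first pass fills the cache
out2, _ = attention_with_cache(x, w_q, w_k, w_v, cache=kv_cache)  # later passes reuse it
print(out.shape)  # (512, 64); the intermediate score matrix was 512 x 512
```

With the cache in place, only the query projection and the score computation remain on the critical path, which is the reuse that the attentional cache exploits.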
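The configurable attention selector reduces energy by choosing a sparse attention pattern, so that only a subset of score entries is ever computed. The following is a hedged software sketch of two common patterns, a local band and a strided variant; the pattern names, window, and stride values are illustrative assumptions, not the paper's exact configurations.

```python
# Sketch of a configurable sparse-attention mask: only entries where the mask
# is True would have their attention scores computed.
import numpy as np

def sparse_mask(seq_len, pattern="local", window=64, stride=128):
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    if pattern == "local":      # each position attends to a band around itself
        return np.abs(i - j) <= window
    if pattern == "strided":    # local band plus every stride-th position
        return (np.abs(i - j) <= window) | ((j % stride) == 0)
    return np.ones((seq_len, seq_len), dtype=bool)   # dense fallback

mask = sparse_mask(512, pattern="strided")
print(f"fraction of score entries actually computed: {mask.mean():.2%}")
```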
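The CAM-aided locality-sensitive hashing filter keeps only the sequence positions whose hash bucket matches the query's, so unimportant elements are skipped before any score is computed. The sketch below emulates that filtering in software with random-hyperplane hashing; the hash family, number of planes, and exact-match policy are assumptions chosen for illustration (in the hardware, the bucket lookup is performed by the CAM).

```python
# Software analogue of LSH-based filtering of sequence elements: keep only the
# keys whose random-hyperplane hash matches the query's bucket.
import numpy as np

def lsh_buckets(vectors, planes):
    """Sign of the projection onto each random hyperplane -> integer bucket id."""
    bits = (vectors @ planes.T) > 0                  # (n, n_planes) booleans
    return bits @ (1 << np.arange(planes.shape[0]))  # pack the sign bits into an int

rng = np.random.default_rng(1)
d_head, n_planes, seq_len = 64, 4, 512
planes = rng.standard_normal((n_planes, d_head))
q = rng.standard_normal(d_head)
keys = rng.standard_normal((seq_len, d_head))

q_bucket = lsh_buckets(q[None, :], planes)[0]
key_buckets = lsh_buckets(keys, planes)
kept = np.flatnonzero(key_buckets == q_bucket)       # CAM-style exact-match lookup
print(f"attending to {kept.size} of {seq_len} positions")
```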