A certain examination on heterogeneous systolic array (HSA) design for deep learning accelerations with low power computations

IF 5.7; Region 3 (Computer Science); JCR Q1, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE; Sustainable Computing-Informatics & Systems; Pub Date: 2024-12-01; Epub Date: 2024-10-11; DOI: 10.1016/j.suscom.2024.101042

Dinesh Kumar Jayaraman Rajanediran, C. Ganesh Babu, K. Priyadharsini

Volume 44, Article 101042. Publication type: Journal Article. Open access: no. Source: https://www.sciencedirect.com/science/article/pii/S2210537924000878

Citations: 0

Abstract

Acceleration techniques play a crucial role in enhancing the performance of modern high-speed computation, especially in Deep Learning (DL) applications, where speed is of utmost importance. One essential component in this context is the Systolic Array (SA), which handles computation and data movement in a rhythmic, pipelined manner. Google's Tensor Processing Unit (TPU) uses an SA for neural network workloads. The core of the SA's functionality and performance lies in the Computation Element (CE), which facilitates parallel data flow. In our article, we introduce a novel approach called Proposed Systolic Array (PSA), which is implemented on the CE and further enhanced with a modified Hybrid Kogge Stone adder (MHA). The resulting design, PSA-MHA, incorporates rounding and data extraction in the SA data model to expedite computation. Using a data flow model with the MHA, the PSA significantly accelerates data shifts and control passes in execution cycles. We validated our approach through simulations on the Cadence Virtuoso platform using 65 nm process technology, comparing it to the General Matrix Multiplication (GMMN) benchmark. The CE showed a 30.29 % reduction in delay, a 23.07 % reduction in area, and an 11.87 % reduction in power consumption. At the array level, the PSA achieved a 46.38 % reduction in delay, a 7.58 % reduction in area, and a 48.23 % decrease in Area Delay Product (ADP). To further substantiate our findings, we applied the PSA-based approach to pre-trained hybrid Convolutional and Recurrent (CNN-RNN) neural models. The PSA-based hybrid model incorporates 189 million Multiply-Accumulate (MAC) units, resulting in a weighted mean architecture value of 784.80 for the RNN component. We also explored variations in bit width, which led to delay reductions ranging from 20.17 % to 30.29 %, area variations between 13.08 % and 32.16 %, and power consumption changes ranging from 11.88 % to 20.42 %.
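
To make the two building blocks named in the abstract concrete, the following is a minimal Python sketch, not the authors' PSA or MHA. It assumes a textbook (unmodified) Kogge-Stone parallel-prefix adder, an output-stationary systolic array, unsigned operands, and Python's built-in `*` for the multiply step. The names `kogge_stone_add` and `systolic_matmul` and the `width` parameter are hypothetical and chosen only for illustration.

```python
def kogge_stone_add(a: int, b: int, width: int = 32) -> int:
    """Add two unsigned integers modulo 2**width with a Kogge-Stone prefix network.

    Textbook version only; the paper's modified hybrid adder (MHA) is not reproduced.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b            # per-bit generate
    p = a ^ b            # per-bit propagate
    p0 = p               # keep the original propagate bits for the final sum
    shift = 1
    while shift < width:
        # Each prefix stage doubles the span of bits covered by (g, p).
        g = g | (p & (g << shift))
        p = p & (p << shift)
        shift <<= 1
    carries = (g << 1) & mask   # carry into bit i is the group generate of bits 0..i-1
    return (p0 ^ carries) & mask


def systolic_matmul(A, B, width: int = 32):
    """Simulate an output-stationary systolic array computing A @ B cycle by cycle.

    Each computation element (CE) holds one accumulator; A operands move right,
    B operands move down, and inputs are skewed by one cycle per row/column.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per CE
    a_reg = [[0] * m for _ in range(n)]    # A operand currently held in each CE
    b_reg = [[0] * m for _ in range(n)]    # B operand currently held in each CE
    for t in range(n + m + k - 2):         # cycles until the last CE finishes
        for i in range(n):                 # shift A operands one CE to the right
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            idx = t - i                    # row i starts i cycles late (skew)
            a_reg[i][0] = A[i][idx] if 0 <= idx < k else 0
        for j in range(m):                 # shift B operands one CE downward
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            idx = t - j                    # column j starts j cycles late (skew)
            b_reg[0][j] = B[idx][j] if 0 <= idx < k else 0
        for i in range(n):                 # every CE performs one MAC per cycle
            for j in range(m):
                product = a_reg[i][j] * b_reg[i][j]
                acc[i][j] = kogge_stone_add(acc[i][j], product, width)
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In this sketch the adder sits on the accumulate path of every CE, which is where a faster adder would shorten the per-cycle critical path. How the paper's MHA and rounding scheme actually differ from this textbook form is not stated in the abstract.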



Source journal

Sustainable Computing-Informatics & Systems (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE; COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore: 10.70

Self-citation rate: 4.40%

Publications: 142

About the journal: Sustainable computing is a rapidly expanding research area spanning computer science and engineering, electrical engineering, and other engineering disciplines. The aim of Sustainable Computing: Informatics and Systems (SUSCOM) is to publish research findings related to energy-aware and thermal-aware management of computing resources. Equally important is a spectrum of related research issues, such as applications of computing that can have ecological and societal impacts. SUSCOM publishes original and timely research papers and survey articles on power, energy, temperature, and environment-related topics of current importance to readers. SUSCOM has an editorial board comprising prominent researchers from around the world and selects competitively evaluated, peer-reviewed papers.