The Effect of Asymmetric Transistor Aging on Systolic Arrays for Mission Critical Machine Learning Applications
Authors: Firas Ramadan; Gil Shomron; Freddy Gabbay
Journal: IEEE Access, vol. 13, pp. 44041-44061 (Journal Article)
Publication date: 2025-03-06
DOI: 10.1109/ACCESS.2025.3548966
Impact factor: 3.4 (JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS)
URL: https://ieeexplore.ieee.org/document/10915624/
Citations: 0
Abstract
Deep neural networks (DNNs) excel in various applications, such as computer vision, natural language processing, and other mission-critical systems. As the computational complexity of these models grows, there is an increasing need for specialized accelerators to handle the demanding workloads. In response, advancements in Very Large Scale Integration (VLSI) process nodes have significantly intensified the development of machine learning (ML) accelerators, offering enhanced transistor miniaturization and power efficiency. However, the susceptibility of these advanced nodes to transistor aging poses risks to ML accelerator performance, prediction accuracy, and reliability, which can impact the functional safety of mission-critical systems. This study focuses on the impact of asymmetric transistor aging, induced by Bias Temperature Instability (BTI), on systolic arrays (SAs), which are integral to many ML accelerators in mission-critical systems. Our aging-aware analysis indicates that SAs experience asymmetric aging, causing logical elements to age at varying rates. In addition, our simulations show that asymmetric transistor aging introduces persistent and transient faults in the SA’s datapath, compromising the overall resiliency of the ML model. Even with less than 1% transient failure events, the top-1 prediction accuracy of the ResNet-18 model drops by 32–50%, and with approximately 0.8% transient failure events that of PTQ4ViT drops by almost 90%. To address this issue, we propose new hardware mechanisms and design flow solutions that can successfully mitigate the impact of asymmetric transistor aging on ML accelerator reliability with minimal power and area overhead.
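To make the notion of transient faults in a datapath concrete, the following is a minimal sketch of fault injection into an integer matrix multiply, the operation a systolic array computes. The fault model is an assumption for illustration only and is not taken from the paper: each multiply-accumulate step independently suffers a single random bit flip in the accumulator with probability `fault_rate`. The function name, the parameters, and the 32-bit unsigned accumulator width are likewise hypothetical.

```python
import random


def matmul_with_transient_faults(a, b, fault_rate, rng, bit_width=32):
    """Multiply integer matrices a and b, flipping accumulator bits at random.

    Assumed fault model (illustrative, not the paper's): after every
    multiply-accumulate step, one random accumulator bit is flipped with
    probability fault_rate. Returns (result, number_of_injected_faults).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    mask = (1 << bit_width) - 1  # emulate a fixed-width unsigned accumulator
    out = [[0] * cols for _ in range(rows)]
    faults = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                # One multiply-accumulate step, as performed by a processing element.
                acc = (acc + a[i][k] * b[k][j]) & mask
                # Transient fault: flip one randomly chosen accumulator bit.
                if rng.random() < fault_rate:
                    acc ^= 1 << rng.randrange(bit_width)
                    faults += 1
            out[i][j] = acc
    return out, faults


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    # fault_rate=0.01 is an arbitrary illustrative value.
    result, injected = matmul_with_transient_faults(a, b, 0.01, random.Random(0))
    print(result, injected)
```

Under this assumed model, a flip in a high-order accumulator bit changes an output value by a large amount, which gives some intuition for how a small fraction of fault events could disturb a prediction. The accuracy figures quoted in the abstract come from the authors' own simulations, not from a model like this one.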
IEEE Access | COMPUTER SCIENCE, INFORMATION SYSTEMS | ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore
9.80
Self-citation rate
7.70%
Articles published
6673
Review time
6 weeks
About the journal:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.