
Latest publications in IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Modeling the Effect of SEUs on the Configuration Memory of SRAM-FPGA-Based CNN Accelerators
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-09-16 | DOI: 10.1109/JETCAS.2024.3460792
Zhen Gao;Jiaqi Feng;Shihui Gao;Qiang Liu;Guangjun Ge;Yu Wang;Pedro Reviriego
Convolutional Neural Networks (CNNs) are widely used in computer vision applications. SRAM-based Field Programmable Gate Arrays (SRAM-FPGAs) are popular for the acceleration of CNNs. Since SRAM-FPGAs are prone to soft errors, reliability evaluation and efficient fault-tolerance design become very important for the use of FPGA-based CNNs in safety-critical scenarios. Hardware-based fault injection is an effective approach for reliability evaluation, and the results can provide valuable references for fault-tolerance design. However, the complexity of building a fault injection platform poses a big obstacle for researchers working on fault-tolerance design. To remove this obstacle, this paper first performs a complete reliability evaluation for errors in the configuration memory of FPGA-based CNN accelerators, and then studies the impact of errors on the output feature maps of each layer. Based on the statistical analysis, we propose several fault models for the effect of SEUs on the configuration memory of FPGA-based CNN accelerators, and build a software simulator based on the fault models. Experiments show that the evaluation results from the software simulator are very close to those from hardware fault injection. Therefore, the proposed fault models and simulator can facilitate the fault-tolerance design and reliability evaluation of CNN accelerators.
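The authors' simulator is not reproduced here; the sketch below only illustrates the kind of random bit-flip injection such evaluations build on, assuming CNN parameters held in a float32 NumPy tensor (the function, fault count, and uniform-SEU assumption are ours, not the paper's fault models):

```python
import numpy as np

def inject_seus(weights, n_flips, seed=0):
    """Flip one random bit in each of n_flips randomly chosen float32
    words, mimicking single-event upsets in on-chip weight storage."""
    rng = np.random.default_rng(seed)
    faulty = np.array(weights, dtype=np.float32)   # private, contiguous copy
    bits = faulty.view(np.uint32).reshape(-1)      # reinterpret the raw bits
    idx = rng.choice(bits.size, size=n_flips, replace=False)
    flip = np.uint32(1) << rng.integers(0, 32, n_flips, dtype=np.uint32)
    bits[idx] ^= flip
    return faulty

# Corrupt a 3x3x64 kernel and inspect the worst-case weight deviation.
kernel = np.random.default_rng(1).normal(size=(3, 3, 64)).astype(np.float32)
print(np.abs(inject_seus(kernel, n_flips=5) - kernel).max())
```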
Citations: 0
Stealthy Backdoor Attack Against Federated Learning Through Frequency Domain by Backdoor Neuron Constraint and Model Camouflage
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-08-27 | DOI: 10.1109/JETCAS.2024.3450527
Yanqi Qiao;Dazhuang Liu;Rui Wang;Kaitai Liang
Federated Learning (FL) is a beneficial decentralized learning approach for preserving the privacy of the local datasets of distributed agents. However, the distributed nature of FL and untrustworthy data introduce vulnerability to backdoor attacks. In this attack scenario, an adversary manipulates its local data with a specific trigger and trains a malicious local model to implant the backdoor. During inference, the global model misbehaves on any input containing the trigger, producing the attacker-chosen prediction. Most existing backdoor attacks against FL focus on bypassing defense mechanisms, without considering the inspection of model parameters on the server. These attacks are susceptible to detection through dynamic clustering based on model-parameter similarity. Besides, current methods provide limited imperceptibility of their trigger in the spatial domain. To address these limitations, we propose a stealthy backdoor attack called “Chironex” against FL with an imperceptible trigger in frequency space to deliver attack effectiveness, stealthiness, and robustness against various countermeasures on FL. We first design a frequency trigger function to generate an imperceptible frequency trigger to evade human inspection. Then we fully exploit the attacker’s advantage to enhance attack robustness by estimating benign updates and analyzing the impact of the backdoor on model parameters through a task-sensitive neuron searcher. It disguises malicious updates as benign ones by reducing the impact of backdoor neurons that contribute greatly to the backdoor task based on activation value, and encouraging them to update towards benign model parameters trained by the attacker. We conduct extensive experiments on various image classifiers with real-world datasets to provide empirical evidence that Chironex can evade the most recent robust FL aggregation algorithms, and further achieve a distinctly higher attack success rate than existing attacks, without undermining the utility of the global model.
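Chironex's trigger function is not reproduced here; the sketch below only illustrates the general idea of hiding a trigger in the frequency domain, assuming a grayscale NumPy image (the perturbed band and amplitude are invented, not the paper's design):

```python
import numpy as np

def add_frequency_trigger(img, amp=2.0):
    """Nudge one mid-frequency coefficient of the 2-D spectrum; after the
    inverse FFT the spatial change is a faint global ripple, hard to spot."""
    spec = np.fft.fft2(img.astype(np.float64))
    h, w = img.shape
    spec[h // 4, w // 4] += amp * h * w / 64.0     # illustrative band/strength
    out = np.real(np.fft.ifft2(spec))
    return np.clip(out, 0.0, 255.0)

clean = np.full((32, 32), 128.0)                   # flat grey test image
poisoned = add_frequency_trigger(clean)
print("max pixel shift:", np.abs(poisoned - clean).max())   # tiny ripple
```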
Citations: 0
Chip and Package-Scale Interconnects for General-Purpose, Domain-Specific, and Quantum Computing Systems—Overview, Challenges, and Opportunities
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-08-19 | DOI: 10.1109/JETCAS.2024.3445829
Abhijit Das;Maurizio Palesi;John Kim;Partha Pratim Pande
The anticipated end of Moore’s law, coupled with the breakdown of Dennard scaling, has compelled everyone to conceive forthcoming computing systems for when transistors reach their limits. Three leading approaches to circumvent this situation are the chiplet paradigm, domain customisation and quantum computing. However, architectural and technological innovations have shifted the fundamental bottleneck from computation to communication. Hence, on-chip and on-package communication play a pivotal role in determining the performance, energy efficiency and scalability of general-purpose, domain-specific and quantum computing systems. This article reviews the recent advances in chip and package-scale interconnects driven by changes in architecture, application and technology. The primary objective of this article is to present the current status, key challenges, and impact-worthy opportunities in this research area from the perspective of hardware architectures. The secondary objective is to serve as a tutorial providing an overview of academic and industrial explorations in chip and package-scale communication infrastructure design for general-purpose, domain-specific and quantum computing systems.
Citations: 0
GNS: Graph-Based Network-on-Chip Shield for Early Defense Against Malicious Nodes in MPSoC
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-08-05 | DOI: 10.1109/JETCAS.2024.3438435
Haoyu Wang;Jianjie Ren;Basel Halak;Ahmad Atamli
In the rapidly evolving landscape of system design, Multi-Processor Systems-on-Chip (MPSoCs) have experienced significant growth in both scale and complexity by integrating an array of Intellectual Properties (IPs) through a Network-on-Chip (NoC) to execute complex parallel applications. However, this advancement has led to the emergence of security attacks caused by Malicious Third-Party IPs (M3PIPs), such as Denial-of-Service (DoS). Many current methods for detecting DoS attacks involve significant hardware overhead and are often inefficient at identifying anomalies at an early stage. Addressing this gap, we propose the Graph-based NoC Shield (GNS), a robust strategy meticulously crafted to detect, localize, and isolate malicious IPs at the very early stage of a DoS appearance. Central to our approach is the use of a Graph Neural Network (GNN) and Long Short-Term Memory (LSTM) detection model. This combination capitalizes on network traffic data and routing dependency graphs to efficiently trace the source of network congestion and pinpoint attackers. Our extensive experimental analysis validates the effectiveness of the GNS framework, demonstrating 98% detection accuracy and localization capability, achieved with a minimal hardware overhead of 1.8% in each router, based on a pure 4×4 mesh NoC system. The detection performance exceeds that of all other state-of-the-art works and of the most straightforward single machine-learning inference models in the same context. Additionally, the hardware overhead is notably lower than that of other security schemes. Another key feature of our system is the implementation of a credit-interposing mechanism, specifically designed to isolate M3PIPs engaging in flooding-based DoS and effectively mitigate the spread of malicious traffic. This approach significantly enhances the security of NoC-based MPSoCs, offering early-stage detection with superior accuracy compared to other models. Crucially, the GNS achieves this with up to 75% less hardware overhead than state-of-the-art solutions, thus striking a balance between efficiency and effectiveness in security implementation.
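The GNN/LSTM models are beyond a listing like this, but the routing-dependency idea can be sketched: walk upstream from congested routers until the congestion trail ends at an injecting endpoint (node names, the chain topology, and the threshold notion are illustrative):

```python
# Illustrative routing-dependency chain from a larger mesh: congestion seen
# at R1 is fed by R2, which is fed by R3, which is fed by local injector PE3.
upstream = {"R1": ["R2"], "R2": ["R3"], "R3": ["PE3"], "PE3": []}
congested = {"R1", "R2", "R3"}     # routers whose buffer occupancy is high

def trace_suspects(start, upstream, congested):
    """Walk the dependency graph upstream through congested routers; an
    injecting endpoint reached this way is a candidate malicious node."""
    suspects, stack, seen = [], [start], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for up in upstream[node]:
            if up in congested:            # trail of congestion continues
                stack.append(up)
            elif not up.startswith("R"):   # trail ends at an injector
                suspects.append(up)
    return suspects

print(trace_suspects("R1", upstream, congested))   # -> ['PE3']
```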
Citations: 0
Investigating Register Cache Behavior: Implications for CUDA and Tensor Core Workloads on GPUs
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-08-05 | DOI: 10.1109/JETCAS.2024.3439193
Vahid Geraeinejad;Qiran Qian;Masoumeh Ebrahimi
GPUs are extensively employed as the primary devices for running a broad spectrum of applications, covering general-purpose applications as well as Artificial Intelligence (AI) applications. The register file, as the largest SRAM on the GPU die, accounts for over 20% of the total GPU energy consumption. The register cache has been introduced to reduce traffic from the register file and thus decrease total energy consumption when CUDA cores are utilized. However, the utilization of the register cache has not been thoroughly investigated for Tensor Cores, which are integrated into recent GPU architectures to meet AI workload demands. In this paper, we study the usage of the register cache in both CUDA and Tensor Cores and conduct a thorough examination of its pros and cons. We have developed an open-source analytical simulator, called RFC-sim, to model and measure the energy consumption of both the register file and the register cache. Our results show that while the register cache can reduce energy consumption by up to 40% in CUDA cores, it increases energy consumption by up to 23% in Tensor Cores. The main reason lies in the limited space of the register cache, which is insufficient for Tensor Cores to capture locality.
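RFC-sim itself is the authors' open-source tool; the back-of-the-envelope model below captures the same trade-off with made-up per-access energies and hit rates (placeholders, not RFC-sim's calibrated values): a register cache pays off only when operand locality keeps the hit rate high.

```python
def rf_energy(accesses, e_rf):
    """Baseline: every operand access goes to the register file."""
    return accesses * e_rf

def rfc_energy(accesses, hit_rate, e_rc, e_rf):
    """With a register cache: hits pay e_rc; misses pay e_rc + e_rf
    (cache lookup plus the fill from the register file)."""
    hits = accesses * hit_rate
    return hits * e_rc + (accesses - hits) * (e_rc + e_rf)

A, E_RF, E_RC = 1e9, 2.0, 0.4          # accesses and pJ/access, placeholders
for hr in (0.9, 0.5, 0.1):             # high locality (CUDA-like) -> low
    saving = 1 - rfc_energy(A, hr, E_RC, E_RF) / rf_energy(A, E_RF)
    print(f"hit rate {hr:.0%}: energy saving {saving:+.0%}")
# High hit rates save energy; at low hit rates the extra lookups cost more
# than they save, mirroring the CUDA-core vs. Tensor-Core contrast above.
```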
Citations: 0
Dynamic Fault Tolerance Approach for Network-on-Chip Architecture
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-08-05 | DOI: 10.1109/JETCAS.2024.3438250
Kasem Khalil;Ashok Kumar;Magdy Bayoumi
Network-on-Chip (NoC) architecture provides speed-efficient and scalable communication in complex integrated circuits. Attaining fault tolerance in NoC architectures is an ongoing research problem aimed at enhancing the architecture’s reliability and performance. It seeks to mitigate the impact of router failures and enhance overall system robustness. Fault tolerance is achieved by adding extra hardware, and the research challenge is to attain high reliability, high Mean Time To Failure (MTTF), and a low Energy-Delay Product (EDP) while sacrificing only an acceptable amount of area. This is particularly vital for applications with uninterrupted data flow. This paper proposes a fault-tolerance approach for NoC systems, focusing on NoC routers, to yield increased reliability and MTTF with an acceptable area overhead and low EDP. The method introduces a dynamic reconfiguration mechanism that uses dynamic allocation of virtual channels and a bypass crossbar mechanism, ensuring uninterrupted data flow within the NoC. Evaluations of the proposed method are done on different mesh sizes using VHDL and an Altera 10GX FPGA, demonstrating the method’s superiority in reliability, reduced latency, and enhanced throughput. The results show that the proposed method has an acceptable area overhead of 25.3%, and its MTTF values are 3.7 to 18 times higher than those of traditional methods for varying network sizes, showing remarkable robustness against faults. The results show that the proposed method attains the best reported reliability with the least EDP. Additionally, a layout of the circuit is also created and studied.
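A toy sketch of the two mechanisms the abstract names, dynamic virtual-channel allocation and a bypass crossbar (the pool size, policy, and class shape are our illustrative assumptions, not the paper's RTL):

```python
class Router:
    """Toy router with a shared pool of virtual channels allocated on
    demand and a bypass path used when the local crossbar has failed."""
    def __init__(self, vc_pool=4):
        self.free_vcs = vc_pool
        self.crossbar_ok = True

    def allocate_vc(self):
        if self.free_vcs == 0:
            return None                # exert back-pressure upstream
        self.free_vcs -= 1
        return "vc"

    def forward(self, flit):
        path = "crossbar" if self.crossbar_ok else "bypass"
        return f"{flit} via {path}"

r = Router()
r.crossbar_ok = False                  # inject a crossbar fault
if r.allocate_vc():                    # traffic keeps flowing regardless
    print(r.forward("flit0"))          # -> flit0 via bypass
```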
Citations: 0
Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-08-02 | DOI: 10.1109/JETCAS.2024.3437408
Huidong Ji;Chen Ding;Boming Huang;Yuxiang Huan;Li-Rong Zheng;Zhuo Zou
The explosive development of convolutional neural networks (CNNs) benefits greatly from hardware-based acceleration to maintain low latency and high resource utilization. To enhance the processing efficiency of CNN algorithms, Field-Programmable Gate Array (FPGA)-based accelerators are designed with increased hardware resources to achieve high parallelism and throughput. However, bottlenecks arise when more processing elements (PEs) in the form of PE clusters are introduced, including 1) the under-utilization of the FPGA’s fixed hardware resources, which leads to a mismatch between effective and peak performance; and 2) the limited clock frequency caused by sophisticated routing and complex placement. In this paper, a 2-level hierarchical Network-on-Chip (NoC)-based CNN accelerator is proposed. In the upper level, a mesh-based NoC that interconnects multiple PE clusters is introduced. Such a design not only provides increased flexibility to balance different data communication models for better PE utilization and energy efficiency, but also enables a globally asynchronous, locally synchronous (GALS) architecture for better timing closure. At the lower level, local PEs are organized into a 3D-tiled PE cluster aiming to maximize data reuse by exploiting the inherent dataflow of convolutional networks. Implementation and experiments on a Xilinx ZU9EG FPGA for 4 benchmark CNN models (ResNet50, ResNet34, VGG16, and Darknet19) show that our work operates at a frequency of 300 MHz and delivers an effective throughput of 0.998 TOPS, 1.022 TOPS, 1.024 TOPS, and 1.026 TOPS, respectively. This corresponds to 92.85%, 95.1%, 95.25%, and 95.46% PE utilization. Compared with related FPGA-based designs, our work improves the resource efficiency of DSP by 5.36×, 1.62×, 1.96×, and 5.83×, respectively.
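As a quick sanity check on these figures, utilization is just effective throughput divided by peak throughput; the short script below (throughput and utilization numbers taken from the abstract, the ops-per-cycle inference is ours) shows that all four benchmarks imply the same peak:

```python
# (effective TOPS, PE utilization) pairs reported in the abstract
results = {"ResNet50": (0.998, 0.9285), "ResNet34": (1.022, 0.9510),
           "VGG16": (1.024, 0.9525), "Darknet19": (1.026, 0.9546)}

for net, (tops, util) in results.items():
    peak = tops / util                 # utilization = effective / peak
    print(f"{net}: implied peak = {peak:.3f} TOPS")
# All four imply the same ~1.075 TOPS peak, i.e. roughly 3580 ops/cycle
# (about 1790 MACs/cycle at 2 ops per MAC) at the 300 MHz clock.
```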
Citations: 0
Reinforcement Learning (RL)-Based Holistic Routing and Wavelength Assignment in Optical Network-on-Chip (ONoC): Distributed or Centralized?
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-07-30 | DOI: 10.1109/JETCAS.2024.3435721
Hui Li;Jiahe Zhao;Feiyang Liu
With the development of silicon photonic interconnects, Optical Network-on-Chip (ONoC) becomes promising for multi-core/many-core communication. In ONoCs, both routing and wavelength assignment have an impact on communication reliability and performance. However, the interactive impact of routing and wavelength assignment is rarely considered. To fill this gap, this work proposes an adaptive and holistic method of routing and wavelength assignment (RWA) based on Reinforcement Learning (RL) for ONoCs. Routing and wavelength assignment are treated as a single problem and participate in the same Markov decision process. Two corresponding implementation methods, distributed and centralized, are proposed, using intelligent learning algorithms to process and learn dynamic on-chip network information across multiple dimensions. Compared with considering routing and wavelength assignment separately in steps, the evaluation results show that the proposed holistic method improves OSNR, waiting delay, and wavelength utilization by 2.58 dB, 9.21%, and 53.26%, respectively, at the cost of a 16.15% loss in load balancing. As for the distributed and centralized methods, the distributed method improves OSNR and waiting delay by 0.37 dB and 0.69%, while the centralized method improves load balancing and wavelength utilization by 13.84% and 4.46%.
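The holistic formulation treats the (route, wavelength) pair as one action; a toy single-state RL sketch of that joint action space (paths, wavelength count, reward shape, and learning constants are all invented for illustration):

```python
import random

# Joint (route, wavelength) actions: the Q-table is indexed by the pair,
# so routing and wavelength assignment are learned together, not in steps.
routes, wavelengths = ["pathA", "pathB"], [0, 1, 2]
actions = [(r, w) for r in routes for w in wavelengths]
Q = {a: 0.0 for a in actions}
alpha, eps = 0.1, 0.2

def reward(route, wl):
    """Placeholder reward: shorter path and less-contended wavelength win."""
    return (1.0 if route == "pathA" else 0.6) - 0.1 * wl

random.seed(0)
for _ in range(2000):                  # single-state (bandit-style) update
    a = random.choice(actions) if random.random() < eps else max(Q, key=Q.get)
    Q[a] += alpha * (reward(*a) - Q[a])

print(max(Q, key=Q.get))               # -> ('pathA', 0)
```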
Citations: 0
An Ultra-Low Cost and Multicast-Enabled Asynchronous NoC for Neuromorphic Edge Computing
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-07-25 | DOI: 10.1109/JETCAS.2024.3433427
Zhe Su;Simone Ramini;Demetra Coffen Marcolin;Alessandro Veronesi;Milos Krstic;Giacomo Indiveri;Davide Bertozzi;Steven M. Nowick
Biological brains are increasingly taken as a guide toward more efficient forms of computing. The latest frontier considers the use of spiking neural-network-based neuromorphic processors for near-sensor data processing, in order to fit the tight power and resource budgets of edge computing devices. However, a prevailing focus on brain-inspired computing and storage primitives in the design of neuromorphic systems is currently bringing a fundamental bottleneck to the forefront: chip-scale communications. While communication architectures (typically, a network-on-chip) are generally inspired by, or even borrowed from, general purpose computing, neuromorphic communications exhibit unique characteristics: they consist of the event-driven routing of small amounts of information to a large number of destinations within tight area and power budgets. This article aims at an inflection point in network-on-chip design for brain-inspired communications, revolving around the combination of cost-effective and robust asynchronous design, architecture specialization for short messaging and lightweight hardware support for tree-based multicast. When validated with functional spiking neural network traffic, the proposed NoC delivers energy savings ranging from 42% to 71% over a state-of-the-art NoC used in a real multi-core neuromorphic processor for edge computing applications.
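A sketch of why tree-based multicast support matters for spike traffic: one event traverses each tree link once instead of once per destination (the topology and hop counts below are illustrative):

```python
# One spike event fanned out along a multicast tree: each tree link is
# traversed once, versus one traversal per destination for pure unicast.
tree = {"R0": ["R1", "R2"], "R1": ["core0", "core1"], "R2": ["core2", "core3"]}

def multicast(node, hops=0):
    """Recursively forward the event to children, counting link traversals."""
    for child in tree.get(node, []):
        hops = multicast(child, hops + 1)
    return hops

print("multicast link traversals:", multicast("R0"))   # 6 (one per edge)
print("unicast link traversals:  ", 4 * 2)             # 4 cores x 2 hops
```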
Citations: 0
Design and Analysis of 3D Integrated Folded Ferro-Capacitive Crossbar Array (FC²A) for Brain-Inspired Computing System
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2024-07-22 | DOI: 10.1109/JETCAS.2024.3432458
Sherin A. Thomas;Suyash Kushwaha;Rohit Sharma;Devarshi Mrinal Das
This paper presents a novel 3D folded capacitive synaptic crossbar array designed for in-memory computing architectures. In this architecture, the bitline is folded over the wordline to enhance the synaptic density. The proposed folded capacitive crossbar array (FC²A) architecture decreases the wordline interconnect length and the physical crossbar area by 50%. Thus, it helps to reduce the crossbar-associated parasitics and optimize space utilization. The proposed folded capacitive synaptic crossbar is used to design a brain-inspired computing system (BiCoS) that recognizes different patterns using CMOS technology. BiCoS systems are prone to various reliability issues caused by the crossbar’s parasitics. Hence, a Q3D model of the 3D folded capacitive crossbar is developed to investigate the crossbar-associated parasitics, and its effect on the proposed system is analyzed. The impact of crossbar parasitics is investigated for two cases: first, how the three different spiking patterns (regular spiking, fast spiking, and chattering) of the Izhikevich neuron change for different crossbar sizes; second, the impact on the pattern recognition rate, which drops to 70%. Addressing these challenges is critical to ensure the correct and robust operation of the proposed system. Therefore, we propose a solution to effectively overcome and resolve these adverse effects. The energy consumed to recognize each pattern is calculated, and the average energy needed is 0.25 nJ, which is significantly less than in other state-of-the-art works. The circuit is implemented using 65 nm standard CMOS technology.
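The three firing regimes studied are the classic Izhikevich patterns; the sketch below reproduces those baseline neuron dynamics with the standard published parameter sets (the input current, time step, and Euler integration are illustrative choices, unrelated to the crossbar circuit itself):

```python
# Standard Izhikevich parameters (a, b, c, d) for the three firing patterns
# named in the abstract: regular spiking, fast spiking, and chattering.
PATTERNS = {"regular": (0.02, 0.2, -65.0, 8.0),
            "fast":    (0.10, 0.2, -65.0, 2.0),
            "chatter": (0.02, 0.2, -50.0, 2.0)}

def izhikevich(a, b, c, d, I=10.0, T=1000.0, dt=0.5):
    """Integrate v' = 0.04v^2 + 5v + 140 - u + I, u' = a(bv - u);
    on v >= 30 mV emit a spike and reset v <- c, u <- u + d."""
    v, u, spikes = c, b * c, 0
    for _ in range(int(T / dt)):
        v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:
            v, u, spikes = c, u + d, spikes + 1
    return spikes

for name, params in PATTERNS.items():
    print(f"{name:8s}: {izhikevich(*params)} spikes in 1 s")
```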
Citations: 0