IEEE Journal on Emerging and Selected Topics in Circuits and Systems最新文献_第4页

Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration 基于 NoC 的通信感知和资源高效的 CNN 加速体系结构

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-08-02 DOI: 10.1109/JETCAS.2024.3437408

Huidong Ji;Chen Ding;Boming Huang;Yuxiang Huan;Li-Rong Zheng;Zhuo Zou

Exploding development of convolutional neural network (CNN) benefits greatly from the hardware-based acceleration to maintain low latency and high utilization of resources. To enhance the processing efficiency of CNN algorithms, Field Programming Gate Array (FPGA)-based accelerators are designed with increased hardware resources to achieve high parallelism and throughput. However, there exist bottlenecks when more processing elements (PEs) in the form of PE clusters are introduced, including 1) the under-utilization of FPGA’s fixed hardware resources, which leads to the effective and peak performance mismatch; and 2) the limited clock frequency caused by the sophisticated routing and complex placement. In this paper, a 2-level hierarchical Network-on-Chip (NoC)-based CNN accelerator is proposed. In the upper level, a mesh-based NoC that interconnects multiple PE clusters is introduced. Such a design not only provides increased flexibility to balance different data communication models for better PE utilization and energy efficiency but also enables globally asynchronous, locally synchronous (GALS) architecture for better timing closure. At the lower level, local PEs are organized into a 3D-tiled PE cluster aiming to maximize the data reuse exploiting inherent dataflow of the convolution networks. Implementation and experiments on Xilinx ZU9EG FPGA for 4 benchmark CNN models: ResNet50, ResNet34, VGG16, and Darknet19 show that our work operates at a frequency of 300 MHz and delivers an effective throughput of 0.998 TOPS, 1.022 TOPS, 1.024 TOPS, and 1.026 TOPS. This result corresponds to 92.85%, 95.1%, 95.25%, and 95.46% PE utilization. Compared with the related FPGA-based designs, our work improves the resource efficiency of DSP by

$5.36times $

,

$1.62times $

,

$1.96times $

, and

$5.83times $

, respectively.

卷积神经网络（CNN）的爆炸式发展极大地受益于基于硬件的加速，以保持低延迟和高资源利用率。为了提高 CNN 算法的处理效率，基于现场编程门阵列（FPGA）的加速器在设计时增加了硬件资源，以实现高并行性和高吞吐量。然而，当引入更多的处理元件（PE）以 PE 簇的形式存在时，就会出现瓶颈，包括：1）FPGA 的固定硬件资源利用率不足，导致有效性能和峰值性能不匹配；2）复杂的布线和复杂的布局导致时钟频率受限。本文提出了一种基于 2 层分级片上网络（NoC）的 CNN 加速器。在上层，引入了一个基于网状的 NoC，将多个 PE 集群互连起来。这种设计不仅提高了平衡不同数据通信模型的灵活性，从而提高了 PE 的利用率和能效，还实现了全局异步、局部同步（GALS）架构，从而提高了时序闭合性。在底层，本地 PE 被组织成三维平铺 PE 集群，目的是利用卷积网络固有的数据流最大限度地提高数据重用率。在 Xilinx ZU9EG FPGA 上对 4 个基准 CNN 模型进行了实现和实验：ResNet50、ResNet34、VGG16 和 Darknet19 表明，我们的工作在 300 MHz 频率下运行，有效吞吐量分别为 0.998 TOPS、1.022 TOPS、1.024 TOPS 和 1.026 TOPS。这一结果相当于 92.85%、95.1%、95.25% 和 95.46% 的 PE 利用率。与基于FPGA的相关设计相比，我们的工作提高了DSP的资源效率，分别为5.36（times）美元、1.62（times）美元、1.96（times）美元和5.83（times）美元。

{"title":"Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration","authors":"Huidong Ji;Chen Ding;Boming Huang;Yuxiang Huan;Li-Rong Zheng;Zhuo Zou","doi":"10.1109/JETCAS.2024.3437408","DOIUrl":"10.1109/JETCAS.2024.3437408","url":null,"abstract":"Exploding development of convolutional neural network (CNN) benefits greatly from the hardware-based acceleration to maintain low latency and high utilization of resources. To enhance the processing efficiency of CNN algorithms, Field Programming Gate Array (FPGA)-based accelerators are designed with increased hardware resources to achieve high parallelism and throughput. However, there exist bottlenecks when more processing elements (PEs) in the form of PE clusters are introduced, including 1) the under-utilization of FPGA’s fixed hardware resources, which leads to the effective and peak performance mismatch; and 2) the limited clock frequency caused by the sophisticated routing and complex placement. In this paper, a 2-level hierarchical Network-on-Chip (NoC)-based CNN accelerator is proposed. In the upper level, a mesh-based NoC that interconnects multiple PE clusters is introduced. Such a design not only provides increased flexibility to balance different data communication models for better PE utilization and energy efficiency but also enables globally asynchronous, locally synchronous (GALS) architecture for better timing closure. At the lower level, local PEs are organized into a 3D-tiled PE cluster aiming to maximize the data reuse exploiting inherent dataflow of the convolution networks. Implementation and experiments on Xilinx ZU9EG FPGA for 4 benchmark CNN models: ResNet50, ResNet34, VGG16, and Darknet19 show that our work operates at a frequency of 300 MHz and delivers an effective throughput of 0.998 TOPS, 1.022 TOPS, 1.024 TOPS, and 1.026 TOPS. This result corresponds to 92.85%, 95.1%, 95.25%, and 95.46% PE utilization. Compared with the related FPGA-based designs, our work improves the resource efficiency of DSP by \u0000<inline-formula> <tex-math>$5.36times $ </tex-math></inline-formula>\u0000, \u0000<inline-formula> <tex-math>$1.62times $ </tex-math></inline-formula>\u0000, \u0000<inline-formula> <tex-math>$1.96times $ </tex-math></inline-formula>\u0000, and \u0000<inline-formula> <tex-math>$5.83times $ </tex-math></inline-formula>\u0000, respectively.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"440-454"},"PeriodicalIF":3.7,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141883463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Reinforcement Learning (RL)-Based Holistic Routing and Wavelength Assignment in Optical Network-on-Chip (ONoC): Distributed or Centralized? 基于强化学习（RL）的片上光网络（ONoC）整体路由和波长分配：分布式还是集中式？

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-30 DOI: 10.1109/JETCAS.2024.3435721

Hui Li;Jiahe Zhao;Feiyang Liu

With the development of silicon photonic interconnects, Optical Network-on-Chip (ONoC) becomes promising for multi-core/many-core communication. In ONoCs, both routing and wavelength assignment have an impact on the communication reliability and performance. However, the interactive impact of the routing and wavelength assignment is rarely considered. To fill this gap, this work proposes an adaptive and holistic method of routing and wavelength assignment (RWA) based on Reinforcement Learning (RL) for ONoCs. Routing and wavelength assignment is treated as a whole problem and participate in the same Markov decision process. Two corresponding implementation methods, i.e., distributed and centralized, are proposed, by using intelligent learning algorithms to process and learn the dynamic on-chip network information in multi-dimensional. Instead of considering routing and wavelength assignment separately in steps, the evaluation results show that the proposed holistic method improves by 2.58 dB, 9.21%, and 53.26% in the aspects of OSNR, waiting delay, and wavelength utilization respectively, in cost of 16.15% loss of load balancing. As for the distributed method and centralized method, the distributed method improves by 0.37 dB and 0.69% in the aspects of OSNR and waiting delay, but the centralized method improves by 13.84% and 4.46% in the aspects of load balancing and wavelength utilization.

随着硅光子互连技术的发展，片上光网络（ONoC）在多核/多核通信方面大有可为。在 ONoC 中，路由选择和波长分配都会对通信可靠性和性能产生影响。然而，路由和波长分配的交互影响却很少被考虑。为填补这一空白，本研究提出了一种基于强化学习（RL）的自适应整体路由和波长分配（RWA）方法。路由和波长分配被视为一个整体问题，参与同一个马尔可夫决策过程。通过使用智能学习算法多维度处理和学习片上动态网络信息，提出了分布式和集中式两种相应的实现方法。评估结果表明，所提出的整体方法在 OSNR、等待延迟和波长利用率方面分别提高了 2.58 dB、9.21% 和 53.26%，而代价是负载平衡损失了 16.15%。至于分布式方法和集中式方法，分布式方法在 OSNR 和等待延迟方面分别提高了 0.37 dB 和 0.69%，但集中式方法在负载平衡和波长利用率方面分别提高了 13.84% 和 4.46%。

{"title":"Reinforcement Learning (RL)-Based Holistic Routing and Wavelength Assignment in Optical Network-on-Chip (ONoC): Distributed or Centralized?","authors":"Hui Li;Jiahe Zhao;Feiyang Liu","doi":"10.1109/JETCAS.2024.3435721","DOIUrl":"10.1109/JETCAS.2024.3435721","url":null,"abstract":"With the development of silicon photonic interconnects, Optical Network-on-Chip (ONoC) becomes promising for multi-core/many-core communication. In ONoCs, both routing and wavelength assignment have an impact on the communication reliability and performance. However, the interactive impact of the routing and wavelength assignment is rarely considered. To fill this gap, this work proposes an adaptive and holistic method of routing and wavelength assignment (RWA) based on Reinforcement Learning (RL) for ONoCs. Routing and wavelength assignment is treated as a whole problem and participate in the same Markov decision process. Two corresponding implementation methods, i.e., distributed and centralized, are proposed, by using intelligent learning algorithms to process and learn the dynamic on-chip network information in multi-dimensional. Instead of considering routing and wavelength assignment separately in steps, the evaluation results show that the proposed holistic method improves by 2.58 dB, 9.21%, and 53.26% in the aspects of OSNR, waiting delay, and wavelength utilization respectively, in cost of 16.15% loss of load balancing. As for the distributed method and centralized method, the distributed method improves by 0.37 dB and 0.69% in the aspects of OSNR and waiting delay, but the centralized method improves by 13.84% and 4.46% in the aspects of load balancing and wavelength utilization.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"534-550"},"PeriodicalIF":3.7,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Ultra-Low Cost and Multicast-Enabled Asynchronous NoC for Neuromorphic Edge Computing 用于神经形态边缘计算的超低成本多播异步 NoC

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-25 DOI: 10.1109/JETCAS.2024.3433427

Zhe Su;Simone Ramini;Demetra Coffen Marcolin;Alessandro Veronesi;Milos Krstic;Giacomo Indiveri;Davide Bertozzi;Steven M. Nowick

Biological brains are increasingly taken as a guide toward more efficient forms of computing. The latest frontier considers the use of spiking neural-network-based neuromorphic processors for near-sensor data processing, in order to fit the tight power and resource budgets of edge computing devices. However, a prevailing focus on brain-inspired computing and storage primitives in the design of neuromorphic systems is currently bringing a fundamental bottleneck to the forefront: chip-scale communications. While communication architectures (typically, a network-on-chip) are generally inspired by, or even borrowed from, general purpose computing, neuromorphic communications exhibit unique characteristics: they consist of the event-driven routing of small amounts of information to a large number of destinations within tight area and power budgets. This article aims at an inflection point in network-on-chip design for brain-inspired communications, revolving around the combination of cost-effective and robust asynchronous design, architecture specialization for short messaging and lightweight hardware support for tree-based multicast. When validated with functional spiking neural network traffic, the proposed NoC delivers energy savings ranging from 42% to 71% over a state-of-the-art NoC used in a real multi-core neuromorphic processor for edge computing applications.

生物大脑越来越多地被用作更高效计算形式的指南。最新的前沿技术考虑使用基于尖峰神经网络的神经形态处理器进行近距离传感器数据处理，以适应边缘计算设备紧张的功率和资源预算。然而，在神经形态系统的设计中，对大脑启发计算和存储基元的普遍关注正将一个基本瓶颈推向前沿：芯片级通信。虽然通信架构（通常是片上网络）通常受到通用计算的启发，甚至是从通用计算中借鉴而来，但神经形态通信表现出独特的特点：它们包括在严格的面积和功耗预算内，将少量信息以事件驱动的方式路由到大量目的地。本文的目标是为大脑启发通信的片上网络设计找到一个拐点，将经济高效、稳健的异步设计、短信息传输的架构专业化以及基于树状组播的轻量级硬件支持结合起来。经功能性尖峰神经网络流量验证，与用于边缘计算应用的实际多核神经形态处理器中使用的最先进 NoC 相比，所提出的 NoC 可节省 42% 至 71% 的能源。

{"title":"An Ultra-Low Cost and Multicast-Enabled Asynchronous NoC for Neuromorphic Edge Computing","authors":"Zhe Su;Simone Ramini;Demetra Coffen Marcolin;Alessandro Veronesi;Milos Krstic;Giacomo Indiveri;Davide Bertozzi;Steven M. Nowick","doi":"10.1109/JETCAS.2024.3433427","DOIUrl":"10.1109/JETCAS.2024.3433427","url":null,"abstract":"Biological brains are increasingly taken as a guide toward more efficient forms of computing. The latest frontier considers the use of spiking neural-network-based neuromorphic processors for near-sensor data processing, in order to fit the tight power and resource budgets of edge computing devices. However, a prevailing focus on brain-inspired computing and storage primitives in the design of neuromorphic systems is currently bringing a fundamental bottleneck to the forefront: chip-scale communications. While communication architectures (typically, a network-on-chip) are generally inspired by, or even borrowed from, general purpose computing, neuromorphic communications exhibit unique characteristics: they consist of the event-driven routing of small amounts of information to a large number of destinations within tight area and power budgets. This article aims at an inflection point in network-on-chip design for brain-inspired communications, revolving around the combination of cost-effective and robust asynchronous design, architecture specialization for short messaging and lightweight hardware support for tree-based multicast. When validated with functional spiking neural network traffic, the proposed NoC delivers energy savings ranging from 42% to 71% over a state-of-the-art NoC used in a real multi-core neuromorphic processor for edge computing applications.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"409-424"},"PeriodicalIF":3.7,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10609786","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141775221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Design and Analysis of 3D Integrated Folded Ferro-Capacitive Crossbar Array (FC²A) for Brain-Inspired Computing System 用于脑启发计算系统的三维集成折叠式铁电容横杆阵列（FC 2 A）的设计与分析

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-22 DOI: 10.1109/JETCAS.2024.3432458

Sherin A. Thomas;Suyash Kushwaha;Rohit Sharma;Devarshi Mrinal Das

This paper presents a novel 3D folded capacitive synaptic crossbar array designed for in-memory computing architectures. In this architecture, the bitline is folded over the wordline to enhance the synaptic density. The proposed folded capacitive crossbar array (

$FC^{2}A$

) architecture decreases the wordline interconnect length and physical crossbar area by 50%. Thus, it helps to reduce the crossbar-associated parasitics and optimize space utilization. The proposed folded capacitive synaptic crossbar is used for designing a brain-inspired computing system (BiCoS) to recognize different patterns using CMOS technology. The BiCoS systems are prone to various reliability issues caused by the crossbar’s parasitics. Hence, the 3D folded capacitive crossbar’s Q3D model is developed to investigate the crossbar-associated parasitics and its effect on the proposed system is analyzed. The impact of crossbar parasitics is investigated for two cases: Firstly, how the three different spiking patterns (regular spiking, fast-spiking, and chattering) of the Izhikevich neuron change for the different crossbar sizes. Secondly, the impact is analyzed on the pattern recognition rate, which gets reduced to 70%. Addressing these challenges is critical to ensure the correct and robust working of the proposed system. Therefore, we propose a solution to effectively overcome and resolve these adverse effects. The energy consumed to recognize each pattern is calculated, and the average energy needed is

$0.25,nJ$

, which is significantly less when compared to the other state-of-the-art works. The circuit is implemented using 65nm standard CMOS technology.

本文介绍了一种专为内存计算架构设计的新型三维折叠电容式突触横杆阵列。在该架构中，位线折叠在字线上，以提高突触密度。所提出的折叠电容式交叉条阵列（$FC^{2}A$）架构可将字线互连长度和物理交叉条面积减少 50%。因此，它有助于减少与横梁相关的寄生效应，优化空间利用率。所提出的折叠式电容突触横杆可用于设计大脑启发计算系统（BiCoS），利用 CMOS 技术识别不同的模式。BiCoS 系统容易因横梁的寄生效应而产生各种可靠性问题。因此，我们开发了三维折叠电容横梁 Q3D 模型来研究横梁相关寄生件，并分析其对拟议系统的影响。横梁寄生的影响分为两种情况：首先，伊齐克维奇神经元的三种不同尖峰模式（常规尖峰、快速尖峰和颤振）在不同横梁尺寸下的变化情况。其次，分析其对模式识别率的影响，即模式识别率降低到 70%。应对这些挑战对于确保拟议系统的正确和稳健工作至关重要。因此，我们提出了有效克服和解决这些不利影响的解决方案。我们计算了识别每个图案所消耗的能量，平均所需的能量为 0.25 美元，与其他最先进的作品相比明显降低。电路采用 65nm 标准 CMOS 技术实现。

{"title":"Design and Analysis of 3D Integrated Folded Ferro-Capacitive Crossbar Array (FC²A) for Brain-Inspired Computing System","authors":"Sherin A. Thomas;Suyash Kushwaha;Rohit Sharma;Devarshi Mrinal Das","doi":"10.1109/JETCAS.2024.3432458","DOIUrl":"10.1109/JETCAS.2024.3432458","url":null,"abstract":"This paper presents a novel 3D folded capacitive synaptic crossbar array designed for in-memory computing architectures. In this architecture, the bitline is folded over the wordline to enhance the synaptic density. The proposed folded capacitive crossbar array (\u0000<inline-formula> <tex-math>$FC^{2}A$ </tex-math></inline-formula>\u0000) architecture decreases the wordline interconnect length and physical crossbar area by 50%. Thus, it helps to reduce the crossbar-associated parasitics and optimize space utilization. The proposed folded capacitive synaptic crossbar is used for designing a brain-inspired computing system (BiCoS) to recognize different patterns using CMOS technology. The BiCoS systems are prone to various reliability issues caused by the crossbar’s parasitics. Hence, the 3D folded capacitive crossbar’s Q3D model is developed to investigate the crossbar-associated parasitics and its effect on the proposed system is analyzed. The impact of crossbar parasitics is investigated for two cases: Firstly, how the three different spiking patterns (regular spiking, fast-spiking, and chattering) of the Izhikevich neuron change for the different crossbar sizes. Secondly, the impact is analyzed on the pattern recognition rate, which gets reduced to 70%. Addressing these challenges is critical to ensure the correct and robust working of the proposed system. Therefore, we propose a solution to effectively overcome and resolve these adverse effects. The energy consumed to recognize each pattern is calculated, and the average energy needed is \u0000<inline-formula> <tex-math>$0.25,nJ$ </tex-math></inline-formula>\u0000, which is significantly less when compared to the other state-of-the-art works. The circuit is implemented using 65nm standard CMOS technology.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"563-574"},"PeriodicalIF":3.7,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141775222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SwInt: A Non-Blocking Switch-Based Silicon Photonic Interposer Network for 2.5D Machine Learning Accelerators SwInt：用于 2.5D 机器学习加速器的基于无阻塞开关的硅光子集成电路网络

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-16 DOI: 10.1109/JETCAS.2024.3429354

Ebadollah Taheri;Mohammad Amin Mahdian;Sudeep Pasricha;Mahdi Nikdast

The surging demand for machine learning (ML) applications has emphasized the pressing need for efficient ML accelerators capable of addressing the computational and energy demands of increasingly complex ML models. However, the conventional monolithic design of large-scale ML accelerators on a single chip often entails prohibitively high fabrication costs. To address this challenge, this paper proposes a 2.5D chiplet-based architecture based on a silicon photonic interposer, called SwInt, to enable high bandwidth, low latency, and energy-efficient data movement on the interposer, for ML applications. Existing silicon photonic interposer implementations suffer from high power consumption attributed to their inefficient network designs, primarily relying on bus-based communication. Bus-based communication is not scalable, as it suffers from high power consumption of the optical laser due to cumulative losses on the readers and writers when the bandwidth per waveguide (i.e., wavelength division multiplexing degree) increases or the number of processing elements in ML accelerators scales up. SwInt incorporates a novel switch-based network designed using Mach-Zehnder Interferometer (MZI)-based switch cells for offering scalable interposer communication and reducing power consumption. The designed switch architecture avoids blocking using an efficient design, while minimizing the number of stages to offer a low-loss switch. Furthermore, the MZI switch cells are designed with a dividing state, enabling energy-efficient broadcast communication over the interposer and supporting broadcasting demand in ML accelerators. Additionally, we optimized and fabricated silicon photonic devices, Microring Resonators (MRRs) and MZIs, which are integral components of our network architecture. Our analysis shows that SwInt achieves, on average, 62% and 64% improvement in power consumption under, respectively, unicast and broadcast communication, resulting in 59.7% energy-efficiency improvement compared to the state-of-the-art silicon photonic interposers specifically designed for ML accelerators.

机器学习（ML）应用需求的激增，凸显了对高效 ML 加速器的迫切需求，这种加速器必须能够满足日益复杂的 ML 模型的计算和能源需求。然而，在单个芯片上进行大规模 ML 加速器的传统单片设计往往需要高昂的制造成本。为了应对这一挑战，本文提出了一种基于硅光子插层的 2.5D 芯片架构（称为 SwInt），可在插层上实现高带宽、低延迟和高能效的数据移动，适用于 ML 应用。现有的硅光子插层实施方案主要依赖基于总线的通信，其低效的网络设计导致功耗居高不下。基于总线的通信不可扩展，因为当每个波导的带宽（即波分复用度）增加或 ML 加速器的处理元件数量增加时，读写器上的累积损耗会导致光激光器的高功耗。SwInt 采用了一种基于开关的新型网络，使用基于马赫-泽恩德干涉仪（MZI）的开关单元进行设计，以提供可扩展的层间通信并降低功耗。所设计的开关架构采用高效设计，避免了阻塞，同时最大限度地减少了级数，从而实现了低损耗开关。此外，MZI 交换单元还设计了一种分割状态，从而实现了高能效的插接器广播通信，并支持 ML 加速器中的广播需求。此外，我们还优化并制造了硅光子器件、微oring 谐振器 (MRR) 和 MZI，它们是我们网络架构不可或缺的组成部分。我们的分析表明，与专为 ML 加速器设计的最先进的硅光子中间件相比，SwInt 在单播和广播通信下的功耗平均分别提高了 62% 和 64%，能效提高了 59.7%。

{"title":"SwInt: A Non-Blocking Switch-Based Silicon Photonic Interposer Network for 2.5D Machine Learning Accelerators","authors":"Ebadollah Taheri;Mohammad Amin Mahdian;Sudeep Pasricha;Mahdi Nikdast","doi":"10.1109/JETCAS.2024.3429354","DOIUrl":"10.1109/JETCAS.2024.3429354","url":null,"abstract":"The surging demand for machine learning (ML) applications has emphasized the pressing need for efficient ML accelerators capable of addressing the computational and energy demands of increasingly complex ML models. However, the conventional monolithic design of large-scale ML accelerators on a single chip often entails prohibitively high fabrication costs. To address this challenge, this paper proposes a 2.5D chiplet-based architecture based on a silicon photonic interposer, called SwInt, to enable high bandwidth, low latency, and energy-efficient data movement on the interposer, for ML applications. Existing silicon photonic interposer implementations suffer from high power consumption attributed to their inefficient network designs, primarily relying on bus-based communication. Bus-based communication is not scalable, as it suffers from high power consumption of the optical laser due to cumulative losses on the readers and writers when the bandwidth per waveguide (i.e., wavelength division multiplexing degree) increases or the number of processing elements in ML accelerators scales up. SwInt incorporates a novel switch-based network designed using Mach-Zehnder Interferometer (MZI)-based switch cells for offering scalable interposer communication and reducing power consumption. The designed switch architecture avoids blocking using an efficient design, while minimizing the number of stages to offer a low-loss switch. Furthermore, the MZI switch cells are designed with a dividing state, enabling energy-efficient broadcast communication over the interposer and supporting broadcasting demand in ML accelerators. Additionally, we optimized and fabricated silicon photonic devices, Microring Resonators (MRRs) and MZIs, which are integral components of our network architecture. Our analysis shows that SwInt achieves, on average, 62% and 64% improvement in power consumption under, respectively, unicast and broadcast communication, resulting in 59.7% energy-efficiency improvement compared to the state-of-the-art silicon photonic interposers specifically designed for ML accelerators.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"520-533"},"PeriodicalIF":3.7,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10599539","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141720510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

OpenPiton4HPC: Optimizing OpenPiton Toward High-Performance Manycores OpenPiton4HPC：优化 OpenPiton 以实现高性能 Manycores

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-15 DOI: 10.1109/JETCAS.2024.3428929

Neiel Leyva;Alireza Monemi;Noelia Oliete-Escuín;Guillem López-Paradís;Xabier Abancens;Jonathan Balkind;Enrique Vallejo;Miquel Moretó;Lluc Alvarez

In recent years, numerous multicore RISC-V platforms have emerged. Development frameworks such as OpenPiton are employed in designs that aim to scale to a large number of cores. While OpenPiton presents a large flexibility, supporting different requirements and processing cores, some of its design decisions result in designs that are not optimized for High-Performance Computing (HPC) requirements. This work presents OpenPiton4HPC, an extension and optimization of OpenPiton for high-performance manycores. The key contributions are enabling multiple memory controllers, supporting router bypassing and NoC concentration, adding support for configurable cache sizes and cache block sizes, and allowing configurable bus widths in the NoC and in the cache SRAMs. On a 64-core manycore architecture, these new features and optimizations provide a geometric mean speedup of 7.2x compared to the OpenPiton baseline.

近年来，出现了许多多核 RISC-V 平台。在旨在扩展到大量内核的设计中采用了 OpenPiton 等开发框架。虽然 OpenPiton 具有很大的灵活性，可支持不同的需求和处理内核，但其某些设计决策导致设计无法针对高性能计算（HPC）需求进行优化。这项工作介绍了 OpenPiton4HPC，它是 OpenPiton 针对高性能多核的扩展和优化。其主要贡献在于启用了多个内存控制器，支持路由器旁路和 NoC 集中，增加了对可配置缓存大小和缓存块大小的支持，并允许在 NoC 和缓存 SRAM 中配置总线宽度。在 64 核多核架构上，与 OpenPiton 基准相比，这些新功能和优化的几何平均速度提高了 7.2 倍。

引用次数: 0

Quantum Cryptanalysis of Affine Cipher 仿射密码的量子密码分析

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-15 DOI: 10.1109/JETCAS.2024.3428436

Mahima Mary Mathews;Panchami V;Vishnu Ajith

Quantum Algorithms reduce the computational complexity or solve certain difficult problems that were originally impossible to solve with classical computers. Grover’s search algorithm is a Quantum computation algorithm that can find target elements from a set of unstructured data with the best possible,

$O(sqrt {N})$

queries. Grover’s search Quantum circuits implemented accurately can be used to successfully search and find the keys of Symmetric ciphers. However, very few demonstrations of such practical cryptanalysis are available. In this paper, practical Quantum cryptanalysis circuits for Affine Cipher are proposed and demonstrated, that successfully break the cipher by finding the keys.

量子算法可降低计算复杂度，或解决某些原本无法用经典计算机解决的难题。格罗弗搜索算法是一种量子计算算法，它能从一组非结构化数据中找到目标元素，查询次数为 $O(sqrt {N})$。格罗弗搜索量子电路的精确实现可用于成功搜索和查找对称密码的密钥。然而，这种实用密码分析的演示却寥寥无几。本文提出并演示了针对仿射密码的实用量子密码分析电路，通过找到密钥成功破解密码。

引用次数: 0

HALO: Communication-Aware Heterogeneous 2.5-D System for Energy-Efficient LLM Execution at Edge HALO：在边缘执行高能效 LLM 的通信感知异构 2.5D 系统

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-12 DOI: 10.1109/JETCAS.2024.3427421

Abhi Jaiswal;K. C. Sharin Shahana;Sujitha Ravichandran;K. Adarsh;H. Bharath Bhat;Biresh Kumar Joardar;Sumit K. Mandal

Large Language Models (LLMs) are used to perform various tasks, especially in the domain of natural language processing (NLP). State-of-the-art LLMs consist of a large number of parameters that necessitate a high volume of computations. Currently, GPUs are the preferred choice of hardware platform to execute LLM inference. However, monolithic GPU-based systems executing large LLMs pose significant drawbacks in terms of fabrication cost and energy efficiency. In this work, we propose a heterogeneous 2.5D chiplet-based architecture for accelerating LLM inference. The proposed 2.5D system consists of heterogeneous chiplets connected via a network-on-package (NoP). In the proposed 2.5D system, we leverage the energy efficiency of in-memory computing (IMC) and the general-purpose computing capability of CMOS-based floating point units (FPUs). The 2.5D technology helps to integrate two different technologies (IMC and CMOS) on the same system. Due to a large number of parameters, communication between chiplets becomes a significant performance bottleneck if not optimized while executing LLMs. To this end, we propose a communication-aware scalable technique to map different pieces of computations of an LLM onto different chiplets. The proposed mapping technique minimizes the communication energy and latency over the NoP, and is significantly faster than existing optimization techniques. Thorough experimental evaluations with a wide variety of LLMs show that the proposed 2.5D system provides up to

$972times $

improvement in latency and

$1600times $

improvement in energy consumption with respect to state-of-the-art edge devices equipped with GPU.

大型语言模型（LLM）用于执行各种任务，尤其是在自然语言处理（NLP）领域。最先进的 LLM 包含大量参数，需要进行大量计算。目前，GPU 是执行 LLM 推理的首选硬件平台。然而，执行大型 LLM 的基于 GPU 的单片系统在制造成本和能效方面存在明显缺陷。在这项工作中，我们提出了一种基于 2.5D chiplet 的异构架构，用于加速 LLM 推断。所提出的 2.5D 系统由异构芯片组成，通过网络连接（NoP）。在拟议的 2.5D 系统中，我们充分利用了内存计算（IMC）的能效和基于 CMOS 的浮点运算单元（FPU）的通用计算能力。2.5D 技术有助于在同一系统上集成两种不同的技术（IMC 和 CMOS）。由于存在大量参数，如果在执行 LLM 时不对芯片间通信进行优化，那么芯片间通信就会成为一个重要的性能瓶颈。为此，我们提出了一种通信感知可扩展技术，将 LLM 的不同计算映射到不同的芯片上。所提出的映射技术最大限度地减少了 NoP 上的通信能量和延迟，速度明显快于现有的优化技术。对各种LLM进行的全面实验评估表明，与配备GPU的最先进边缘设备相比，所提出的2.5D系统在延迟和能耗方面分别提高了972倍和1600倍。

{"title":"HALO: Communication-Aware Heterogeneous 2.5-D System for Energy-Efficient LLM Execution at Edge","authors":"Abhi Jaiswal;K. C. Sharin Shahana;Sujitha Ravichandran;K. Adarsh;H. Bharath Bhat;Biresh Kumar Joardar;Sumit K. Mandal","doi":"10.1109/JETCAS.2024.3427421","DOIUrl":"10.1109/JETCAS.2024.3427421","url":null,"abstract":"Large Language Models (LLMs) are used to perform various tasks, especially in the domain of natural language processing (NLP). State-of-the-art LLMs consist of a large number of parameters that necessitate a high volume of computations. Currently, GPUs are the preferred choice of hardware platform to execute LLM inference. However, monolithic GPU-based systems executing large LLMs pose significant drawbacks in terms of fabrication cost and energy efficiency. In this work, we propose a heterogeneous 2.5D chiplet-based architecture for accelerating LLM inference. The proposed 2.5D system consists of heterogeneous chiplets connected via a network-on-package (NoP). In the proposed 2.5D system, we leverage the energy efficiency of in-memory computing (IMC) and the general-purpose computing capability of CMOS-based floating point units (FPUs). The 2.5D technology helps to integrate two different technologies (IMC and CMOS) on the same system. Due to a large number of parameters, communication between chiplets becomes a significant performance bottleneck if not optimized while executing LLMs. To this end, we propose a communication-aware scalable technique to map different pieces of computations of an LLM onto different chiplets. The proposed mapping technique minimizes the communication energy and latency over the NoP, and is significantly faster than existing optimization techniques. Thorough experimental evaluations with a wide variety of LLMs show that the proposed 2.5D system provides up to \u0000<inline-formula> <tex-math>$972times $ </tex-math></inline-formula>\u0000 improvement in latency and \u0000<inline-formula> <tex-math>$1600times $ </tex-math></inline-formula>\u0000 improvement in energy consumption with respect to state-of-the-art edge devices equipped with GPU.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"425-439"},"PeriodicalIF":3.7,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141610590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

On Fidelity-Oriented Entanglement Distribution for Quantum Switches 论量子开关面向保真度的纠缠分发

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-09 DOI: 10.1109/JETCAS.2024.3425712

Ziyue Jia;Lin Chen

We consider a star-shaped quantum network with a quantum switch in the center serving a number of requests, each characterized by two non-classical QoS requirements, the end-to-end entanglement delivery rate and the fidelity of the delivered entanglements. The central task of the switch is to allocate the limited entanglement resources among requests to maximize the system performance. We formulate the fundamental entanglement distribution problem where the switch decides 1) which requests to admit, and 2) as multiple requests may share a same quantum link, how to distributed the limited link-level entanglement resources among those competing requests. We then design a framework of joint entanglement purification scheduling and distribution for quantum switches. Our entanglement purification scheduling algorithm seeks to use minimal link-level entanglement resources to satisfy the QoS requirement of a single request. Our entanglement distribution algorithm further allocates the limited entanglement resources among multiple requests to maximize the overall utility by integrating the designed entanglement purification scheduling algorithm. We establish theoretical performance guarantee of our proposition, which is complemented by extensive numerical experiments demonstrating its effectiveness in a variety of network settings.

我们考虑了一个星形量子网络，其中心有一个量子交换机，为多个请求提供服务，每个请求都有两个非经典的服务质量要求，即端到端纠缠交付率和交付纠缠的保真度。交换机的核心任务是在请求之间分配有限的纠缠资源，以最大限度地提高系统性能。我们提出了基本的纠缠分配问题，即交换机决定：1）接纳哪些请求；2）由于多个请求可能共享同一量子链路，如何在这些相互竞争的请求之间分配有限的链路级纠缠资源。然后，我们为量子交换机设计了一个联合纠缠净化调度和分配框架。我们的纠缠净化调度算法力求使用最少的链路级纠缠资源来满足单个请求的 QoS 要求。我们的纠缠分配算法通过整合所设计的纠缠净化调度算法，进一步在多个请求之间分配有限的纠缠资源，以实现整体效用最大化。我们建立了我们的理论性能保证，并通过大量数值实验证明了它在各种网络环境中的有效性。

{"title":"On Fidelity-Oriented Entanglement Distribution for Quantum Switches","authors":"Ziyue Jia;Lin Chen","doi":"10.1109/JETCAS.2024.3425712","DOIUrl":"10.1109/JETCAS.2024.3425712","url":null,"abstract":"We consider a star-shaped quantum network with a quantum switch in the center serving a number of requests, each characterized by two non-classical QoS requirements, the end-to-end entanglement delivery rate and the fidelity of the delivered entanglements. The central task of the switch is to allocate the limited entanglement resources among requests to maximize the system performance. We formulate the fundamental entanglement distribution problem where the switch decides 1) which requests to admit, and 2) as multiple requests may share a same quantum link, how to distributed the limited link-level entanglement resources among those competing requests. We then design a framework of joint entanglement purification scheduling and distribution for quantum switches. Our entanglement purification scheduling algorithm seeks to use minimal link-level entanglement resources to satisfy the QoS requirement of a single request. Our entanglement distribution algorithm further allocates the limited entanglement resources among multiple requests to maximize the overall utility by integrating the designed entanglement purification scheduling algorithm. We establish theoretical performance guarantee of our proposition, which is complemented by extensive numerical experiments demonstrating its effectiveness in a variety of network settings.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"495-506"},"PeriodicalIF":3.7,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141567373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Delay-Constrained GNR Routing With CNT-Via Insertion in Nano-Scale Designs 纳米级设计中插入 CNT-Via 的延迟受限 GNR 路由选择

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-07-05 DOI: 10.1109/JETCAS.2024.3424217

Jin-Tai Yan

It is well known that graphene nanoribbon (GNR) can be used as interconnects in nano-scale designs. In this paper, given a set of delay-constrained GNR nets in a multiple-layer routing plane, based on the construction of a combined carbon nanotube (CNT)/graphene hetero-structure for CNT-vias between two adjacent layers, an efficient routing algorithm can be proposed to minimize the number of the used layers with satisfying the non-crossing constraints between two GNR nets and the delay constraints on the GNR nets in GNR routing with CNT-via insertion. In the initial assignment, based on the definition of the delay-constrained routing pattern on a GNR net with tight delay constraint and the delay-constrained via path on a GNR net, the delay-constrained routing patterns can be firstly assigned for layer minimization and the delay-driven minimum-length routing paths and the delay-constrained via paths can be further assigned onto the available layers. In the iterative routing, the unrouted GNR nets can be further routed on the available layers and some possible new layers by using one iterative maze-routing and rip-up-and-rerouting process. Compared with the published routing algorithms with no via insertion, the experimental results show that our proposed routing algorithm with CNT-via insertion can insert some CNT-vias and use shorter wirelength to decrease 53.8% and 24.9% of the number of the used layer under reasonable CPU time on the given GNR nets with two different sets of the delay constraints for 8 tested examples on the average, respectively.

众所周知，石墨烯纳米带（GNR）可用作纳米级设计中的互连器件。本文在给定多层路由平面中一组延迟受限的 GNR 网的基础上，通过构建碳纳米管（CNT）/石墨烯组合异质结构来实现相邻两层之间的 CNT 通孔，提出了一种高效的路由算法，以在满足两个 GNR 网之间不交叉约束和带有 CNT 通孔插入的 GNR 路由中 GNR 网的延迟约束的前提下，最大限度地减少所用层数。在初始分配中，根据具有严格延迟约束的 GNR 网的延迟约束路由模式和 GNR 网的延迟约束通路的定义，首先分配延迟约束路由模式以最小化层，然后将延迟驱动的最小长度路由通路和延迟约束通路进一步分配到可用层上。在迭代路由过程中，未路由的 GNR 网可以通过一次迭代迷宫路由和撕裂-向上-路由过程在可用层和一些可能的新层上进一步路由。实验结果表明，与已发表的不插入通孔的路由算法相比，我们提出的插入 CNT 通孔的路由算法可以插入一些 CNT 通孔，并使用更短的线长，在合理的 CPU 时间内，对给定的 GNR 网，在两组不同的延迟约束下，8 个测试实例的平均使用层数分别减少了 53.8% 和 24.9%。

{"title":"Delay-Constrained GNR Routing With CNT-Via Insertion in Nano-Scale Designs","authors":"Jin-Tai Yan","doi":"10.1109/JETCAS.2024.3424217","DOIUrl":"10.1109/JETCAS.2024.3424217","url":null,"abstract":"It is well known that graphene nanoribbon (GNR) can be used as interconnects in nano-scale designs. In this paper, given a set of delay-constrained GNR nets in a multiple-layer routing plane, based on the construction of a combined carbon nanotube (CNT)/graphene hetero-structure for CNT-vias between two adjacent layers, an efficient routing algorithm can be proposed to minimize the number of the used layers with satisfying the non-crossing constraints between two GNR nets and the delay constraints on the GNR nets in GNR routing with CNT-via insertion. In the initial assignment, based on the definition of the delay-constrained routing pattern on a GNR net with tight delay constraint and the delay-constrained via path on a GNR net, the delay-constrained routing patterns can be firstly assigned for layer minimization and the delay-driven minimum-length routing paths and the delay-constrained via paths can be further assigned onto the available layers. In the iterative routing, the unrouted GNR nets can be further routed on the available layers and some possible new layers by using one iterative maze-routing and rip-up-and-rerouting process. Compared with the published routing algorithms with no via insertion, the experimental results show that our proposed routing algorithm with CNT-via insertion can insert some CNT-vias and use shorter wirelength to decrease 53.8% and 24.9% of the number of the used layer under reasonable CPU time on the given GNR nets with two different sets of the delay constraints for 8 tested examples on the average, respectively.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"371-383"},"PeriodicalIF":3.7,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141567377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0