Latest articles from IEEE Transactions on Very Large Scale Integration (VLSI) Systems

An Injection-Locked and Sub-Sampling Clock Multiplier With a Two-Step SC DAC Achieving 2.67% Jitter Variation
IF 2.8; Zone 2 (Engineering & Technology); Q2 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-26; DOI: 10.1109/TVLSI.2024.3417015
Qifeng Huang;Siji Huang;Yanhang Chen;Yifei Fan;Jie Yuan
This article presents an injection-locked clock multiplier (ILCM) using a digitally controlled frequency-tracking loop (FTL) with an integral two-step switched-capacitor (SC) digital-to-analog converter (DAC). Conventionally, the DAC resolution needs to be increased for low noise at the cost of degraded monotonicity due to device mismatch. To overcome this tradeoff, the proposed DAC utilizes the SC technique to achieve fine steps. With only two capacitors involved in charge transfer, the DAC is inherently monotonic, avoiding the boundary-crossing issue and the mismatch calibration. A control-voltage-tracking loop (CVTL) further suppresses the quantization noise by balancing the up and down step sizes and helps achieve a 16-bit-level voltage step. The FTL is sub-sampling and utilizes a bang-bang phase detector (BBPD). Locking at 700 MHz, the ILCM achieves a 0.9-ps integrated jitter, a -125-dBc/Hz phase noise at a 1-MHz offset, and a small jitter variation of 2.67% under different supply voltages and temperatures. With FTL, the spur is around -56 dBc from the prototype fabricated in a 180-nm CMOS process. The chip occupies a core area of 0.054 mm² and consumes 689 μW from a 1.8-V supply, achieving an FoM of -242.5 dB.
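The reported FoM can be cross-checked from the jitter and power figures. The abstract does not state which FoM definition is used; the sketch below assumes the jitter-power figure of merit commonly quoted for clock generators, FoM = 10·log10((σ / 1 s)² · (P / 1 mW)). The function name is illustrative only.

```python
import math


def jitter_fom_db(rms_jitter_s: float, power_w: float) -> float:
    """Jitter-power FoM in dB (assumed definition, not stated in the abstract):
    10 * log10((sigma / 1 s)^2 * (P / 1 mW))."""
    return 10 * math.log10(rms_jitter_s ** 2 * (power_w / 1e-3))


# Values taken from the abstract: 0.9 ps integrated jitter, 689 uW.
fom = jitter_fom_db(0.9e-12, 689e-6)
print(f"FoM = {fom:.1f} dB")
```

Under that assumed definition, the result agrees with the -242.5 dB quoted in the abstract.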
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 10, pp. 1841-1851.
Citations: 0
A Parallel Architecture and Implementation for Near-Lossless Hyperspectral Image Compression Based on CCSDS 123.0-B-2 With Scalable Data-Rate Performance
IF 2.8; Zone 2 (Engineering & Technology); Q2 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-26; DOI: 10.1109/TVLSI.2024.3415505
Panagiotis Chatziantoniou;Antonis Tsigkanos;Dimitris Theodoropoulos;Nektarios Kranitis;Antonis Paschalis
Hyperspectral and multispectral imaging maintains a crucial role in remote sensing technology for Earth observation missions. However, the huge volume of produced data requires compression for storage and downlink transmission. In 2019, the Consultative Committee for Space Data Systems (CCSDS) released the CCSDS 123.0-B-2 recommended standard, allowing near-lossless compression, through a closed-loop quantizer, by introducing a Hybrid Entropy Coder option. However, the in-loop quantizer introduced additional data dependencies constituting a throughput performance bottleneck. This contribution addresses the need for high data-rate on-board compression by presenting an efficient parallel architecture and hardware implementation based on CCSDS 123.0-B-2. It bypasses the throughput performance bottleneck with an external, hardware-efficient quantizer while maintaining competitive quality near-lossless functionality with compatibility to the CCSDS standard. The parallel architecture leverages segmentation along the X-axis of the spectral cube, enabling scalable data-rate performance with constant embedded memory footprint. The introduced architecture is implemented in VHSIC hardware description language (VHDL) indicatively targeting Xilinx Kintex UltraScale technology, validated and demonstrated using state-of-the-art SpaceFibre serial link interface IP Cores and test equipment, achieving very high code coverage. A single hyperspectral compression engine (HCE) achieves throughput performance of 285 MSamples/s (4.56 Gb/s) at 1.68 W, while six parallel HCEs reach 1590 MSamples/s (25.44 Gb/s) at 6.12 W, measured on a full breadboard system. Maximum performance only depends on image dimensions, available field programmable gate array (FPGA) resources and high-speed serial interface technology. 
To the best of our knowledge, this implementation achieves the highest data-rate performance for near-lossless compression based on CCSDS 123.0-B-2 implemented in FPGA technology suitable for next-generation institutional missions.
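The throughput and power figures above imply a few derived quantities that the abstract does not state directly. The following is a back-of-the-envelope sketch; the helper names are hypothetical, and the derived values (bits per sample, energy per sample, scaling efficiency) are computed here rather than quoted from the paper.

```python
def bits_per_sample(gbit_per_s: float, msamples_per_s: float) -> float:
    """Bits carried per sample, from a bit rate and a sample rate."""
    return (gbit_per_s * 1e9) / (msamples_per_s * 1e6)


def energy_per_sample_nj(power_w: float, msamples_per_s: float) -> float:
    """Energy spent per sample in nanojoules."""
    return power_w / (msamples_per_s * 1e6) * 1e9


# Figures from the abstract: one HCE versus six parallel HCEs.
single = {"msps": 285, "gbps": 4.56, "watts": 1.68}
six = {"msps": 1590, "gbps": 25.44, "watts": 6.12}

for name, cfg in (("1 HCE", single), ("6 HCEs", six)):
    bps = bits_per_sample(cfg["gbps"], cfg["msps"])
    nj = energy_per_sample_nj(cfg["watts"], cfg["msps"])
    print(f"{name}: {bps:.1f} bits/sample, {nj:.2f} nJ/sample")

# Throughput of six engines relative to six times a single engine.
scaling_efficiency = six["msps"] / (6 * single["msps"])
print(f"scaling efficiency: {scaling_efficiency:.1%}")
```

Both configurations work out to 16 bits per sample, and the six-engine system spends less energy per sample than the single engine while reaching roughly 93% of ideal linear scaling.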
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 9, pp. 1616-1629.
Citations: 0
Deep Reinforcement Learning-Based Power Management for Chiplet-Based Multicore Systems
IF 2.8; Zone 2 (Engineering & Technology); Q2 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-26; DOI: 10.1109/TVLSI.2024.3415487
Xiao Li;Lin Chen;Shixi Chen;Fan Jiang;Chengeng Li;Wei Zhang;Jiang Xu
Chiplet technology has emerged as a promising solution to address the increasing demand for high-performance computing in light of the slowdown of Moore’s law. While chiplet-based multicore systems offer higher performance through heterogeneous integration, they also pose challenges for power delivery system (PDS) design. The integration of additional vertical and inter-chiplet connections, along with higher power density, imposes stringent requirements on power delivery. Moreover, PDS efficiency is affected by workload variations at runtime, making it necessary to design and manage PDSs and processors as a whole to improve system energy efficiency while balancing performance. In this article, we propose an offline-online co-design optimization methodology that combines offline PDS design optimization with online power management. To address the power consumption and delivery mismatch, we introduce a centralized deep Q-network (DQN)-based online control scheme for power co-management in chiplet-based multicore systems. By carefully designing the state space and reward functions, our approach achieves workload-aware adaptive control to reduce the energy-delay-product (EDP) while maintaining PDS efficiency under a given performance target (PT). We conduct evaluations on realistic applications to validate the effectiveness of our approach. For 64-core systems, our method achieves an average EDP reduction of 67% while meeting a 90% PT, surpassing state-of-the-art modular Q-learning (MQL)-based and heuristic-based approaches by up to 4% and 16%, respectively. Additionally, our approach demonstrates wiser action selection policies, higher control stability, and lower implementation overhead compared to the MQL-based approach.
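To make the optimization target concrete, the sketch below shows how an EDP-based reward with a performance-target constraint might look. It is not the reward function from the paper, which the abstract does not describe; the function names, the penalty term, and all numbers are hypothetical and chosen only for illustration.

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-delay product: energy multiplied by execution time."""
    return energy_j * delay_s


def reward(energy_j: float, delay_s: float,
           baseline_energy_j: float, baseline_delay_s: float,
           performance_target: float = 0.9, penalty: float = 5.0) -> float:
    """Hypothetical reward: fractional EDP reduction versus a baseline,
    minus a penalty when relative performance drops below the target."""
    relative_perf = baseline_delay_s / delay_s  # 1.0 means baseline speed
    edp_reduction = 1.0 - edp(energy_j, delay_s) / edp(baseline_energy_j,
                                                       baseline_delay_s)
    shortfall = max(0.0, performance_target - relative_perf)
    return edp_reduction - penalty * shortfall


# Made-up operating points against a 10 J, 1 s baseline.
meets_target = reward(4.0, 1.05, 10.0, 1.0)   # ~95% performance
misses_target = reward(2.0, 1.25, 10.0, 1.0)  # 80% performance
print(f"meets target:  {meets_target:.2f}")
print(f"misses target: {misses_target:.2f}")
```

In this toy setup the second operating point has the lower EDP but receives the lower reward because it violates the 90% performance target, which is the kind of tradeoff a constrained controller has to encode.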
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 9, pp. 1726-1739.
Citations: 0
ABS: Accumulation Bit-Width Scaling Method for Designing Low-Precision Tensor Core
IF 2.8; Zone 2 (Engineering & Technology); Q2 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-25; DOI: 10.1109/TVLSI.2024.3414260
Yasong Cao;Mei Wen;Zhongdi Luo;Xin Ju;Haolan Huang;Junzhong Shen;Haiyan Chen
A large gap exists between deep neural network (DNN) applications’ computational demand and the computing power of DNN accelerators. Low-precision floating-point (LP-FP) computation is one of the important means to improve the performance of DNN training and inference. However, high-precision accumulators are typically used to sum the dot products during general matrix multiplication (GEMM) in tensor cores (TCs). As the precision of data decreases, the accumulator becomes the main consumer of the multiply-accumulate (MAC) unit’s area and power. Reducing the accumulators’ bit-width is therefore important for improving the area- and energy-efficiency of TCs. There are two main challenges: 1) theoretical support on the floating-point (FP) formats with the lowest bit-width of TC’s accumulators and 2) how to integrate the LP-FP TC in the framework of DNN training and inference to evaluate its benefits. In this article, we propose accumulation bit-width scaling (ABS), a novel method to guide the design of LP-FP TCs. We 1) implement this method by constructing a novel variance retention ratio (VRR) model to predict the FP format with the minimum bit-width for TC’s accumulator; 2) provide a generator of DNN accelerator based on a systolic-array (SA) TC, supporting many low-precision configurations; and 3) design an LP-FP DNN executing framework that supports software-simulation mode and hardware-accelerator mode to run LP-FP DNN tasks. The experimental results show that the LP-FP TC guided by our ABS method has a maximum reduction of 76.47% and 75.60% in area and power consumption, respectively, compared with the advanced TCs.
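The reason accumulator width cannot be reduced arbitrarily is that a narrow floating-point accumulator eventually stops absorbing small addends. The sketch below illustrates this effect with an IEEE 754 half-precision accumulator emulated through Python's `struct` module. It is only a demonstration of the underlying problem, not the VRR model or the ABS method from the paper; the helper names are made up for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate_fp16(values) -> float:
    """Sum values with an accumulator that is rounded to fp16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


ones = [1.0] * 4096
print("fp16 accumulator:", accumulate_fp16(ones))  # stalls at 2048.0
print("wide accumulator:", sum(ones))              # 4096.0
```

Half precision has an 11-bit significand, so above 2048 the spacing between representable values is 2 and adding 1.0 rounds back to the same value. A model that predicts the minimum safe accumulator format has to account for this kind of information loss as the accumulation length grows.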
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 9, pp. 1590-1601.
Citations: 0
A 28-nm Dual-Mode Explicit Class-F₂₃ VCO With Low-Loss CM Return Path Achieving 70–400-kHz 1/f³ PN Corner Over 4.9–7.3-GHz TR
IF 2.8; Zone 2 (Engineering & Technology); Q2 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-25; DOI: 10.1109/TVLSI.2024.3414158
Shan Lu;Danyu Wu;Xuan Guo;Hanbo Jia;Yong Chen;Xinyu Liu
This brief presents an explicit Class-F₂₃ voltage-controlled oscillator (VCO). The square-like voltage waveform is obtained via waveform shaping, and flicker noise upconversion is suppressed by a proper common-mode (CM) return path. CM resonance at the second harmonic frequency is introduced by a compact octagonal inductor. The rms value of the impulse sensitivity function (ISF) is significantly reduced through Class-F₂₃ operation. The VCO switches between two modes of a high-order LC resonator consisting of two identical LC tanks coupled by capacitors. A prototype of the VCO is implemented in a 28-nm CMOS. Measurements show a continuous tuning range (TR) of 4.89–7.29 GHz, with a peak figure of merit (FoM) of 190.5 dB/Hz at 5.8 GHz and better than 188.5 dB across the entire TR. The flicker phase-noise corner ranges from 70 to 400 kHz. The VCO consumes 16–19 mW from a 0.5-V supply and occupies an active area of 0.21 mm².
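The tuning range is given above as absolute frequencies; it is often also quoted as a percentage of the center frequency. The sketch below performs that conversion. The percentage is derived here from the measured band edges and is not a figure quoted in the abstract; the function name is illustrative.

```python
def tuning_range_percent(f_min_ghz: float, f_max_ghz: float) -> float:
    """Fractional tuning range relative to the center frequency, in percent."""
    center = (f_min_ghz + f_max_ghz) / 2
    return (f_max_ghz - f_min_ghz) / center * 100


# Band edges from the abstract: 4.89 GHz to 7.29 GHz.
tr = tuning_range_percent(4.89, 7.29)
print(f"tuning range: {tr:.1f}% around {(4.89 + 7.29) / 2:.2f} GHz")
```

With these band edges the span is 2.4 GHz around a 6.09 GHz center, or about 39% fractional tuning range.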
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 9, pp. 1749-1753.
Citations: 0
1.63 pJ/SOP Neuromorphic Processor With Integrated Partial Sum Routers for In-Network Computing
IF 2.8; Zone 2 (Engineering & Technology); Q2 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-24; DOI: 10.1109/TVLSI.2024.3409652
Dongrui Li;Ming Ming Wong;Yi Sheng Chong;Jun Zhou;Mohit Upadhyay;Ananta Balaji;Aarthy Mani;Weng Fai Wong;Li Shiuan Peh;Anh Tuan Do;Bo Wang
Neuromorphic computing promises to achieve unprecedented energy efficiency by emulating the human brain’s mechanism. Conventional neuromorphic accelerators employ a split-and-merge method to map spiking neural network inputs that exceed the fan-in capability of a single neuron core. However, this approach risks compromising accuracy and requires extra cores for the merging process. Moreover, it requires excessive data movement and clock cycles to aggregate spikes generated by partial sums instead of total sums obtained from different cores, with substantial power and energy overhead. This work presents a novel approach to addressing the challenges imposed by the split-and-merge method. We propose an energy-efficient, reconfigurable neuromorphic processor that leverages several key techniques to mitigate the above issues. First, we introduce a partial sum router circuitry that enables in-network computing (INC), eliminating the need for extra merge cores. Second, we adopt software-defined Networks-on-Chip (NoCs) by leveraging predefined, efficient routing, eliminating power-hungry routing computation. Finally, we incorporate fine-grained power gating and clock gating techniques for further power reduction. Experimental results from our test chip demonstrate the lossless mapping of the algorithm and exceptional energy efficiency, achieving an energy consumption of 1.63 pJ/SOP at 0.48 V. This energy efficiency represents a 22.4% improvement compared to the state-of-the-art results. Our proposed neuromorphic processor provides an efficient and flexible solution for neural network processing, mitigating the limitations of the traditional split-and-merge approach while delivering superior energy efficiency.
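The accuracy risk of split-and-merge can be seen in a toy example. The sketch below uses a deliberately simplified model that is an assumption of this example, not the chip's actual neuron or router: each core thresholds its own partial sum, and a merge stage fires if any core fired. The alternative adds the partial sums first and thresholds once, which is the behavior that routing partial sums through the network makes possible.

```python
def chunks(values, size):
    """Split a list of weighted inputs into per-core groups of at most `size`."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def split_and_merge_fires(weighted_inputs, fan_in, threshold) -> bool:
    """Simplified split-and-merge: each core thresholds its own partial sum,
    and the merge stage fires if any core produced a spike."""
    partial_spikes = [sum(c) >= threshold for c in chunks(weighted_inputs, fan_in)]
    return any(partial_spikes)


def partial_sum_routing_fires(weighted_inputs, fan_in, threshold) -> bool:
    """Partial sums are added across cores before a single threshold check."""
    total = sum(sum(c) for c in chunks(weighted_inputs, fan_in))
    return total >= threshold


inputs = [0.3, 0.3, 0.3, 0.3]  # four weighted inputs, core fan-in of two
print("split-and-merge fires:    ", split_and_merge_fires(inputs, 2, 1.0))
print("partial-sum routing fires:", partial_sum_routing_fires(inputs, 2, 1.0))
```

Here each core sees a partial sum of 0.6, below the threshold of 1.0, so the split-and-merge model stays silent even though the full sum of 1.2 should trigger a spike. Summing before thresholding reproduces the unsplit neuron's behavior.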
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 11, pp. 2085-2092.
Citations: 0
A 206 μW Vital Signs Monitoring System on Chip for Measuring Five Vitals
IF 2.8; Zone 2 (Engineering & Technology); Q2 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-24; DOI: 10.1109/TVLSI.2024.3415469
Sameen Minto;Austin Cable;Wala Saadeh
This article presents an area- and power-efficient system-on-chip (SoC) for vital signs monitoring to provide patients with remote monitoring. It measures five important vitals including blood oxygen saturation (SpO2), respiration rate (RR), heart rate (HR), HR variability (HRV), and temperature. The proposed SoC utilizes a photoplethysmography (PPG) signal to compute HR, HRV, SpO2, and RR. The PPG signal is amplified and filtered using a PPG readout that includes a transimpedance amplifier (TIA) with a switched integrator (SI) to filter and amplify the signal. A differential second-order, delta-sigma analog-to-digital converter (ΔΣ-ADC) is adopted to digitize the PPG signal. The SoC also comprises a low-power LED driver for both red and infrared (IR) LEDs which operate in pulsed mode with a 0.625% duty cycle. A vital signs extractor performs feature extraction (FE) and computes the vital signs with a maximum absolute error of less than 1%. In this work, the temperature is also measured by employing a Wheatstone bridge (WhB)-based temperature sensor which integrates thermal resistors into a second-order ΔΣ-ADC. The proposed system shares the ΔΣ-ADC for digitizing the PPG signal and the temperature readings to reduce both area and power consumption. The proposed system computes the temperature over the human’s temperature range (32 °C to 42 °C) with an accuracy of ±0.09 °C. The SoC is implemented using a 180 nm CMOS process with an area of 4.8 mm² while consuming 206 μW.
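As a rough illustration of what a vital signs extractor computes from PPG features, the sketch below derives HR and one HRV metric from beat timestamps. The abstract does not say which HRV metric or peak-detection method the chip uses; RMSSD is chosen here as an assumption, and the function names and timestamps are invented for the example.

```python
import math


def inter_beat_intervals(peak_times_s):
    """Intervals between successive PPG peaks, in seconds."""
    return [b - a for a, b in zip(peak_times_s, peak_times_s[1:])]


def heart_rate_bpm(peak_times_s) -> float:
    """Mean heart rate in beats per minute from peak timestamps."""
    ibis = inter_beat_intervals(peak_times_s)
    return 60.0 / (sum(ibis) / len(ibis))


def rmssd_ms(peak_times_s) -> float:
    """RMSSD (assumed HRV metric): root mean square of successive
    inter-beat-interval differences, in milliseconds."""
    ibis = inter_beat_intervals(peak_times_s)
    diffs_ms = [(b - a) * 1000 for a, b in zip(ibis, ibis[1:])]
    return math.sqrt(sum(d * d for d in diffs_ms) / len(diffs_ms))


# Made-up peak timestamps in seconds, for illustration only.
peaks = [0.0, 0.80, 1.62, 2.40, 3.20]
print(f"HR  = {heart_rate_bpm(peaks):.1f} bpm")
print(f"HRV = {rmssd_ms(peaks):.1f} ms (RMSSD)")
```

For these timestamps the intervals average 0.8 s, giving 75 bpm, and the successive differences of 20, -40, and 20 ms give an RMSSD of about 28 ms.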
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 9, pp. 1652-1660.
Citations: 0
VLSI Design of Light-Field Factorization for Dual-Layer Factored Display
IF 2.8 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-06-24 DOI: 10.1109/TVLSI.2024.3414262
Li-De Chen;Li-Qun Weng;Hao-Chien Cheng;An-Yu Cheng;Kai-Ping Lin;Chao-Tsung Huang
This article introduces a VLSI design for light-field factorization, aimed at enhancing immersive 3-D visual experiences for computational light-field factored displays. The main design challenges are intensive memory-access demands and high computational complexity. Accordingly, we first propose half-block-based factorization (HBBF) and sparse ray sampling (SRS) to reduce DRAM bandwidth by 99% and SRAM size by 74%. Then, we devise integer hybrid quantization (INTH) to cut down computational logic by 41%, leading to improvements in die area and power efficiency. Finally, we fabricated a processor chip that incorporates 75.1 kB of SRAM and 5.9M logic gates using 40-nm CMOS technology. It can operate with three different performance modes: high quality (56.9 MPixel/s at 971 mW), balanced (62.5 MPixel/s at 442 mW), and low power (61.7 MPixel/s at 283 mW). Across these modes, its normalized energy ranges between 4.4 and 16.2 nJ/pixel. This implementation surpasses existing GPU platforms and offers an 85× increase in processing speed and a 311× reduction in power consumption. We also showcase a real-time computational 3-D display system with this chip, demonstrating its practical efficacy in computational 3-D display technology.
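As a rough aid for reading the three operating points above, the following sketch computes the raw energy per pixel as power divided by throughput. It is only an illustration: the function and dictionary names are hypothetical, and the raw figures it produces are not the normalized 4.4 to 16.2 nJ/pixel range quoted in the abstract, whose normalization method is not stated there.

```python
# Operating points as reported in the abstract: (throughput in MPixel/s, power in mW).
MODES = {
    "high quality": (56.9, 971.0),
    "balanced": (62.5, 442.0),
    "low power": (61.7, 283.0),
}


def energy_per_pixel_nj(mpixel_per_s: float, power_mw: float) -> float:
    """Return raw (un-normalized) energy per pixel in nJ.

    mW / (MPixel/s) = 1e-3 W / (1e6 pixel/s) = 1e-9 J/pixel, i.e. nJ/pixel,
    so the unit prefixes cancel and a plain division is enough.
    """
    return power_mw / mpixel_per_s


for name, (rate, power) in MODES.items():
    print(f"{name}: {energy_per_pixel_nj(rate, power):.2f} nJ/pixel (raw)")
```

Because the prefixes cancel, the same division works for any pair given in mW and MPixel/s; any technology or resolution normalization would have to be applied on top, using parameters the abstract does not provide.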
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 11, pp. 2093-2106.
Citations: 0
Enabling Efficient Hybrid Systolic Computation in Shared-L1-Memory Manycore Clusters
IF 2.8 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-06-24 DOI: 10.1109/TVLSI.2024.3415486
Sergio Mazzola;Samuel Riedel;Luca Benini
Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid architectures and complex programming models, the second are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient RISC-V cores act as the systolic array’s processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster’s shared memory. We introduce two low-overhead RISC-V instruction set architecture (ISA) extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture’s trade-offs on several digital signal processing (DSP) kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool’s compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80 V/25 °C), in a 22-nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
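The idea of processing elements exchanging operands through queues can be illustrated with a toy software model. The sketch below is an assumption-laden simplification, not the authors' design: it uses Python `deque` objects as stand-ins for the shared-memory-mapped queues, does not model the Xqueue or QLR extensions, and all class and function names (`PE`, `systolic_matvec`) are hypothetical. It computes a matrix-vector product on a 1-D chain in which each PE keeps one row of the matrix, accumulates locally, and forwards each input operand to its neighbor.

```python
from collections import deque


class PE:
    """Toy processing element: holds one matrix row and a local accumulator."""

    def __init__(self, weights):
        self.weights = weights
        self.acc = 0
        self.consumed = 0  # number of operands popped from the input queue

    def step(self, in_q, out_q):
        """Run one cycle: pop an operand, multiply-accumulate, forward it."""
        if not in_q:
            return  # stall: nothing has arrived yet (or the stream is finished)
        x = in_q.popleft()
        self.acc += self.weights[self.consumed] * x
        self.consumed += 1
        if out_q is not None:
            out_q.append(x)  # pass the operand on to the next PE


def systolic_matvec(W, x):
    """Compute W @ x on a 1-D chain of PEs linked by FIFO queues."""
    if not W:
        return []
    pes = [PE(row) for row in W]
    queues = [deque() for _ in pes]  # queues[i] feeds pes[i]
    queues[0].extend(x)              # the input vector enters at the head
    while any(pe.consumed < len(x) for pe in pes):
        # Visit PEs from tail to head so an operand moves one hop per cycle.
        for i in reversed(range(len(pes))):
            out_q = queues[i + 1] if i + 1 < len(pes) else None
            pes[i].step(queues[i], out_q)
    return [pe.acc for pe in pes]


print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))
```

In this model every queue access is an explicit `popleft` or `append` call inside `step`, which loosely corresponds to the explicit communication instructions the abstract says QLRs are meant to remove.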
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 9, pp. 1602-1615.
Citations: 0
Analysis and Optimization of Sense-and-Set Piezoelectric Energy Harvesting Interface Circuits
IF 2.8 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-06-20 DOI: 10.1109/TVLSI.2024.3409668
Loai G. Salem
This article presents the modeling and optimization of a sense-and-set (SaS) rectifier. The basic equations governing the operation of a SaS rectifier are derived analytically using Laplace-transform techniques. An expression for the harvesting efficiency of a SaS rectifier is developed by evaluating the conduction and gate-drive losses as well as the output power of the rectifier. The derived expressions are then employed to locate the optimal design point of a SaS interface circuit. The proposed modeling approach reduces the required run time by more than 2000 times as compared to SPICE simulation without sacrificing accuracy. The following design parameters are determined for maximum efficiency: optimal relative size between the rectifier switches, total conductance of the rectifier, and sensing frequency. The close match between the theoretical expressions and circuit simulation results validates the proposed analysis.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 9, pp. 1630-1639.
Citations: 0