Pub Date : 2024-11-08 | DOI: 10.1109/TVLSI.2024.3486237
Eduardo Antonio César da Costa; Morgana Macedo Azevedo da Rosa
Cubic operations are among the most widely used arithmetic operations in applications that demand higher-order simultaneous operand computation, such as cryptography and bicubic polynomial interpolation. This article proposes a novel VLSI radix-$2^{m}$ cubic unit (RCU-$2^{m}$) capable of processing cubic operations on m bits simultaneously, with m values of 2 (RCU-4), 3 (RCU-8), and 4 (RCU-16). RCU-16 emerges as the most area-efficient configuration, surpassing RCU-8 and notably outperforming RCU-4. In the 8-bit scenario, RCU-16 achieves remarkable area savings, surpassing the cubic unit proposed in the literature by $11.58\times$. Across all configurations, RCU-$2^{m}$ consistently outperforms the automatically selected cube unit, with energy savings ranging from $1.04\times$ to $2\times$. In application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA)-based analyses, RCU-16 consistently exhibits superior performance in both area and energy savings compared with RCU-4, RCU-8, and solutions from the literature. These findings emphasize the importance of adopting radix-$2^{m}$ configurations, particularly RCU-16, for energy-constrained VLSI applications.
{"title":"RCU- 2m: A VLSI Radix- 2m Cubic Unit","authors":"Eduardo Antonio Ceśar da Costa;Morgana Macedo Azevedo da Rosa","doi":"10.1109/TVLSI.2024.3486237","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3486237","url":null,"abstract":"Cubic operations are among the most used arithmetic operations in many applications that demand higher order simultaneous operand computation, such as cryptography and bicubic polynomial interpolation. This article proposes a novel VLSI radix-<inline-formula> <tex-math>$2^{m}$ </tex-math></inline-formula> cubic unit (RCU-<inline-formula> <tex-math>$2^{m}$ </tex-math></inline-formula>) capable of processing cubic operations at m bits simultaneously, with m values of 2 (RCU-4), 3 (RCU-8), and 4 (RCU-16). RCU-16 emerges as the most area-efficient configuration, surpassing RCU-8 and notably outperforming RCU-4. In the 8-bit scenario, RCU-16 achieves remarkable area savings, surpassing the literature’s proposed cubic unit by <inline-formula> <tex-math>$11.58times $ </tex-math></inline-formula>. Across all configurations, RCU-<inline-formula> <tex-math>$2^{m}$ </tex-math></inline-formula> consistently outperforms the automatically selected cube unit, with energy savings ranging from <inline-formula> <tex-math>$1.04times $ </tex-math></inline-formula> to <inline-formula> <tex-math>$2times $ </tex-math></inline-formula>. In application specific integrated circuit (ASIC) and field-programmable gate array (FPGA)-based analyses, RCU-16 consistently exhibits superior performance in both area and energy savings compared with RCU-4, RCU-8, and solutions from the literature. These findings emphasize the importance of adopting radix-<inline-formula> <tex-math>$2^{m}$ </tex-math></inline-formula> configurations, particularly RCU-16, for optimal energy-constrained VLSI applications.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"733-745"},"PeriodicalIF":2.8,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-06 | DOI: 10.1109/TVLSI.2024.3486239
Atsutake Kosuge; Hirofumi Sumi; Naonobu Shimamoto; Yukinori Ochiai; Yurie Inoue; Hideharu Amano; Tohru Mogami; Yoshio Mita; Makoto Ikeda; Tadahiro Kuroda
Scaling to finer CMOS process nodes necessitates more masks, resulting in higher costs and extended turnaround times (TATs). High costs and long TATs have hindered researchers outside the field of integrated circuits, including those in medicine, physics, and other sciences, from prototyping their own chips, which has limited opportunities for diverse innovation in integrated circuits and for talent development. We have developed the Agile-X platform for low-cost, rapid manufacturing of systems-on-chip. Users can implement their own dedicated circuits with gate-array circuits on a base chip, which carries common intellectual properties (IPs) such as RISC-V CPUs, various IOs, and ADCs. The base chip is manufactured in a foundry up to the intermediate metal layers and shipped with metal deposition on its surface. By directly drawing wiring patterns on this base chip with a mask-less lithography system, custom chips can be manufactured on-site without masks. As this process only requires wiring and eliminates masks, production time is drastically reduced compared with traditional full-mask wafer processes and multiproject wafer (MPW) shuttles. Development and manufacturing costs for the base chip, including preintegrated IPs, are shared among all Agile-X users, which reduces both IP and base-chip wafer costs per user. We prototyped wafers using a 0.18-$\mu$m CMOS process and tested the proposed structured-ASIC platform and manufacturing process using mask-less lithography systems. The results indicate that the process from inputting GDS data to lithography and dry etching can be completed within 30 min, and custom application-specific integrated circuits (ASICs) can be manufactured within a day. Compared with full-mask wafer design and manufacturing, the manufacturing cost per chip, including IP costs, is reduced from 271,000 USD to 22 USD (to 1/12252 of the original), and the manufacturing period is reduced from 20 days to 30 min (to 1/960 of the original).
"Agile-X: A Structured-ASIC Created With a Mask-Less Lithography System Enabling Low-Cost and Agile Chip Fabrication," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 746–756.
Pub Date : 2024-11-06 | DOI: 10.1109/TVLSI.2024.3488042
Yong-Tai Chen; Yen-Ting Chiu; Hao-Jiun Tu; Chao-Tsung Huang
Computational imaging (CI) has advanced significantly due to the use of convolutional neural networks (CNNs). Its edge deployment relies on layer fusion to offload the heavy external memory access (EMA) of feature maps, which requires handling overlapped features either by reusing or by recomputing them. Depending on how the boundary-handling strategy is organized, the induced computing complexity and EMA can be optimized. However, state-of-the-art CI accelerators primarily apply homogeneous inference flows, which employ a single overlap-handling strategy throughout the fused layers, limiting their ability to balance computation and data access. In this article, we explore layer-wise optimization in fused-layer CNNs by exploiting hybrid-strategy inference flows and devising a corresponding computing architecture. We categorize layer-wise strategies and put forward a layer-wise hybrid inference flow (LHIF) to integrate their advantages, and we propose an optimization procedure that explicitly analyzes essential figures of merit (FoMs), including throughput, EMA, and energy efficiency. Furthermore, we develop a high-throughput accelerator, Falcon, to efficiently support LHIF under massive parallelism, especially with a time-division-multiplexing (TDM) buffer interface that enables seamless access to feature maps stored in an interleaved manner. Layout results show that the accelerator, delivering 41 TOPS with 1.5 MB of feature-map buffers, supports LHIF while increasing the die area by only 1.4% and power consumption by only 0.7%. Extensive simulations demonstrate the versatility of LHIF in working scenarios at the operational, design, and system levels. Compared with homogeneous inference flows, the proposed LHIF achieves Pareto optimality with up to $2.28\times$ higher throughput and $3.5\times$ lower EMA.
"Falcon: A Fused-Layer Accelerator With Layer-Wise Hybrid Inference Flow for Computational Imaging CNNs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 720–732.
Pub Date : 2024-11-05 | DOI: 10.1109/TVLSI.2024.3486332
Juyong Lee; Hayoung Lee; Sooryeong Lee; Sungho Kang
An algorithmic pattern generator (ALPG) is incorporated into automatic test equipment (ATE) because of the extensive number of test patterns required for memory testing. Since a shared-resource ALPG generates the test pattern using the same arithmetic instruction and timing across multiple input/output (I/O) pins, its maximum operating frequency is limited by the delay of the arithmetic operation. On the other hand, a per-pin ALPG can achieve high-speed operation by generating one bit of the test pattern for each I/O pin. However, its hardware cost is significantly higher because an individual instruction and pattern generator (PG) is needed for each I/O pin. To address these limitations, a cost-effective per-pin ALPG for high-speed memory testing is proposed. The proposed per-pin ALPG achieves high-speed operation, and the hardware resources for storing and decoding the instructions are shared among multiple I/O pins to reduce the hardware cost. The experimental results indicate that the proposed ALPG achieves a higher speed than the conventional per-pin ALPG at a hardware cost comparable to that of the conventional shared-resource ALPG.
{"title":"A Cost-Effective Per-Pin ALPG for High-Speed Memory Testing","authors":"Juyong Lee;Hayoung Lee;Sooryeong Lee;Sungho Kang","doi":"10.1109/TVLSI.2024.3486332","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3486332","url":null,"abstract":"An algorithmic pattern generator (ALPG) has been developed within automatic test equipment (ATE) due to the extensive number of test patterns required for testing the memories. Since shared-resource ALPG generates the test pattern using the same arithmetic instruction and timing across multiple input/output (I/O) pins, the maximum operating frequency is limited by the delay of the arithmetic operation. On the other hand, per-pin ALPG can achieve high-speed operations by generating one bit of the test pattern for each I/O pin. However, the hardware cost is significantly increased due to the need for individual instruction and pattern generator (PG) for each I/O pin. To address these limitations, a cost-effective per-pin ALPG for high-speed memory testing is proposed. The proposed per-pin ALPG can achieve high-speed operations, and the hardware resources for storing and decoding the instructions are shared among multiple I/O pins to reduce the hardware cost. The experimental results indicate that the proposed ALPG can achieve a higher speed than the conventional per-pin ALPG with a reasonable hardware cost comparable to the conventional shared-resource ALPG.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"867-871"},"PeriodicalIF":2.8,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-05 | DOI: 10.1109/TVLSI.2024.3486312
Jingqi Zhang; Zhiming Chen; Mingzhi Ma; Rongkun Jiang; An Wang; Weijiang Wang; Hua Dang
High-performance (HP) elliptic curve scalar multiplication (ECSM) hardware implementations are of significant importance for ensuring communication security in high-capacity and high-concurrence application scenarios. By analyzing the inherent priorities and parallelism in ECSMs, we propose a novel HP ECSM algorithm and a partially parallel inversion algorithm based on the interleaved mechanism. With two dedicated multipliers and one interleaved multiplier, we introduce a compact hardware scheduling scheme that completes each ECSM loop iteration in four clock cycles. The proposed HP ECSM architecture consists of two Karatsuba-Ofman multipliers (KOMs) and one classical multiplier (CM). The multiplexers and pipeline stages are meticulously designed to optimize the critical path (CP). The proposed architecture is implemented on a Virtex-7 field-programmable gate array (FPGA), and the throughput reaches 158.03, 138.23, and 117.50 Mbps over $\text{GF}(2^{163})$, $\text{GF}(2^{283})$, and $\text{GF}(2^{571})$ using 8762, 20451, and 41974 slices, respectively. Comparisons with recent existing works demonstrate that the performance and throughput of our design are among the best reported.
"High-Performance Elliptic Curve Scalar Multiplication Architecture Based on Interleaved Mechanism," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 757–770.
Pub Date : 2024-10-30 | DOI: 10.1109/TVLSI.2024.3477731
Zhaolin Yang; Jing Jin; Xiaoming Liu; Jianjun Zhou
A 0.2–2.6 GHz reconfigurable direct-conversion receiver is proposed in this article. The receiver's high-linearity mode and high-gain mode can be configured by either bypassing or including the low-noise amplifier (LNA) stage, and an agile-switching module is designed to facilitate the mode transition. In high-gain mode, a variable-gain current-reused shunt-feedback (VGCRSF) LNA with a radio frequency (RF) gain-adapted impedance matching technique is proposed. Instead of utilizing a shared transconductance (Gm) stage in both the I- and Q-paths, a Gm-separated IQ-leakage suppression (GSIQLS) structure is employed in the mixer stage to reduce the complex and frequency-dependent IQ mismatch engendered by nonideal local oscillator (LO) signal overlap. In the baseband, both the gain and the bandwidth are made configurable through a biquad low-pass filter (LPF) and a programmable gain amplifier (PGA). The proposed receiver is fabricated in a 40-nm CMOS technology. Measurement results indicate that a maximum conversion gain of 78.5 dB and a minimum noise figure (NF) of 2.5 dB are achieved. The input 1-dB compression point (IP1dB), in-band (IB) third-order input-referred intercept point (IIP3), and out-of-band (OOB) IIP3 are larger than 0, 9.7, and 13.1 dBm, respectively. The gain and phase mismatches of the quadrature receiver are lower than 0.3 dB and 1°, respectively, over the baseband bandwidth ranging from 410 kHz to 24 MHz. The receiver occupies an area of 0.605 mm² and consumes 75.4 mW.
{"title":"A 0.2–2.6 GHz Reconfigurable Receiver Using RF-Gain-Adapted Impedance Matching and Gm-Separated IQ-Leakage Suppression Structure in 40-nm CMOS","authors":"Zhaolin Yang;Jing Jin;Xiaoming Liu;Jianjun Zhou","doi":"10.1109/TVLSI.2024.3477731","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3477731","url":null,"abstract":"A 0.2–2.6 GHz reconfigurable direct conversion receiver is proposed in this article. The receiver’s high-linearity mode and high-gain mode can be configured by either bypassing or including the low-noise amplifier (LNA) stage. An agile-switching module is designed to facilitate the mode transitioning. In high-gain mode, a variable-gain current-reused shunt-feedback (VGCRSF) LNA with radio frequency (RF) gain-adapted impedance matching technique is proposed. Instead of utilizing a shared transconductance (Gm) stage in both the I- and Q-path, the Gm-separated IQ-leakage suppression (GSIQLS) structure is employed in the mixer stage to reduce the complex and frequency-dependent IQ mismatch engendered by the nonideal local oscillator (LO) signal overlap. In baseband, both the gain and the bandwidth are made configurable through the utilization of a bi-quad low pass filter (LPF) and a programmable gain amplifier (PGA). The proposed receiver is fabricated in a 40-nm CMOS technology. Measurement results indicate a maximum conversion gain of 78.5 dB and a minimum noise figure (NF) of 2.5 dB are achieved. The input 1-dB compression point (IP1dB), in-band (IB) third-order input-referred intercept point (IIP3), and out-of-band (OOB) IIP3 are larger than 0, 9.7, and 13.1 dBm, respectively. The gain and phase mismatch of the quadrature receiver are lower than 0.3 dB and 1°, respectively, over the baseband bandwidth ranging from 410 kHz to 24 MHz. The receiver occupies an area of 0.605 mm2 and consumes a power of 75.4 mW.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"234-247"},"PeriodicalIF":2.8,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-29 | DOI: 10.1109/TVLSI.2024.3480958
Pranav O. Mathews; Jennifer O. Hasler
Analog Hopfield networks perform continuous energy minimization, leading to efficient and near-optimal solutions to nondeterministic polynomial-time (NP)-hard problems. However, practical implementations suffer from scaling and connectivity issues. A programmable and reconfigurable analog Hopfield network is presented that addresses these challenges through a reconfigurable Manhattan architecture with a high-precision 14-bit floating-gate (FG) compute-in-memory (CiM) fabric. The network is implemented on a field-programmable analog array (FPAA) and experimentally tested on three NP-hard problems with different scaling challenges: Weighted Max-Cut (high connectivity and weight precision), the traveling salesman problem (TSP) (high connectivity and medium weight precision), and Boolean Satisfiability/3SAT (low connectivity and weight precision), where it solved each problem optimally in microseconds.
{"title":"A Programmable and Reconfigurable CMOS Analog Hopfield Network for NP-Hard Problems","authors":"Pranav O. Mathews;Jennifer O. Hasler","doi":"10.1109/TVLSI.2024.3480958","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3480958","url":null,"abstract":"Analog Hopfield networks perform continuous energy minimization, leading to efficient and near-optimal solutions to nonpolynomial (NP)-hard problems. However, practical implementations suffer from scaling and connectivity issues. A programmable and reconfigurable analog Hopfield network is presented that addresses these challenges through a reconfigurable Manhattan architecture with a high-precision 14-bit floating-gate (FG) compute-in-memory (CiM) fabric. The network is implemented on a field programmable analog array (FPAA) and experimentally tested on three different NP-hard problems with different scaling challenges: Weighted Max-Cut (high connectivity and weight precision), traveling salesman problem (TSP) (high connectivity and medium weight precision), and Boolean Satisfiability/3SAT (low connectivity and weight precision) where it solved each problem optimally in microseconds.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"821-830"},"PeriodicalIF":2.8,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-29 | DOI: 10.1109/TVLSI.2024.3481993
Hui Hu; Bingbing Yao; Yi Shan; Lei Qiu
The conversion accuracy of a successive approximation register (SAR) analog-to-digital converter (ADC) is mainly affected by capacitor mismatch. In this brief, a histogram-based calibration technique is proposed that does not require any additional analog circuitry. A partial-fitting method is used to detect irregular code densities and to construct a cost function that updates the weights recursively. The calibration is verified with a 12-bit SAR ADC prototype manufactured in a 28-nm standard CMOS process. At a sampling rate of 50 MS/s with a low-frequency input signal, the measurement results indicate that the maximum spurious-free dynamic range (SFDR) can be improved from 77.26 to 88.26 dB while consuming 10.6 fJ/conversion-step, including the reference voltage buffer.
{"title":"A Histogram-Based Calibration Algorithm of Capacitor Mismatch for SAR ADCs","authors":"Hui Hu;Bingbing Yao;Yi Shan;Lei Qiu","doi":"10.1109/TVLSI.2024.3481993","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3481993","url":null,"abstract":"The conversion accuracy of successive approximation register (SAR) analog-to-digital converter (ADC) is mainly affected by the capacitor mismatch. In this brief, a histogram-based calibration technique is proposed, which does not require any additional analog circuitry. In this work, the method of partial fitting is used to detect irregular code densities, and construct a cost function to update the weight recursively. The prototype of the calibration is verified with a 12-bit SAR ADC manufactured in 28-nm standard CMOS process. At the sampling rate 50 MS/s, the measurement results indicate that the maximum spurious-free dynamic range (SFDR) can be improved from 77.26 to 88.26 dB, using 10.6 fJ/conversion-step, including reference voltage buffer, with a low-frequency input signal.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"872-876"},"PeriodicalIF":2.8,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-28 | DOI: 10.1109/TVLSI.2024.3480955
Yongyuan Li; Xuhong Yin; Wei Guo; Qiang Wu; Yongbo Zhang; Yong You; Zhangming Zhu
Power over Ethernet (PoE) technology has gained considerable attention in the networking market owing to its compactness, flexibility, and low application cost. The automatic maintain power signature (MPS) function specified by the IEEE standard draws a periodic pulsed current to support applications that require low-power modes. However, a large driving capability is required because the MPS current exceeds 10 mA, which sacrifices a certain amount of area. This brief proposes an adaptive MPS scheme that reuses the existing class regulator and delay timer to source a pulsed MPS current meeting the MPS requirements, saving an area of 0.0104 mm². The proposed MPS scheme has been fabricated in a 0.18-$\mu$m 120-V BCD process, and the die area is $1.37 \times 1.00$ mm². The experimental results show that the proposed PoE interface draws a pulsed current with a period of 312 ms and a 25.6% duty cycle to address the absence of MPS in very low-power standby modes.
"An Adaptive Maintain Power Signature (MPS) Scheme With Reusable Current Generator for Powered Device (PD)," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 877–881.
Pub Date : 2024-10-28 | DOI: 10.1109/TVLSI.2024.3466132
Jiliang Liu; Huidong Zhao; Zhi Li; Kangning Wang; Shushan Qiao
In this brief, a unified voltage frequency regulator (UVFR) system is designed to eliminate the voltage margin induced by process, voltage, and temperature (PVT) variations. The frequency is regulated together with the voltage by a universal logic line oscillator (ULLO), which protects the system from timing violations. The length of the ULLO is self-calibrated by a ULL-based time-to-digital converter (ULL-TDC) and an in situ half-critical-path timing detector, where the ULL is designed to track the critical path delay. A fully synthesizable digital low-dropout regulator (DLDO) is designed with the ULL-TDC and a proportional differential (PD) circuit for voltage regulation. The proposed system is implemented in an ARM Cortex-M0 microcontroller in a 22-nm technology. Simulation results show that the ULL can accurately track the critical path delay with a maximum variation of 3% at 0.6 V and 11.5% at 0.45 V. The UVFR system consumes 13.2–112 μW of power overhead, reduces the voltage margin by 22.3%–28%, and reduces power consumption by 35%–42.3%.
"A Self-Calibrated Unified Voltage-and-Frequency Regulator System Design Based on Universal Logic Line Circuit," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 593–597.