Pub Date: 2025-06-26. DOI: 10.1109/OJCAS.2024.3518754
Yifei Zhu;Zhenxuan Luan;Dawei Feng;Weiwei Chen;Lei Ren;Zhangxi Tan
The escalating demand for high-performance and energy-efficient electronics has propelled 3D integrated circuits (3D ICs) as a promising solution. However, a major obstacle has been the lack of specialized electronic design automation (EDA) software and standardized design flows for 3D chiplets. To bridge the gap, we introduce Open3DFlow, an open-source design platform for 3D ICs. It is a seven-step workflow that incorporates essential ASIC back-end processes while supporting multi-physics analysis, such as through-silicon via (TSV) modeling, thermal analysis, and signal integrity (SI) evaluation. To illustrate all functionalities of Open3DFlow, we use it to implement a 3D RISC-V CPU design with a vertically stacked L2 cache on a separate die. We harden both the CPU logic and the 3D-cache die in a GlobalFoundries 0.18 µm (GF180) process with open-source PDK support. We enable face-to-face (F2F) coupling of the top and bottom die by constructing a bonding layer based on the original technology file. Open3DFlow's open-source nature allows seamless integration of custom AI optimization algorithms. As a showcase, we leverage large language models (LLMs) to guide bonding pad placement. In addition, we apply LLMs to back-end Tcl script generation to improve design productivity. We expect Open3DFlow to open up a brand-new paradigm for future 3D IC innovations.
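The geometric constraint behind F2F bonding pad placement can be made concrete with a small sketch. In face-to-face stacking the top die is flipped onto the bottom die, so a pad at (x, y) on the top die must meet its partner at the mirrored x-coordinate. The helper names and coordinates below are illustrative assumptions, not Open3DFlow's actual API.

```python
# Illustrative sketch (not Open3DFlow's API): after an F2F flip about the
# vertical axis, a top-die pad at (x, y) lands at (die_width - x, y) in the
# bottom die's coordinate frame, so paired pads must satisfy that mapping.

def mirror_pad(x, y, die_width):
    """Map a top-die pad location to its bottom-die partner after the flip
    (die_width is the die's x-extent, e.g., in microns)."""
    return (die_width - x, y)

def aligned(top_pad, bottom_pad, die_width, tol=1.0):
    """Check that two pads overlap within an alignment tolerance (microns)."""
    mx, my = mirror_pad(top_pad[0], top_pad[1], die_width)
    return abs(mx - bottom_pad[0]) <= tol and abs(my - bottom_pad[1]) <= tol

print(mirror_pad(100.0, 250.0, 1000.0))                 # (900.0, 250.0)
print(aligned((100.0, 250.0), (900.0, 250.0), 1000.0))  # True
```

A placement tool (or, as in the showcase, an LLM proposing pad sites) only has to satisfy this mirrored-overlap check for every pad pair.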
Published as "Revolutionize 3D-Chip Design With Open3DFlow, an Open-Source AI-Enhanced Solution," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 169-180.
Pub Date: 2025-06-18. DOI: 10.1109/OJCAS.2025.3580744
Inès Winandy;Arnaud Dion;Florent Manni;Pierre-Loïc Garoche;Dorra Ben Khalifa;Matthieu Martel
Precision tuning of fixed-point arithmetic is a powerful technique for optimizing hardware designs on FPGAs, where computing resources and memory are often severely constrained. While fixed-point arithmetic offers significant performance and area advantages over floating-point implementations, deriving an appropriate fixed-point representation remains a challenging task. In particular, developers must carefully select the number of bits assigned to the integer and fractional parts of each variable to balance accuracy and resource consumption. In this article, we introduce an original precision tuning technique for synthesizing fixed-point programs from floating-point code, specifically targeting FPGA platforms. The distinguishing feature of our technique lies in its formal approach to error analysis: it systematically propagates numerical errors through computations to infer variable-specific fixed-point formats that guarantee user-specified accuracy bounds. Unlike heuristic or ad hoc methods, our technique provides formal guarantees on the final accuracy of the generated code, ensuring safe deployment on hardware platforms. To enable hardware-friendly implementations, the resulting fixed-point programs use the ap_fixed data types provided by High-Level Synthesis (HLS) tools, allowing fine-grained control over the precision of each variable. Our method has been implemented within the POPiX 2.0 framework, which automatically generates optimized fixed-point code ready for synthesis. Experimental results on a set of embedded benchmarks show that our fixed-point codes require fewer machine cycles than floating-point codes in most cases when compiled for an FPGA with AMD's state-of-the-art HLS compiler. Our generated fixed-point codes also reduce hardware resource usage, such as LUTs, flip-flops, and DSP blocks, with typical reductions ranging from 67% to 83% compared to double-precision floating-point codes, depending on the application.
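The per-variable error bound at the heart of such an analysis is easy to state: a value stored with n fractional bits is quantized to a multiple of 2^-n, so round-to-nearest introduces at most 2^-(n+1) of error, and a formal analysis propagates such bounds through the computation. The sketch below illustrates only that quantization step (a hypothetical helper, not the POPiX 2.0 implementation).

```python
# Minimal sketch of fixed-point quantization with n fractional bits:
# the representable grid has spacing 2**-n, so the round-to-nearest
# error is bounded by half a grid step, 2**-(n+1).

def to_fixed(x, frac_bits):
    """Round x to the nearest value representable with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

x = 3.14159
for n in (4, 8, 12):
    q = to_fixed(x, n)
    assert abs(q - x) <= 2.0 ** -(n + 1)  # the formal round-off bound holds
    print(n, q)
```

Choosing n per variable is then a trade-off: each extra fractional bit halves the bound but widens the datapath (more LUTs, flip-flops, and DSP bits).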
Published as "Automated Fixed-Point Precision Optimization for FPGA Synthesis," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 192-204.
Pub Date: 2025-04-10. DOI: 10.1109/OJCAS.2025.3559774
Jiovana S. Gomes;Mateus Grellert;Fábio L. L. Ramos;Sergio Bampi
The pervasive presence of video content has spurred the development of advanced technologies to manage, process, and deliver high-quality content efficiently. Video compression is crucial in providing high-quality video services under limited network and storage capacities, traditionally achieved through hybrid codecs. However, as these frameworks reach a performance bottleneck with compression gains becoming harder to achieve with conventional methods, Deep Neural Networks (DNNs) offer a promising alternative. By leveraging DNNs’ nonlinear representation capacity, these networks can enhance compression efficiency and visual quality. Neural Video Coding (NVC) has recently received significant attention, with Neural Image Coding models surpassing traditional codecs in compression ratios. Therefore, this survey explores the state-of-the-art in NVC, examining recent works, frameworks, and the potential of this innovative approach to revolutionize video compression. We identify that NVC models have come a long way since the first proposals and currently are on par in compression efficiency with the latest hybrid codec, VVC. Still, many improvements are required to enable the practical usage of NVC, such as hardware-friendly development to enable faster inference and execution on mobile and energy-constrained devices.
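End-to-end neural codecs of the kind this survey covers are typically trained against a Lagrangian rate-distortion objective, L = R + λ·D, trading estimated bits against reconstruction error. The sketch below shows that objective with illustrative numbers; the operating points and λ are assumptions, not figures from the survey.

```python
# Rate-distortion cost commonly used to train neural codecs:
# L = R + lambda * D, where R is the rate (e.g., bits per pixel)
# and D the distortion (e.g., MSE). Lower cost is better.

def rd_loss(rate_bpp, mse, lam):
    """Lagrangian rate-distortion cost."""
    return rate_bpp + lam * mse

# Two hypothetical operating points of one codec:
coarse = rd_loss(rate_bpp=0.10, mse=40.0, lam=0.01)  # 0.10 + 0.40 = 0.50
fine   = rd_loss(rate_bpp=0.50, mse=5.0,  lam=0.01)  # 0.50 + 0.05 = 0.55
print(coarse, fine)  # at this lambda, the low-rate point wins
```

Sweeping λ traces out the codec's rate-distortion curve, which is how NVC models are compared against hybrid codecs such as VVC.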
Published as "End-to-End Neural Video Compression: A Review," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 120-134.
Pub Date: 2025-04-04. DOI: 10.1109/OJCAS.2025.3557835
Sung-June Byun;Byeong-Gi Jang;Jong-Wan Jo;Dae-Young Choi;Young-Gun Pu;Sang-Sun Yoo;Seok-Kee Kim;Yeon-Jae Jung;Kang-Yoon Lee
This paper presents a bidirectional four-switch buck-boost (FSBB) converter with a high-voltage (HV) gate driver for use in power bank applications. The HV gate driver is integrated into the converter for increased efficiency. Thus, the proposed buck-boost converter can reduce conduction loss over a wide input voltage range by reducing the on-resistance of the external MOSFETs using a gate-source voltage (VGS) of 5 V or 10 V. The chip is fabricated in a 130 nm 1P5M bipolar-CMOS-DMOS HV process with laterally diffused MOSFET (LDMOS) options and has a die size of 2.7 × 2.7 mm². The proposed architecture achieves a maximum output power of 40 W. Measurement results show that the maximum efficiencies at VGS of 5 V and 10 V are 96.67% and 98.15%, respectively.
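The efficiency argument is back-of-the-envelope arithmetic: conduction loss in a MOSFET is P = I_rms² · R_on, and a stronger gate drive (VGS = 10 V rather than 5 V) lowers R_on. The sketch below uses illustrative R_on values, not figures from the paper.

```python
# Why a higher gate drive helps: conduction loss scales linearly with the
# MOSFET on-resistance, which drops as VGS increases. R_on values below are
# hypothetical datasheet-style numbers for illustration only.

def conduction_loss(i_rms, r_on):
    """MOSFET conduction loss in watts: I_rms^2 * R_on."""
    return i_rms ** 2 * r_on

i_rms = 4.0                        # amperes
r_on_5v, r_on_10v = 0.012, 0.008   # ohms at VGS = 5 V and 10 V (assumed)

p5 = conduction_loss(i_rms, r_on_5v)    # 0.192 W
p10 = conduction_loss(i_rms, r_on_10v)  # 0.128 W
print(p5, p10)
```

At tens of watts of output, shaving even a fraction of a watt of conduction loss is what moves peak efficiency from the 96% range toward 98%.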
Published as "Design of a High Efficiency Bi-Directional Four-Switch Buck-Boost Converter With HV Gate Driver for Multi-Cell Battery Power Bank Applications," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 110-119.
Pub Date: 2025-03-28. DOI: 10.1109/OJCAS.2025.3573989
Georgios G. Roumeliotis;Jan Desmet;Jos Knockaert
This paper presents an application of the Haar wavelet operational matrix to the numerical inverse Laplace transform, combined with the intrinsically convenient Haar wavelet transform of an arbitrary time-domain signal. A case study of the transient- and steady-state behavior of the input impedance of a short-circuited transmission line showcases a method for performing the numerical inverse Laplace transform of fractional-order approximative expressions of the skin effect. Furthermore, an improved skin-effect approximation is presented.
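The building block the paper leans on is the discrete Haar wavelet transform. One level of the standard average/difference construction is sketched below; this is the textbook transform only, not the paper's operational-matrix machinery.

```python
# One level of the discrete Haar wavelet transform: pairwise averages give
# the approximation coefficients and pairwise differences the details,
# scaled by 1/sqrt(2) so the transform is orthonormal (energy-preserving).
import math

def haar_level(x):
    """Split a length-2n signal into n approximation and n detail coefficients."""
    assert len(x) % 2 == 0
    s = math.sqrt(2.0)
    approx = [(a + b) / s for a, b in zip(x[0::2], x[1::2])]
    detail = [(a - b) / s for a, b in zip(x[0::2], x[1::2])]
    return approx, detail

x = [4.0, 2.0, 5.0, 7.0]
a, d = haar_level(x)
# Orthonormality means signal energy is preserved across the transform:
energy = sum(v * v for v in a) + sum(v * v for v in d)
print(a, d, energy)  # energy == 16 + 4 + 25 + 49 == 94
```

Recursing on the approximation coefficients yields the full multilevel transform whose matrix form underlies the operational-matrix approach.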
Published as "Circuit Simulation of Any Time-Domain Source on Fractional-Order Impedances by Use of the Haar Wavelet Transform, Case Study of the Skin Effect," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 155-168.
Pub Date: 2025-03-08. DOI: 10.1109/OJCAS.2025.3568058
Hanyoung Lee;Ardianto Satriawan;Hanho Lee
Fully Homomorphic Encryption (FHE) allows computational processing of encrypted data on cloud servers, providing high security and enabling safe data utilization. As homomorphic multiplications are performed on encrypted data, noise accumulates, requiring a process called bootstrapping to restore the noise level of the new ciphertext $ct'$. Bootstrapping involves linear transformation steps, such as Coefficients-to-Slots and Slots-to-Coefficients, in which most of the operations are rotations. A rotation shifts the elements in the slots to new positions based on a rotation index $k$. However, the computational cost and memory bandwidth required for a rotation add significant overhead and limit the ability to perform FHE operations. Therefore, an efficient implementation of rotation is crucial for high-performance FHE applications. To address this problem, we optimized the rotation datapath of the CKKS scheme to be hardware-friendly and propose a homomorphic evaluation cluster hardware accelerator tailored for FHE workloads. Our architecture is aware of the computational and memory constraints of field-programmable gate arrays (FPGAs) and performs the number theoretic transform (NTT), its inverse (INTT), key multiplication, base conversion, and automorphism in a single cluster. We implemented our design on the AMD Alveo U280 FPGA platform. With a polynomial length of $2^{16}$ and operating at 250 MHz as a rotation accelerator, the FPGA implementation shows a speed-up of about 700× compared to the CPU implementation in OpenFHE. Compared to a GPU implementation, it shows a 1.77× speed-up, and compared to previous FPGA implementations, a 1.13× improvement.
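What "rotation by k" means at the slot level can be shown in a few lines: it is a cyclic shift of the slot vector, realized homomorphically by a Galois automorphism plus a key switch. The plaintext view below is illustrative only; a real implementation (e.g., OpenFHE) operates on ciphertext polynomials.

```python
# Plaintext view of CKKS rotation: a cyclic shift of the slots. On
# ciphertexts this is implemented as the ring automorphism X -> X^(5^k mod 2n)
# followed by key switching, which is what makes it expensive in hardware.

def rotate_slots(slots, k):
    """Cyclic left shift of the slot vector by rotation index k."""
    k %= len(slots)
    return slots[k:] + slots[:k]

def galois_element(k, n):
    """Automorphism exponent for rotation by k with ring degree n (modulus 2n)."""
    return pow(5, k, 2 * n)

print(rotate_slots([1, 2, 3, 4], 1))  # [2, 3, 4, 1]
print(galois_element(1, 16))          # 5
```

The shift itself is trivial; the cost the accelerator attacks is the NTT/INTT, key multiplication, and base conversion surrounding the key switch.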
Published as "Homomorphic Evaluation Cluster Architecture for Fully Homomorphic Encryption," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 135-146.
Pub Date: 2025-02-27. DOI: 10.1109/OJCAS.2025.3546464
Maximilian Scherzer;Mario Auer
In this article, an integrated fully differential current amplifier is presented. It was designed for inductive sensor excitation, in this case for a fluxgate sensor; however, the concept is applicable wherever a low-noise, precise current is required. A brief review of some basic elements of the circuit is given, followed by the development of a model that takes into account output impedance limitations due to mismatch as well as stability criteria, an essential consideration in the design of a stable current amplifier for inductive loads. Based on the proposed model, the design and implementation of the current amplifier are outlined, identifying potential difficulties for on-chip integration. The final design was fabricated in a standard 180 nm CMOS technology. Measurement results show that the circuit draws only 2.8 mA from a 3.3 V supply and occupies a total area of 0.64 mm². Special efforts were made to accurately evaluate the output impedance, for which a value of 436 kΩ was recorded. In addition, the current amplifier achieves an output-referred noise current of 2.5 nA/√Hz, resulting in a measured signal-to-noise ratio of more than 105.2 dB over a 512 Hz bandwidth at an output current of 9 mA peak-to-peak.
Published as "An Integrated Fully Differential Current Amplifier With Frequency Compensation for Inductive Sensor Excitation," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 147-154.
Pub Date: 2025-02-26. DOI: 10.1109/OJCAS.2025.3546067
Amirhossein Rostami;Seyed Mohammad Ali Zeinolabedin;Liyuan Guo;Florian Kelber;Heiner Bauer;Andreas Dixius;Stefan Scholze;Marc Berthel;Dennis Walter;Johannes Uhlig;Bernhard Vogginger;Christian Mayr
Over the last few years, online training of deep neural networks (DNNs) on edge and mobile devices has attracted increasing interest in practical use cases due to their adaptability to new environments, personalization, and privacy preservation. Despite these advantages, online learning on resource-restricted devices is challenging. This work demonstrates a 16-bit floating-point, flexible, power- and memory-efficient neural learning unit (NLU) that can be integrated into processors to accelerate the learning process. To achieve this, we implemented three key strategies: a dynamic control unit, a tile allocation engine, and a neural compute pipeline, which together enhance data reuse and improve the flexibility of the NLU. The NLU was integrated into a system-on-chip (SoC) featuring a 32-bit RISC-V core and memory subsystems, fabricated using GlobalFoundries 22 nm FDSOI technology. The design occupies just 0.015 mm² of silicon area and consumes only 0.379 mW of power. The results show that the NLU can accelerate the training process by up to 24.38× and reduce energy consumption by up to 37.37× compared to a RISC-V implementation with a floating-point unit (FPU). Additionally, compared to a state-of-the-art RISC-V with a vector coprocessor, the NLU achieves 4.2× higher energy efficiency (measured in GFLOPS/W). These results demonstrate the feasibility of our design for edge and IoT devices, positioning it favorably among state-of-the-art on-chip learning solutions. Furthermore, we performed mixed-precision on-chip training from scratch for keyword-spotting tasks using the Google Speech Commands (GSC) dataset. Training on just 40% of the dataset, the NLU achieved a training accuracy of 89.34% with stochastic rounding.
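Stochastic rounding, the rounding mode the NLU's mixed-precision training relies on, rounds up with probability equal to the fractional part, making the result unbiased in expectation. The sketch below shows rounding to an integer grid; the NLU's fp16 datapath applies the same idea to floating-point mantissa bits.

```python
# Stochastic rounding to the integer grid: x is rounded up with probability
# equal to its fractional part, so E[round(x)] == x. This keeps tiny gradient
# updates from being systematically lost, unlike round-to-nearest.
import math
import random

def stochastic_round(x, rng):
    f = x - math.floor(x)
    return math.floor(x) + (1 if rng.random() < f else 0)

rng = random.Random(0)
n = 100_000
mean = sum(stochastic_round(0.3, rng) for _ in range(n)) / n
print(mean)  # close to 0.3: unbiased in expectation
```

With deterministic round-to-nearest, 0.3 would round to 0 every time and its contribution would vanish; stochastically it survives on average.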
Published as "NLU: An Adaptive, Small-Footprint, Low-Power Neural Learning Unit for Edge and IoT Applications," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 85-99.
Pub Date : 2025-02-26DOI: 10.1109/OJCAS.2025.3545904
Konstantinos Metaxas;Vassilis Alimisis;Costas Oustoglou;Yannis Kominis;Paul P. Sotiriadis
A comprehensive nonlinear analysis of autonomous and periodically forced fully-differential, negative-resistor LC oscillators is presented. Through nonlinear transformations in the state space, it is shown that oscillators within this class exhibit qualitatively similar dynamical behavior in terms of their limit cycles and bifurcation curves, at least within an open region containing the origin. The case of autonomous, complementary BJT oscillators is used to validate the qualitative analysis and to demonstrate a general approach for numerically extending the bifurcation curves away from the equilibrium point and determining the oscillatory conditions. When an external periodic force is present, we focus on the special case of periodically multiplicatively-forced fully-differential, negative-resistor LC oscillators and use Harmonic Balance techniques to derive analytical expressions that estimate the locking range in the weak-injection regime. The results are used to calculate the locking range of a harmonically forced complementary BJT oscillator, yielding explicit expressions closely aligned with experimental measurements and thus verifying the validity of the analysis.
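For intuition about the weak-injection locking range discussed above: the paper derives its exact expressions via Harmonic Balance, but the classic Adler approximation gives the same qualitative scaling and is a convenient generic stand-in. The tank frequency, quality factor, and injection ratio below are illustrative values, not figures from the paper:

```python
def adler_lock_range(f0_hz, q_factor, inj_ratio):
    """One-sided injection-locking range under Adler's weak-injection
    approximation: delta_f ~= (f0 / (2*Q)) * (I_inj / I_osc).
    Valid only when the injected current is much smaller than the
    oscillator's own tank current (inj_ratio << 1)."""
    return f0_hz / (2.0 * q_factor) * inj_ratio

# e.g., a 2.4 GHz LC tank with Q = 10 and 1% injection current:
print(adler_lock_range(2.4e9, 10.0, 0.01))  # -> 1200000.0  (1.2 MHz one-sided)
```

The linear dependence on the injection ratio and the inverse dependence on Q are the hallmarks of the weak-injection regime; outside it, the full nonlinear analysis of the paper is required.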
{"title":"Nonlinear Analysis of Differential LC Oscillators and Injection Locked Frequency Dividers","authors":"Konstantinos Metaxas;Vassilis Alimisis;Costas Oustoglou;Yannis Kominis;Paul P. Sotiriadis","doi":"10.1109/OJCAS.2025.3545904","DOIUrl":"https://doi.org/10.1109/OJCAS.2025.3545904","url":null,"abstract":"A comprehensive nonlinear analysis of autonomous and periodically forced fully-differential, negative-resistor LC oscillators is presented. Through nonlinear transformations in the state space, it is shown that oscillators within this class exhibit qualitatively similar dynamical behavior in terms of their limit cycles and bifurcation curves, at least within an open region containing the origin. The case of autonomous, complementary BJT oscillators is used to validate the qualitative analysis and demonstrate a general approach of how to numerically extend the bifurcation curves away from the equilibrium point and determine the oscillatory conditions. When external periodic force is present, we focus on the special case of periodically multiplicatively-forced fully-differential, negative-resistor, LC oscillators and use Harmonic Balance techniques to derive analytical expressions estimating the locking range in the weak injection regime. 
The results are used to calculate the locking range of a harmonically forced complementary BJT oscillator yielding explicit expressions closely aligned with experimental measurements, thus verifying the validity of the analysis.","PeriodicalId":93442,"journal":{"name":"IEEE open journal of circuits and systems","volume":"6 ","pages":"100-109"},"PeriodicalIF":2.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10904493","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143637833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-02-04DOI: 10.1109/OJCAS.2025.3538707
Yufei Xiao;Kai Cai;Xiaohu Ge;Yong Xiao
Energy consumption evaluation for data processing tasks, such as encoding and decoding, is a critical consideration in designing very-large-scale integration (VLSI) circuits. Incorporating both information-theoretic and circuit perspectives, a new general energy consumption model is proposed to capture the energy consumption of channel decoder circuits. For the binary erasure channel, lower bounds on energy consumption are derived for two-dimensional (2D) and three-dimensional (3D) decoder circuits under specified error probabilities, along with scaling rules for energy consumption in each case. Based on the proposed model, the lower bounds on energy consumption for staged serial and parallel implementations are derived, and a specific threshold value is identified that determines whether parallel or serial decoding is preferable in decoder circuits. Staged serial implementations in 3D decoder circuits achieve higher energy efficiency than fully parallel implementations when the processed data exceed 48 bits. Simulation results further demonstrate that the energy efficiency of 3D decoders improves with increasing data volume. When the number of input bits is 648, 1296, and 1944, the energy consumption of 3D decoders is reduced by 11.58%, 13.07%, and 13.86% compared to 2D decoders, respectively. The energy consumption of 3D decoders surpasses that of 2D decoders when the decoding error probability falls below a specific threshold of 0.035492.
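The reported serial-vs-parallel crossover can be summarized as a simple selection rule. The 48-bit threshold is the number quoted in the abstract for 3D decoder circuits; the function itself is only an illustrative wrapper around that result, not the paper's energy model:

```python
def choose_3d_decoder_style(n_bits, crossover_bits=48):
    """Per the reported result, staged-serial 3D decoding becomes more
    energy-efficient than fully parallel decoding once the processed
    data exceed `crossover_bits` (48 bits in the abstract)."""
    return "staged-serial" if n_bits > crossover_bits else "fully-parallel"

# the block lengths evaluated in the abstract all fall on the serial side:
for n in (32, 648, 1296, 1944):
    print(n, choose_3d_decoder_style(n))
```

The three evaluated block lengths (648, 1296, 1944) are well past the crossover, consistent with the staged-serial energy reductions reported above.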
{"title":"Energy Consumption Modeling of 2-D and 3-D Decoder Circuits","authors":"Yufei Xiao;Kai Cai;Xiaohu Ge;Yong Xiao","doi":"10.1109/OJCAS.2025.3538707","DOIUrl":"https://doi.org/10.1109/OJCAS.2025.3538707","url":null,"abstract":"Energy consumption evaluation for data processing tasks, such as encoding and decoding, is a critical consideration in designing very large scale integration (VLSI) circuits. Incorporating both information theory and circuit perspectives, a new general energy consumption model is proposed to capture the energy consumption of channel decoder circuits. For the binary erasure channel, lower bounds of energy consumption are derived for two-dimensional (2D) and three-dimensional (3D) decoder circuits under specified error probabilities, along with scaling rules for energy consumption in each case. Based on the proposed model, the lower bounds of energy consumption for staged serial and parallel implementations are derived, and a specific threshold value is identified to determine the parallel or serial decoding in decoder circuits. Staged serial implementations in 3D decoder circuits achieve a higher energy efficiency than fully parallel implementations when the processed data exceed 48 bits. Simulation results further demonstrate that the energy efficiency of 3D decoders improves with increasing data volume. When the number of input bits is 648, 1296 and 1944, the energy consumption of 3D decoders is reduced by 11.58%, 13.07%, and 13.86% compared to 2D decoders, respectively. The energy consumption of 3D decoders surpasses that of 2D decoders when the decoding error probability falls below a specific threshold of 0.035492. 
These results provide a foundational framework and benchmarks for analyzing and optimizing the energy consumption of 2D and 3D channel decoder circuits, enabling more efficient VLSI circuit designs.","PeriodicalId":93442,"journal":{"name":"IEEE open journal of circuits and systems","volume":"6 ","pages":"74-84"},"PeriodicalIF":2.4,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10870295","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143553307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}