Pub Date: 2025-06-23 | DOI: 10.1109/TVLSI.2025.3578619
Kun Li;Xiangyu Hao;Zhenguo Ma;Feng Yu;Bo Zhang;Qianjian Xing
This brief presents a pipelined floating-point multiply–accumulator (FPMAC) architecture designed to accelerate sparse linear algebra operations. By designing a lookup-table-based 5–3 carry-save adder (CSA) and combining it with a 3–2 CSA, the proposed design shortens the critical path and boosts operational speed. Moreover, the architecture exploits the data characteristics of sparse linear algebra to move the shift unit out of the critical accumulation loop, further increasing the throughput rate. In addition, the integration of a lookup-table-based leading-zero anticipator (LZA) improves normalization efficiency. Experimental results show that, compared with reported FPMAC designs, the proposed architecture achieves a significantly higher maximum clock frequency for single-precision floating-point operations.
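As a rough illustration of the carry-save idea behind the 5–3/3–2 CSA accumulation loop, the Python sketch below (a functional model under my own assumptions, not the paper's RTL) keeps a running total in redundant sum/carry form so that each accumulation step avoids a full carry-propagate addition; only the final result needs one.

```python
# Minimal sketch of carry-save accumulation (assumed behavior, not the paper's design).

def csa_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder: compress three operands into a sum/carry pair."""
    s = a ^ b ^ c                                # bitwise sum, no carry propagation
    c_out = ((a & b) | (b & c) | (a & c)) << 1   # majority function gives the carries
    return s, c_out

def csa_accumulate(values):
    """Accumulate a stream while keeping the total in (sum, carry) redundant form."""
    acc_s, acc_c = 0, 0
    for v in values:
        acc_s, acc_c = csa_3_2(acc_s, acc_c, v)
    return acc_s + acc_c                         # single final carry-propagate add

assert csa_accumulate([3, 5, 7, 11]) == 26
```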
{"title":"A Fast Floating-Point Multiply–Accumulator Optimized for Sparse Linear Algebra on FPGAs","authors":"Kun Li;Xiangyu Hao;Zhenguo Ma;Feng Yu;Bo Zhang;Qianjian Xing","doi":"10.1109/TVLSI.2025.3578619","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578619","url":null,"abstract":"This brief presents a pipelined floating-point Multiply–Accumulator (FPMAC) architecture designed to accelerate sparse linear algebra operations. By designing a lookup-table-based 5–3 carry-save adder (CSA) and combining it with a 3–2 CSA, the proposed design minimizes the critical path and boosts operational speed. Moreover, the proposed architecture takes advantage of data characteristics in sparse linear algebra to displace the shift unit in the critical accumulation loop, further increasing the throughput rate. In addition, the integration of a lookup-table-based leading-zero anticipator (LZA) enhances normalization efficiency. Experimental results show that, compared with reported FPMAC designs, the proposed architecture may achieve a significantly higher maximum clock frequency for single-precision floating-point operations.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2592-2596"},"PeriodicalIF":3.1,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-23 | DOI: 10.1109/TVLSI.2025.3574427
Jingzhou Li;Fangfei Yu;Mingyuan Ma;Wei Liu;Yuhan Wang;Hualin Wu;Hu He
General-purpose graphics processing units (GPGPUs) have become a leading platform for accelerating modern compute-intensive applications, such as large language models and generative artificial intelligence (AI). However, the lack of advanced open-source GPGPU microarchitectures has hindered high-performance research in this area. In this article, we present Ventus, a high-performance open-source GPGPU implementation built upon the RISC-V architecture with the RISC-V vector (RVV) extension. Ventus introduces customized instructions and a comprehensive software toolchain to optimize performance. We deployed the design on a field-programmable gate array (FPGA) platform consisting of four Xilinx VU19P devices, scaling up to 16 streaming multiprocessors (SMs) and supporting 256 warps. Experimental results demonstrate that Ventus exhibits key performance features comparable to commercial GPGPUs, achieving an average of 83.9% instruction reduction and 87.4% cycle-per-instruction (CPI) improvement over the leading open-source alternatives. Under 4-, 8-, and 16-thread configurations, Ventus maintains robust instructions-per-cycle (IPC) performance with values of 0.47, 0.40, and 0.32, respectively. In addition, the tensor core of Ventus attains an extra average reduction of 69.1% in instruction count and a 68.4% cycle reduction ratio when running AI-related workloads. These findings highlight Ventus as a promising solution for future high-performance GPGPU research and development, offering a robust open-source alternative to proprietary solutions. Ventus is available at https://github.com/THU-DSP-LAB/ventus-gpgpu
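To make the relationship between the two reported metrics concrete: execution cycles are the product of dynamic instruction count and CPI, so an instruction-count reduction and a CPI improvement compound. The sketch below uses made-up placeholder numbers, not figures from the paper.

```python
# Cycle count as instructions x CPI (the classic "iron law" of processor performance);
# the operand values here are hypothetical, not measured Ventus data.
def cycles(instruction_count: int, cpi: float) -> float:
    return instruction_count * cpi

baseline = cycles(1_000_000, 2.0)
optimized = cycles(500_000, 1.0)        # fewer instructions and a lower CPI
print(f"cycle reduction: {1 - optimized / baseline:.0%}")   # 75%
```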
{"title":"RISC-V-Based GPGPU With Vector Capabilities for High-Performance Computing","authors":"Jingzhou Li;Fangfei Yu;Mingyuan Ma;Wei Liu;Yuhan Wang;Hualin Wu;Hu He","doi":"10.1109/TVLSI.2025.3574427","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3574427","url":null,"abstract":"General-purpose graphics processing units (GPGPUs) have become a leading platform for accelerating modern compute-intensive applications, such as large language models and generative artificial intelligence (AI). However, the lack of advanced open-source GPGPU microarchitectures has hindered high-performance research in this area. In this article, we present Ventus, a high-performance open-source GPGPU implementation built upon the RISC-V architecture with vector extension [RISC-V vector (RVV)]. Ventus introduces customized instructions and a comprehensive software toolchain to optimize performance. We deployed the design on a field programmable gate array (FPGA) platform consisting of 4 Xilinx VU19P devices, scaling up to 16 streaming multiprocessors (SMs) and supporting 256 warps. Experimental results demonstrate that Ventus exhibits key performance features comparable to commercial GPGPUs, achieving an average of 83.9% instruction reduction and 87.4% cycle per instruction (CPI) improvement over the leading open-source alternatives. Under 4-, 8-, and 16-thread configurations, Ventus maintains robust instruction per cycle (IPC) performance with values of 0.47, 0.40, and 0.32, respectively. In addition, the tensor core of Ventus attains an extra average reduction of 69.1% in instruction count and a 68.4% cycle reduction ratio when running AI-related workloads. These findings highlight Ventus as a promising solution for future high-performance GPGPU research and development, offering a robust open-source alternative to proprietary solutions. Ventus can be found on <uri>https://github.com/THU-DSP-LAB/ventus-gpgpu</uri>","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2239-2251"},"PeriodicalIF":2.8,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-20 | DOI: 10.1109/TVLSI.2025.3578959
Chuanjie Chen;Xiangyu Meng;Wang Xie;Baoyong Chi
Delay solutions applied at high frequencies typically involve switched transmission lines or all-pass filters. These solutions often suffer from significant insertion loss and drastic gain variations at high frequencies, along with poor delay flatness. In this work, we design a delay circuit suitable for high frequencies, featuring excellent delay flatness, good delay resolution, and a wide bandwidth. In this design, a multistage cascaded sampling circuit is used to generate delays. By introducing differential clocks or three-phase clocks, simple coarse or fine delays can be achieved. The measurement results show that the sample-and-hold circuit achieves a delay accuracy of 17.5 ps and a delay range of 453 ps within 0.5–2.5 GHz, with a gain of −2.7 to 2 dB and a gain variation of ±0.85 dB, a delay variation of less than 7.5 ps, a power consumption of 111 mW, and a core area of 0.137 mm².
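A behavioral sketch of how a coarse/fine control code could compose a programmable delay; the coarse step size below is an assumption for illustration only, while the 17.5-ps fine step is the resolution reported above.

```python
# Illustrative coarse/fine delay composition (not a model of the measured circuit).
COARSE_STEP_PS = 140.0   # assumed coarse step of the cascaded sampling stages
FINE_STEP_PS = 17.5      # fine resolution reported in the abstract

def programmed_delay(coarse_code: int, fine_code: int) -> float:
    """Total programmed delay in picoseconds for a given coarse/fine control code."""
    return coarse_code * COARSE_STEP_PS + fine_code * FINE_STEP_PS

print(programmed_delay(3, 2))   # 455.0 ps, near the 453-ps range reported
```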
{"title":"A Sample-and-Hold-Based 453-ps True Time Delay Circuit With a Wide Bandwidth of 0.5–2.5 GHz in 65-nm CMOS","authors":"Chuanjie Chen;Xiangyu Meng;Wang Xie;Baoyong Chi","doi":"10.1109/TVLSI.2025.3578959","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578959","url":null,"abstract":"Delay solutions applied to high frequencies typically involve switched transmission lines or all-pass filters. These solutions often suffer from significant insertion loss and drastic gain variations at high frequencies, along with poor delay flatness. In this work, we have designed a delay circuit that can be applied to high frequencies, featuring excellent delay flatness, good delay resolution, and a wide bandwidth. In this design, a multistage cascaded sampling circuit is used to generate delays. By introducing differential clocks or three-phase clocks, simple coarse delay or fine delay can be achieved. The measurement results show that the sample and hold circuit achieves a delay accuracy of 17.5 ps and a delay range of 453 ps within 0.5–2.5 GHz, with a gain of −2.7 to 2 dB and a gain variation of ±0.85 dB, a delay variation less than 7.5 ps, a power consumption of 111 mW, and a core area of 0.137 mm<sup>2</sup>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2344-2348"},"PeriodicalIF":2.8,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unstructured pruning introduces significant sparsity in deep neural networks (DNNs), enhancing accelerator hardware efficiency. However, three critical challenges constrain performance gains: 1) complex fetching logic for nonzero (NZ) data pairs; 2) load imbalance across processing elements (PEs); and 3) PE stalls from write-back contention. This brief proposes an energy-efficient accelerator addressing these inefficiencies through three innovations. First, we propose a Cartesian-product output-row-stationary (CPORS) dataflow that inherently matches NZ data pairs by sequentially fetching compressed data. Second, a multilevel partial sum reduction (MLPR) strategy minimizes write-back traffic and converts random PE stalls into manageable load imbalance. Third, a kernel sorting and load scheduling (KSLS) mechanism resolves PE idle/stall conditions and achieves PE array-level load balancing, attaining 76.6% average PE utilization across all sparsity levels. Implemented in 22-nm CMOS, the accelerator delivers a $1.85\times$ speedup and $1.4\times$ higher energy efficiency over the baseline and achieves 25.8 TOPS/W peak energy efficiency at 90% sparsity.
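As a functional illustration of the Cartesian-product matching idea (a software sketch under assumed 1-D convolution semantics, not the accelerator's hardware dataflow), pairing every nonzero activation with every nonzero weight fetched sequentially from the compressed streams removes the need for index-matching fetch logic; each product is simply scattered to the output it contributes to.

```python
from collections import defaultdict

def cartesian_product_conv1d(acts, weights):
    """acts/weights: (index, value) pairs for the nonzero entries of a 1-D signal/kernel."""
    out = defaultdict(float)
    for ai, av in acts:              # sequential fetch from compressed activations
        for wi, wv in weights:       # sequential fetch from compressed weights
            out[ai + wi] += av * wv  # output index for a full 1-D convolution
    return dict(out)

acts = [(0, 2.0), (3, -1.0)]         # dense form: [2, 0, 0, -1]
wts = [(0, 1.0), (1, 0.5)]           # dense form: [1, 0.5]
print(cartesian_product_conv1d(acts, wts))   # {0: 2.0, 1: 1.0, 3: -1.0, 4: -0.5}
```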
{"title":"Accelerating Unstructured Sparse DNNs via Multilevel Partial Sum Reduction and PE Array-Level Load Balancing","authors":"Chendong Xia;Qiang Li;Zhi Li;Bing Li;Huidong Zhao;Shushan Qiao","doi":"10.1109/TVLSI.2025.3577626","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3577626","url":null,"abstract":"Unstructured pruning introduces significant sparsity in deep neural networks (DNNs), enhancing accelerator hardware efficiency. However, three critical challenges constrain performance gains: 1) complex fetching logic for nonzero (NZ) data pairs; 2) load imbalance across processing elements (PEs); and 3) PE stalls from write-back contention. This brief proposes an energy-efficient accelerator addressing these inefficiencies through three innovations. First, we propose a Cartesian-product output-row-stationary (CPORS) dataflow that inherently matches NZ data pairs by sequentially fetching compressed data. Second, a multilevel partial sum reduction (MLPR) strategy minimizes write-back traffic and converts random PE stalls into manageable load imbalance. Third, a kernel sorting and load scheduling (KSLS) mechanism resolves PE idle/stall and achieves PE array-level load balancing, attaining 76.6% average PE utilization across all sparsity levels. Implemented in 22-nm CMOS, the accelerator delivers <inline-formula> <tex-math>$1.85times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$1.4times $ </tex-math></inline-formula> energy efficiency over baseline and achieves 25.8 TOPS/W peak energy efficiency at 90% sparsity.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2329-2333"},"PeriodicalIF":2.8,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Resistive random access memory (RRAM)-based in-memory computing (IMC) architectures are currently receiving widespread attention. Since this computing approach relies on the analog characteristics of the devices, the write variation of RRAM can affect the computational accuracy to varying degrees. Conventional write–verify (W&V) procedures are performed on all weight parameters, resulting in significant time overhead. To address this issue, we propose a training algorithm that can recover the offline IMC accuracy impacted by write variation at a lower W&V cost. We introduce an importance-driven weight allocation (IDWA) algorithm during the training process of the neural network. This algorithm constrains the values of less important weights to suppress the diffusion of variation interference on this part of the weights, thus reducing unnecessary accuracy degradation. Additionally, we employ a layer-wise optimization algorithm to identify the important weights in the neural network for W&V operations. Extensive testing across various deep neural network (DNN) architectures and datasets demonstrates that our proposed selective W&V methodology consistently outperforms current state-of-the-art selective W&V techniques in both accuracy preservation and computational efficiency. At the same accuracy levels, it delivers a speed improvement of $6\times$–$32\times$ compared with other advanced methods.
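A simplified sketch of the selective write–verify idea: only the weights judged important in a layer are programmed with W&V, the rest are written once without verification. The importance criterion used here (weight magnitude) and the W&V ratio are assumptions for illustration, not the IDWA criterion itself.

```python
import numpy as np

def select_for_write_verify(layer_weights: np.ndarray, wv_ratio: float = 0.1):
    """Mark the wv_ratio most 'important' weights of a layer for write-verify."""
    flat = np.abs(layer_weights).ravel()
    k = max(1, int(wv_ratio * flat.size))
    threshold = np.sort(flat)[-k]                 # k-th largest magnitude
    return np.abs(layer_weights) >= threshold     # True -> programmed with W&V

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
mask = select_for_write_verify(w, wv_ratio=0.1)
print(f"fraction verified: {mask.mean():.2f}")    # ~0.10
```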
{"title":"IDWA: A Importance-Driven Weight Allocation Algorithm for Low Write–Verify Ratio RRAM-Based In-Memory Computing","authors":"Jingyuan Qu;Debao Wei;Dejun Zhang;Yanlong Zeng;Zhelong Piao;Liyan Qiao","doi":"10.1109/TVLSI.2025.3578388","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578388","url":null,"abstract":"Resistive random access memory (RRAM)-based in-memory computing (IMC) architectures are currently receiving widespread attention. Since this computing approach relies on the analog characteristics of the devices, the write variation of RRAM can affect the computational accuracy to varying degrees. Conventional write–verify (W&V) procedures are performed on all weight parameters, resulting in significant time overhead. To address this issue, we propose a training algorithm that can recover the offline IMC accuracy impacted by write variation with a lower cost of W&V overhead. We introduce a importance-driven weight allocation (IDWA) algorithm during the training process of the neural network. This algorithm constrains the values of less important weights to suppress the diffusion of variation interference on this part of the weights, thus reducing unnecessary accuracy degradation. Additionally, we employ a layer-wise optimization algorithm to identify important weights in the neural network for W&V operations. Extensive testing across various deep neural networks (DNNs) architectures and datasets demonstrates that our proposed selective W&V methodology consistently outperforms current state-of-the-art selective W&V techniques in both accuracy preservation and computational efficiency. At same accuracy levels, it delivers a speed improvement of <inline-formula> <tex-math>$6times sim 32times $ </tex-math></inline-formula> compared to other advanced methods.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2508-2517"},"PeriodicalIF":3.1,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rise of artificial intelligence (AI), neural network applications are placing growing demands on efficient data transmission. The traditional von Neumann architecture can no longer keep pace with modern technological needs. Computing-in-memory (CIM) has been proposed as a promising solution to address this bottleneck. This work introduces a local computing cell (LCC) scheme based on compact 6T-SRAM cells. The proposed circuit enhances energy efficiency and reduces power consumption by reusing the LCC. The LCC circuit can perform the multiplication of a 2-bit input with a 1-bit weight, which can be applied to convolutional neural networks (CNNs) with multiply–accumulate (MAC) operations. Through circuit reuse, it can also be used for multibit multiply operations, performing 2-bit input multiplication and 1-bit weight addition, which can be applied to grayscale edge detection in images. The SRAM-CIM macro achieves an energy efficiency of 46.3 TOPS/W for MAC operations with 8-bit input and 8-bit weight precision, and up to 389.1–529.1 TOPS/W for computation within one subarray with 2-bit input and 1-bit weight precision. The estimated inference accuracy on the CIFAR-10 dataset is 90.21%.
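The sketch below illustrates, as an assumption about how such a macro composes precision rather than the paper's exact scheme, how an unsigned 8-bit × 8-bit multiply can be built from the 2-bit-input × 1-bit-weight primitive the LCC evaluates, combined by shift-and-add.

```python
def sliced_multiply(x: int, w: int, x_bits: int = 8, w_bits: int = 8) -> int:
    """Unsigned multiply built from 2-bit input slices times 1-bit weight slices."""
    acc = 0
    for i in range(0, x_bits, 2):                  # 2-bit input slices
        x_slice = (x >> i) & 0b11
        for j in range(w_bits):                    # 1-bit weight slices
            w_slice = (w >> j) & 0b1
            acc += (x_slice * w_slice) << (i + j)  # LCC-style primitive, then shift-add
    return acc

assert sliced_multiply(173, 94) == 173 * 94
```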
{"title":"A 28 nm Dual-Mode SRAM-CIM Macro With Local Computing Cell for CNNs and Grayscale Edge Detection","authors":"Chunyu Peng;Xiaohang Chen;Mengya Gao;Jiating Guo;Lijun Guan;Chenghu Dai;Zhiting Lin;Xiulong Wu","doi":"10.1109/TVLSI.2025.3578319","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578319","url":null,"abstract":"With the rise of artificial intelligence (AI), neural network applications are growing in demand for efficient data transmission. The traditional von Neumann architecture can no longer keep pace with modern technological needs. Computing-in-memory (CIM) is proposed as a promising solution to address this bottleneck. This work introduces a local computing cell (LCC) scheme based on compact 6T-SRAM cells. The proposed circuit aims to enhance energy efficiency and reduce power consumption by reusing the LCC. The LCC circuit can perform the multiplication of a 2-bit input with a 1-bit weight, which can be applied to convolutional neural networks (CNNs) with the multiply-accumulate (MAC) operations. Through circuit reuse, it can also be used for multibit multiply operations, performing 2-bit input multiplication and 1-bit weight addition, which can be applied to grayscale edge detection in images. The energy efficiency of the SRAM-CIM macro achieves an energy efficiency of 46.3 TOPS/W under MAC operations with input precision of 8-bits and weight precision of 8-bits, and up to 389.1–529.1 TOPS/W under the calculation in one subarray with an input precision of 2-bits and a weight precision of 1-bit. The estimated inference accuracy on CIFAR-10 datasets is 90.21%.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2264-2273"},"PeriodicalIF":2.8,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-17 | DOI: 10.1109/TVLSI.2025.3576998
Ming Yan;Jaime Cardenas Chavez;Kamal El-Sankary;Li Chen;Xiaotong Lu
This article presents a 10-bit radiation-hardened-by-design (RHBD) SAR analog-to-digital converter (ADC) operating at 50 MS/s, designed for aerospace applications in high-radiation environments. System- and circuit-level redundancy techniques are implemented to mitigate radiation-induced errors and metastability. A novel split coarse/fine asynchronous SAR ADC architecture is proposed to provide system-level redundancy. At the circuit level, single-event effect (SEE) error detection and radiation-hardening techniques are implemented. Our co-designed SEE error detection scheme includes last-bit-cycle (LBC) detection following the LSB cycle and metastability detection (MD) via a ramp generator with a threshold trigger. This approach detects and corrects radiation-induced errors using a coarse/fine redundant algorithm. Radiation-hardened latch comparators and D flip-flops (DFFs) are incorporated to further mitigate SEEs. The prototype design is fabricated in TSMC 65-nm technology, with an ADC core area of 0.0875 mm² and a power consumption of 2.79 mW at a 1.2-V power supply. Postirradiation tests confirm functionality up to a 100-krad(Si) total ionizing dose (TID) and demonstrate over 90% suppression of large SEEs under laser testing.
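For context, a plain (non-hardened) SAR conversion is a bitwise binary search; the behavioral model below shows only that baseline loop and does not model the paper's coarse/fine split, redundancy, or SEE detection.

```python
def sar_convert(vin: float, vref: float = 1.2, n_bits: int = 10) -> int:
    """Ideal successive-approximation conversion: one comparator decision per bit cycle."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        dac_level = trial / (1 << n_bits) * vref
        if vin >= dac_level:          # keep the trial bit if the input exceeds the DAC level
            code = trial
    return code

print(sar_convert(0.45))              # 384 for a 0.45-V input with a 1.2-V reference
```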
{"title":"A 10-bit 50-MS/s Radiation Tolerant Split Coarse/Fine SAR ADC in 65-nm CMOS","authors":"Ming Yan;Jaime Cardenas Chavez;Kamal El-Sankary;Li Chen;Xiaotong Lu","doi":"10.1109/TVLSI.2025.3576998","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576998","url":null,"abstract":"This article presents a 10-bit radiation-hardened-by-design (RHBD) SAR analog-to-digital converter (ADC) operating at 50 MS/s, designed for aerospace applications in high-radiation environments. The system- and circuit-level redundancy techniques are implemented to mitigate radiation-induced errors and metastability. A novel split coarse/fine asynchronous SAR ADC architecture is proposed to provide system-level redundancy. At circuits level, single-event effects (SEEs) error detection and radiation-hardened techniques are implemented. Our co-designed SEE error detection scheme includes last-bit-cycle (LBC) detection following the LSB cycle and metastability detection (MD) via a ramp generator with a threshold trigger. This approach detects and corrects radiation-induced errors using a coarse/fine redundant algorithm. The radiation-hardened latch comparators and D flip-flops (DFFs) are incorporated to further mitigate SEEs. The prototype design is fabricated using TSMC 65-nm technology, with an ADC core area of 0.0875 mm<sup>2</sup> and a power consumption of 2.79 mW at a 1.2-V power supply. Postirradiation tests confirm functionality up to 100-krad(Si) total ionizing dose (TID) and demonstrate over 90% suppression of large SEE under laser testing.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2132-2142"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The big data era has facilitated various memory-centric algorithms, such as the Transformer decoder, neural networks, stochastic computing (SC), and genetic sequence matching, which impose high demands on memory capacity, bandwidth, and access power consumption. Emerging nonvolatile memory devices and the compute-near-memory (CNM) architecture offer a promising solution for memory-bound tasks. This work proposes a hybrid resistive random access memory (RRAM) and static random access memory (SRAM) CNM architecture. The main contributions include: 1) proposing an energy-efficient and high-density CNM architecture based on the hybrid integration of RRAM and SRAM arrays; 2) designing low-power CNM circuits using logic gates and a dynamic-logic adder with a configurable datapath; and 3) proposing a broadcast mechanism with an output-stationary workflow to reduce memory access. The proposed RRAM-SRAM CNM architecture and the dataflow tailored for four distinct applications are evaluated at a 28-nm technology node, achieving 4.62-TOPS/W energy efficiency and 1.20-Mb/mm² memory density, which represent $11.35\times$–$25.81\times$ and $1.44\times$–$4.92\times$ improvements over previous works, respectively.
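A dataflow-level sketch (software only, with function names of my own choosing) of the broadcast, output-stationary workflow: each fetched operand is broadcast across a row of accumulators, so every element is read from memory once while partial sums stay local.

```python
import numpy as np

def output_stationary_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))              # partial sums stay resident ("output-stationary")
    for k in range(K):
        a_col = A[:, k]               # fetched once ...
        b_row = B[k, :]               # ... and broadcast to all accumulators
        C += np.outer(a_col, b_row)   # rank-1 update of every output element
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(output_stationary_matmul(A, B), A @ B)
```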
{"title":"A High-Density Energy-Efficient CNM Macro Using Hybrid RRAM and SRAM for Memory-Bound Applications","authors":"Jun Wang;Shengzhe Yan;Xiangqu Fu;Zhihang Qian;Zhi Li;Zeyu Guo;Zhuoyu Dai;Zhaori Cong;Chunmeng Dou;Feng Zhang;Jinshan Yue;Dashan Shang","doi":"10.1109/TVLSI.2025.3576889","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576889","url":null,"abstract":"The big data era has facilitated various memory-centric algorithms, such as the Transformer decoder, neural network, stochastic computing (SC), and genetic sequence matching, which impose high demands on memory capacity, bandwidth, and access power consumption. The emerging nonvolatile memory devices and compute-near-memory (CNM) architecture offer a promising solution for memory-bound tasks. This work proposes a hybrid resistive random access memory (RRAM) and static random access memory (SRAM) CNM architecture. The main contributions include: 1) proposing an energy-efficient and high-density CNM architecture based on the hybrid integration of RRAM and SRAM arrays; 2) designing low-power CNM circuits using the logic gates and dynamic-logic adder with configurable datapath; and 3) proposing a broadcast mechanism with output-stationary workflow to reduce memory access. The proposed RRAM-SRAM CNM architecture and dataflow tailored for four distinct applications are evaluated at a 28-nm technology, achieving 4.62-TOPS<inline-formula> <tex-math>$/$ </tex-math></inline-formula>W energy efficiency and 1.20-Mb<inline-formula> <tex-math>$/$ </tex-math></inline-formula>mm<sup>2</sup> memory density, which shows <inline-formula> <tex-math>$11.35times $ </tex-math></inline-formula>–<inline-formula> <tex-math>$25.81times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.44times $ </tex-math></inline-formula>–<inline-formula> <tex-math>$4.92times $ </tex-math></inline-formula> improvement compared to previous works, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2339-2343"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
March algorithms are essential for detecting functional memory faults, characterized by their linear complexity and adaptability to emerging technologies. However, the increasing complexity of fault types presents significant challenges to existing fault detection models regarding analytical efficiency and adaptability. This article introduces the test primitive (TP), a unified notation that characterizes March test sequences through a novel methodology that decouples fault detection operations from sensitization states. The proposed TP achieves platform independence and seamless integration of fault models, supported by rigorous theoretical proofs. These proofs establish the fundamental properties of the TP in terms of completeness, uniqueness, and conciseness, providing a theoretical foundation that ensures the decoupling method reduces the computational complexity of March algorithm analysis to $O(1)$. This reduction is analogous to Karnaugh map simplification in digital logic while enabling millisecond-level automated analysis. Experimental results demonstrate that the proposed method significantly enhances both analyzable fault coverage (FC) and detection accuracy, thereby addressing critical limitations of existing fault detection models.
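For readers unfamiliar with March notation, the sketch below models the classic March C- sequence in software (a fault-free memory model using conventional element notation, not the paper's TP notation): each element applies its read/write operations at every address in ascending or descending order.

```python
def march_c_minus(mem):
    """Run March C- on a list-backed memory model; raises if any read mismatches."""
    n = len(mem)

    def element(order, ops):
        for addr in order:
            for op, val in ops:
                if op == "w":
                    mem[addr] = val
                elif mem[addr] != val:                  # read with expected value
                    raise AssertionError(f"fault detected at address {addr}")

    element(range(n), [("w", 0)])                       # any order (w0)
    element(range(n), [("r", 0), ("w", 1)])             # ascending (r0, w1)
    element(range(n), [("r", 1), ("w", 0)])             # ascending (r1, w0)
    element(reversed(range(n)), [("r", 0), ("w", 1)])   # descending (r0, w1)
    element(reversed(range(n)), [("r", 1), ("w", 0)])   # descending (r1, w0)
    element(range(n), [("r", 0)])                       # any order (r0)

march_c_minus([0] * 16)                                 # completes silently on a fault-free memory
```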
{"title":"Test Primitives: The Unified Notation for Characterizing March Test Sequences","authors":"Ruiqi Zhu;Houjun Wang;Susong Yang;Weikun Xie;Yindong Xiao","doi":"10.1109/TVLSI.2025.3577448","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3577448","url":null,"abstract":"March algorithms are essential for detecting functional memory faults, characterized by their linear complexity and adaptability to emerging technologies. However, the increasing complexity of fault types presents significant challenges to existing fault detection models regarding analytical efficiency and adaptability. This article introduces the test primitive (TP), a unified notation that characterizes March test sequences through a novel methodology that decouples fault detection operations from sensitization states. The proposed TP achieves platform independence and seamless integration of fault models, supported by rigorous theoretical proofs. These proofs establish the fundamental properties of the TP in terms of completeness, uniqueness, and conciseness, providing a theoretical foundation that ensures the decoupling method reduces the computational complexity of March algorithm analysis to <inline-formula> <tex-math>$O(1)$ </tex-math></inline-formula>. This reduction is analogous to Karnaugh map simplification in digital logic while enabling millisecond-level automated analysis. Experimental results demonstrate that the proposed method significantly enhances both analyzable fault coverage (FC) and detection accuracy, thereby addressing critical limitations of existing fault detection models.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2542-2555"},"PeriodicalIF":3.1,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design space exploration (DSE) is crucial for optimizing the performance, power, and area (PPA) of CPU microarchitectures ($\mu$-archs). While various machine learning (ML) algorithms have been applied to the $\mu$-arch DSE problem, the potential of reinforcement learning (RL) remains underexplored. In this article, we propose a novel RL-based approach to address the reduced instruction set computer V (RISC-V) CPU $\mu$-arch DSE problem. This approach enables dynamic selection and optimization of $\mu$-arch parameters without relying on predefined modification sequences, thus significantly enhancing exploration flexibility. To address the challenges posed by high-dimensional action spaces and sparse rewards, we use a discrete soft actor-critic (SAC) framework with entropy maximization to promote efficient exploration. In addition, we integrate multistep temporal-difference (TD) learning, an experience replay (ER) buffer, and return normalization to improve sample efficiency and learning stability during training. Our method further aligns optimization with user-defined preferences by normalizing PPA metrics relative to baseline designs. Experimental results on the Berkeley out-of-order machine (BOOM) demonstrate that the proposed approach achieves superior performance compared with state-of-the-art methods, showcasing its effectiveness and efficiency for $\mu$-arch DSE. Our code is available at https://github.com/exhaust-create/SAC-DSE.
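A sketch of the baseline-normalized reward shaping idea described above; the preference weights and the exact form of the normalization are assumptions for illustration, not the paper's definition.

```python
def ppa_reward(perf, power, area, baseline, prefs=(0.5, 0.3, 0.2)):
    """Higher is better; each PPA metric is normalized against the baseline design."""
    perf0, power0, area0 = baseline
    w_perf, w_power, w_area = prefs       # user-defined preference weights (assumed)
    return (w_perf * (perf / perf0)       # higher performance is better
            + w_power * (power0 / power)  # lower power is better
            + w_area * (area0 / area))    # lower area is better

baseline = (1.0, 1.0, 1.0)
print(ppa_reward(1.2, 0.9, 1.1, baseline))   # ~1.12, i.e., better than the baseline's 1.0
```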
{"title":"Efficient Design Space Exploration for the BOOM Using SAC-Based Reinforcement Learning","authors":"Mingjun Cheng;Shihan Zhang;Xin Zheng;Xian Lin;Huaien Gao;Shuting Cai;Xiaoming Xiong;Bei Yu","doi":"10.1109/TVLSI.2025.3572799","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3572799","url":null,"abstract":"Design space exploration (DSE) is crucial for optimizing the performance, power, and area (PPA) of CPU microarchitectures (<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-archs). While various machine learning (ML) algorithms have been applied to the <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE problem, the potential of reinforcement learning (RL) remains underexplored. In this article, we propose a novel RL-based approach to address the reduced instruction set computer V (RISC-V) CPU <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE problem. This approach enables dynamic selection and optimization of <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch parameters without relying on predefined modification sequences, thus significantly enhancing exploration flexibility. To address the challenges posed by high-dimensional action spaces and sparse rewards, we use a discrete soft actor-critic (SAC) framework with entropy maximization to promote efficient exploration. In addition, we integrate multistep temporal-difference (TD) learning, an experience replay (ER) buffer, and return normalization to improve sample efficiency and learning stability during training. Our method further aligns optimization with user-defined preferences by normalizing PPA metrics relative to baseline designs. Experimental results on the Berkeley out-of-order machine (BOOM) demonstrate that the proposed approach achieves superior performance compared with state-of-the-art methods, showcasing its effectiveness and efficiency for <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE. Our code is available at <uri>https://github.com/exhaust-create/SAC-DSE</uri>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2252-2263"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}