IEEE Journal on Exploratory Solid-State Computational Devices and Circuits: Latest Articles (Page 2)

SpecPCM: A Low-Power PCM-Based In-Memory Computing Accelerator for Full-Stack Mass Spectrometry Analysis

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-11-15 DOI: 10.1109/JXCDC.2024.3498837

Keming Fan;Ashkan Moradifirouzabadi;Xiangjin Wu;Zheyu Li;Flavio Ponzina;Anton Persson;Eric Pop;Tajana Rosing;Mingu Kang

Mass spectrometry (MS) is essential for proteomics and metabolomics but faces impending challenges in efficiently processing the vast volumes of data. This article introduces SpecPCM, an in-memory computing (IMC) accelerator designed to achieve substantial improvements in energy and delay efficiency for both MS spectral clustering and database (DB) search. SpecPCM employs analog processing with low-voltage swing and utilizes recently introduced phase change memory (PCM) devices based on superlattice materials, optimized for low-voltage and low-power programming. Our approach integrates contributions across multiple levels: application, algorithm, circuit, device, and instruction sets. We leverage a robust hyperdimensional computing (HD) algorithm with a novel dimension-packing method and develop specialized hardware for the end-to-end MS pipeline to overcome the nonideal behavior of PCM devices. We further optimize multilevel PCM devices for different tasks by using different materials. We also perform a comprehensive design exploration to improve energy and delay efficiency while maintaining accuracy, exploring various combinations of hardware and software parameters controlled by the instruction set architecture (ISA). SpecPCM, with up to three bits per cell, achieves speedups of up to $82\times$ and $143\times$ for MS clustering and DB search tasks, respectively, along with a four-orders-of-magnitude improvement in energy efficiency compared with state-of-the-art (SoA) CPU/GPU tools.
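To make the hyperdimensional (HD) search idea concrete, the following is a minimal software sketch of HD encoding and nearest-match lookup for spectra. It is not SpecPCM's encoder and does not model the dimension-packing method or any PCM behavior. The dimensionality, the random (uncorrelated) intensity-level vectors, the peak dictionaries, and all function names are assumptions chosen for illustration.

```python
import random

DIM = 2048  # hypervector dimensionality (illustrative choice, not from the paper)

def random_hv(rng):
    """Return a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def encode_spectrum(peaks, bin_hvs, level_hvs):
    """Encode a spectrum given as {mz_bin: intensity_level}.

    Each peak binds (elementwise multiply) its bin vector with its level
    vector; the bound vectors are bundled (summed) and binarized by sign.
    """
    acc = [0] * DIM
    for mz_bin, level in peaks.items():
        b, l = bin_hvs[mz_bin], level_hvs[level]
        for i in range(DIM):
            acc[i] += b[i] * l[i]
    return [1 if a >= 0 else -1 for a in acc]

def similarity(a, b):
    """Normalized dot product in [-1, 1]; 1 means identical hypervectors."""
    return sum(x * y for x, y in zip(a, b)) / DIM

def search(query_hv, database):
    """Return the name of the database entry most similar to the query."""
    return max(database, key=lambda name: similarity(query_hv, database[name]))

rng = random.Random(0)
bin_hvs = [random_hv(rng) for _ in range(100)]  # one vector per m/z bin (hypothetical binning)
level_hvs = [random_hv(rng) for _ in range(4)]  # one vector per quantized intensity level

# Hypothetical reference spectra, made up for this example.
reference = {
    "peptide_A": {3: 1, 17: 3, 42: 2, 77: 1, 90: 3},
    "peptide_B": {5: 2, 21: 1, 42: 3, 63: 2, 88: 1},
}
database = {name: encode_spectrum(p, bin_hvs, level_hvs)
            for name, p in reference.items()}

# A noisy copy of peptide_A: one peak missing, one spurious peak added.
query = {3: 1, 17: 3, 42: 2, 90: 3, 55: 1}
best = search(encode_spectrum(query, bin_hvs, level_hvs), database)
print(best)
```

The dot-product comparison against every stored vector is the kind of operation an in-memory array can evaluate in parallel, which is one reason HD representations are attractive for IMC; the robustness to the missing and spurious peaks comes from the high dimensionality.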


Volume 10, pp. 161-169. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10753646

Citations: 0

X-TIME: Accelerating Large Tree Ensembles Inference for Tabular Data With Analog CAMs

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-11-14 DOI: 10.1109/JXCDC.2024.3495634

Giacomo Pedretti;John Moon;Pedro Bruel;Sergey Serebryakov;Ron M. Roth;Luca Buonanno;Archit Gajjar;Lei Zhao;Tobias Ziegler;Cong Xu;Martin Foltin;Paolo Faraboschi;Jim Ignowski;Catherine E. Graves

Structured, or tabular, data are the most common format in data science. While deep learning models have proven formidable in learning from unstructured data such as images or speech, they are less accurate than simpler approaches when learning from tabular data. In contrast, modern tree-based machine learning (ML) models shine in extracting relevant information from structured data. An essential requirement in data science is to reduce model inference latency in cases where, for example, models are used in a closed loop with simulation to accelerate scientific discovery. However, the hardware acceleration community has mostly focused on deep neural networks and largely ignored other forms of ML. Previous work has described the use of an analog content addressable memory (CAM) component for efficiently mapping random forests (RFs). In this work, we develop an analog-digital architecture that implements a novel increased precision analog CAM and a programmable chip for inference of state-of-the-art tree-based ML models, such as eXtreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and others. Thanks to hardware-aware training, X-TIME reaches state-of-the-art accuracy and $119\times$ higher throughput at $9740\times$ lower latency with ${>}150\times$ improved energy efficiency compared with a state-of-the-art GPU for models with up to 4096 trees and depth of 8, with a 19-W peak power consumption.
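The general idea of mapping a decision tree onto a range-matching CAM can be sketched in software as follows: every root-to-leaf path becomes one row of per-feature intervals, and a query matches the single row whose intervals all contain it. This is a functional illustration only; the node format, function names, and example trees are assumptions, and nothing here models X-TIME's analog cells, precision scheme, or chip architecture.

```python
import math

INF = math.inf

def tree_to_rows(node, n_features, bounds=None):
    """Flatten a decision tree into CAM-style rows.

    A node is either {"leaf": value} or
    {"feature": j, "threshold": t, "left": ..., "right": ...},
    where the left branch is taken when x[j] < t.
    Each row is (list of (lo, hi) intervals per feature, leaf value).
    """
    if bounds is None:
        bounds = [(-INF, INF)] * n_features
    if "leaf" in node:
        return [(list(bounds), node["leaf"])]
    j, t = node["feature"], node["threshold"]
    lo, hi = bounds[j]
    left = list(bounds)
    left[j] = (lo, min(hi, t))    # left branch tightens the upper bound
    right = list(bounds)
    right[j] = (max(lo, t), hi)   # right branch tightens the lower bound
    return (tree_to_rows(node["left"], n_features, left)
            + tree_to_rows(node["right"], n_features, right))

def cam_match(rows, x):
    """Return leaf values of all rows whose intervals contain x (lo <= x < hi)."""
    return [leaf for intervals, leaf in rows
            if all(lo <= xi < hi for (lo, hi), xi in zip(intervals, x))]

def ensemble_predict(trees_rows, x):
    """Sum the matched leaf values over all trees (boosting-style output)."""
    return sum(sum(cam_match(rows, x)) for rows in trees_rows)

# Two small hypothetical trees over two features.
tree1 = {"feature": 0, "threshold": 0.5,
         "left": {"leaf": 1.0},
         "right": {"feature": 1, "threshold": 2.0,
                   "left": {"leaf": -0.5},
                   "right": {"leaf": 2.0}}}
tree2 = {"feature": 1, "threshold": 1.0,
         "left": {"leaf": 0.25},
         "right": {"leaf": -0.25}}

rows1 = tree_to_rows(tree1, n_features=2)
rows2 = tree_to_rows(tree2, n_features=2)
x = [0.7, 1.5]
print(len(rows1), ensemble_predict([rows1, rows2], x))
```

In this formulation all rows can be compared against the query at once, so lookup time does not grow with tree depth; features a path never tests keep the full interval and act as "don't care" entries.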


Volume 10, pp. 116-124. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10753423

Citations: 0

Approximated 2-Bit Adders for Parallel In-Memristor Computing With a Novel Sum-of-Product Architecture

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-11-13 DOI: 10.1109/JXCDC.2024.3497720

Christian Simonides;Dominik Gausepohl;Peter M. Hinkel;Fabian Seiler;Nima Taherinejad

Conventional computing methods struggle with the exponentially increasing demand for computational power, caused by applications including image processing and machine learning (ML). Novel computing paradigms such as in-memory computing (IMC) and approximate computing (AxC) provide promising solutions to this problem. Due to their low energy consumption and inherent ability to store data in a nonvolatile fashion, memristors are an increasingly popular choice in these fields. There is a wide range of logic forms compatible with memristive IMC, each offering different advantages. We present a novel mixed-logic solution that utilizes properties of the sum-of-product (SOP) representation and propose a full-adder circuit that works efficiently in 2-bit units. To further improve the speed, area usage, and energy consumption, we propose two additional approximate (Ax) 2-bit adders that exhibit inherent parallelization capabilities. We apply the proposed adders in selected image processing applications, where our Ax approach reduces the energy consumption by 31%–40% and improves the speed by 50%. To demonstrate the potential gains of our approximations in more complex applications, we applied them in ML. Our experiments indicate that with up to $6/16$ Ax adders, there is no accuracy degradation when applied in a convolutional neural network (CNN) that is evaluated on MNIST. Our approach can save up to 125.6 mJ of energy and 505 million steps compared to our exact approach.
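As a rough illustration of how an adder built from 2-bit units trades accuracy for a shorter carry chain, the sketch below models arithmetic behavior only. The approximate block shown here simply ignores its carry-in; this is a hypothetical stand-in chosen for simplicity, not either of the paper's two Ax designs, and the memristive SOP logic itself is not modeled. All names are assumptions.

```python
def exact_block(a, b, cin):
    """Exact 2-bit full adder: returns (2-bit sum, carry-out)."""
    total = a + b + cin
    return total & 0b11, total >> 2

def approx_block(a, b, cin):
    """Hypothetical approximate block: ignores carry-in, so it does not
    have to wait for the neighbouring lower block."""
    total = a + b
    return total & 0b11, total >> 2

def add(x, y, blocks):
    """Add two unsigned integers with a chain of 2-bit blocks (LSB block first)."""
    result, carry = 0, 0
    for i, block in enumerate(blocks):
        a = (x >> (2 * i)) & 0b11
        b = (y >> (2 * i)) & 0b11
        s, carry = block(a, b, carry)
        result |= s << (2 * i)
    return result | (carry << (2 * len(blocks)))

def error_stats(blocks, bits=8):
    """Exhaustively compare against true addition; return (mean, max) absolute error."""
    n = 1 << bits
    total, worst = 0, 0
    for x in range(n):
        for y in range(n):
            err = abs((x + y) - add(x, y, blocks))
            total += err
            worst = max(worst, err)
    return total / (n * n), worst

exact_8bit = [exact_block] * 4
mixed_8bit = [exact_block, approx_block, exact_block, exact_block]

print(error_stats(exact_8bit))  # (0.0, 0)
print(error_stats(mixed_8bit))  # (1.5, 4)
```

In the mixed configuration the only possible error is a dropped carry of weight 4 between the first and second block, which is why the error is bounded and small relative to the 8-bit range. Placing approximate blocks at higher bit positions would increase the error weight accordingly.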


Volume 10, pp. 135-143. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10752571

Citations: 0

A High-Efficiency Charge-Domain Compute-in-Memory 1F1C Macro Using 2-bit FeFET Cells for DNN Processing

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-11-11 DOI: 10.1109/JXCDC.2024.3495612

Nellie Laleni;Franz Müller;Gonzalo Cuñarro;Thomas Kämpfe;Taekwang Jang

This article introduces a 1FeFET-1Capacitance (1F1C) macro based on a 2-bit ferroelectric field-effect transistor (FeFET) cell operating in the charge domain, marking a significant advancement in nonvolatile memory (NVM) and compute-in-memory (CIM). Traditionally, NVMs, such as FeFETs or resistive RAMs (RRAMs), have operated in a single-bit fashion, limiting their computational density and throughput. In contrast, the proposed 2-bit FeFET cell enables higher storage density and improves the computational efficiency in CIM architectures. The macro achieves 111.6 TOPS/W, highlighting its energy efficiency, and demonstrates robust performance on the CIFAR-10 dataset, achieving 89% accuracy with a VGG-8 neural network. These findings underscore the potential of charge-domain, multilevel NVM cells in pushing the boundaries of artificial intelligence (AI) acceleration and energy-efficient computing.
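For readers less familiar with the TOPS/W metric, the reported 111.6 TOPS/W can be converted into energy per operation with simple arithmetic. The sketch below does only that conversion; the operation count used in the second step is a made-up placeholder, not a figure from the article, and the function names are assumptions.

```python
def energy_per_op_fj(tops_per_watt):
    """Convert TOPS/W to femtojoules per operation.

    1 TOPS/W equals 1e12 operations per joule, so the energy per
    operation is 1 / (TOPS/W * 1e12) joules, i.e. 1000 / (TOPS/W) fJ.
    """
    return 1e15 / (tops_per_watt * 1e12)

def workload_energy_uj(tops_per_watt, num_ops):
    """Energy in microjoules for num_ops operations at the given efficiency."""
    return num_ops * energy_per_op_fj(tops_per_watt) * 1e-9  # fJ -> uJ

fj = energy_per_op_fj(111.6)
print(round(fj, 2))  # 8.96

# Hypothetical workload of one billion operations (placeholder value).
print(round(workload_energy_uj(111.6, 1e9), 2))  # 8.96
```

This is a peak-efficiency figure for the macro; system-level energy would also include data movement and peripheral costs that the conversion does not capture.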


Citations: 0

System-Technology Co-Optimization for Dense Edge Architectures Using 3-D Integration and Nonvolatile Memory

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-11-11 DOI: 10.1109/JXCDC.2024.3496118

Leandro M. Giacomini Rocha;Mohamed Naeim;Guilherme Paim;Moritz Brunion;Priya Venugopal;Dragomir Milojevic;James Myers;Mustafa Badaroglu;Marian Verhelst;Julien Ryckaert;Dwaipayan Biswas

High-performance edge artificial intelligence (Edge-AI) inference applications aim for high energy efficiency, memory density, and small form factor, requiring a design-space exploration across the whole stack: workloads, architecture, mapping, and co-optimization with emerging technology. In this article, we present a system-technology co-optimization (STCO) framework that interfaces with workload-driven system scaling challenges and physical design-enabled technology offerings. The framework is built on three engines that provide the physical design characterization, dataflow mapping optimizer, and system efficiency predictor. The framework builds on a systolic array accelerator to provide the design-technology characterization points using advanced imec A10 nanosheet CMOS node along with emerging, high-density voltage-gated spin-orbit torque (VGSOT) magnetic memories (MRAM), combined with memory-on-logic fine-pitch 3-D wafer-to-wafer hybrid bonding. We observe that the 3-D system integration of static random-access memory (SRAM)-based design leads to 9% power savings with 53% footprint reduction at iso-frequency with respect to 2-D implementation for the same memory capacity. Three-dimensional nonvolatile memory (NVM)-VGSOT allows a $4\times$ memory capacity increase with 30% footprint reduction at iso-power compared with the $1\times$ 2-D SRAM baseline. Our exploration with two diverse workloads, image resolution enhancement (FSRCNN) and eye tracking (EDSNet), shows that more resources allow better workload mapping possibilities, which can compensate for peak system energy efficiency degradation in high-memory-capacity cases. We show that a 25% peak efficiency reduction at $32\times$ memory capacity can lead to $7.4\times$ faster execution with $5.7\times$ higher effective TOPS/W than the $1\times$ memory capacity case on the same technology.
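The relationship between the reported speedup and effective-efficiency figures can be unpacked with a short calculation. The sketch assumes that effective TOPS/W means useful operations divided by total energy and that both configurations run the same workload (the same number of useful operations); under those assumptions, which are not stated in the abstract, the energy and average-power ratios follow directly. Function names are illustrative.

```python
def relative_energy(effective_efficiency_gain):
    """Energy of the larger configuration relative to the 1x case,
    for the same number of useful operations."""
    return 1.0 / effective_efficiency_gain

def relative_average_power(speedup, effective_efficiency_gain):
    """Average power ratio = (energy ratio) / (runtime ratio)."""
    return speedup / effective_efficiency_gain

energy_ratio = relative_energy(5.7)
power_ratio = relative_average_power(7.4, 5.7)
print(round(energy_ratio, 3))  # 0.175
print(round(power_ratio, 3))   # 1.298
```

Read this way, the larger-memory configuration would finish the workload with roughly 18% of the energy while drawing about 30% more average power during its much shorter run, which is consistent with a lower peak efficiency being offset by better mapping.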


Volume 10, pp. 125-134. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10750212

Citations: 0

Design Considerations for Sub-1-V 1T1C FeRAM Memory Circuits

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-10-30 DOI: 10.1109/JXCDC.2024.3488578

Mohammad Adnaan;Sou-Chi Chang;Hai Li;Yu-Ching Liao;Ian A. Young;Azad Naeemi

We present a comprehensive benchmarking framework for one transistor-one capacitor (1T1C) low-voltage ferroelectric random access memory (FeRAM) circuits. We focus on the most promising ferroelectric materials, hafnium zirconium oxide (HZO) and barium titanate (BTO), known for their fast switching speeds and low coercive voltages. We model ferroelectric capacitors using physics-based phase-field models and calibrate the polarization switching speed and hysteresis loop versus experimental data. Ferroelectric memory cells are designed using a 28-nm process design kit (PDK), incorporating peripheral circuitry and interconnect parasitics. We set up the memory array circuit design and analyze its performance by varying the row/column size of the memory array, as well as driver and capacitor sizes. Our results are compared with other emerging memory technologies, particularly magnetic/spintronic memories, in terms of read/write latencies and energy consumption. We identify the critical aspects of the ferroelectric memory array performance, such as the effect of plateline driver and bitline capacitances, and provide recommendations to further optimize the performance of such low operating voltage ferroelectric memory circuits.
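To see why bitline capacitance and array size matter for a 1T1C FeRAM read, a first-order charge-sharing estimate is useful. The sketch below uses the textbook approximation that the switched charge 2·P_r·A is dumped onto the bitline capacitance; it ignores the capacitor's linear dielectric contribution and switching dynamics, and it is not the phase-field model used in the article. The remanent polarization, capacitor size, and capacitance values are hypothetical.

```python
def switched_charge(p_r, area):
    """Charge released by full polarization reversal: 2 * P_r * A (coulombs).

    p_r  -- remanent polarization in C/m^2
    area -- capacitor area in m^2
    """
    return 2.0 * p_r * area

def sense_margin(p_r, area, c_bitline):
    """First-order bitline voltage difference between the switching and
    non-switching cases, assuming the charge is shared onto c_bitline (farads)."""
    return switched_charge(p_r, area) / c_bitline

# Hypothetical values: 20 uC/cm^2 = 0.2 C/m^2, a 100 nm x 100 nm capacitor.
P_R = 0.2
AREA = (100e-9) ** 2

margin_small_array = sense_margin(P_R, AREA, 40e-15)  # 40 fF bitline
margin_large_array = sense_margin(P_R, AREA, 80e-15)  # twice the rows -> roughly 2x C_BL

print(round(margin_small_array * 1e3, 1), "mV")
print(round(margin_large_array * 1e3, 1), "mV")
```

Under this approximation, doubling the bitline capacitance halves the sense margin, which gives an intuition for why row count and capacitor sizing appear as key design variables at low operating voltages.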


Volume 10, pp. 107-115. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10738514

Citations: 0

Heterogeneous Integration Technologies for Artificial Intelligence Applications

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-10-23 DOI: 10.1109/JXCDC.2024.3484958

Madison Manley;Ashita Victor;Hyunggyu Park;Ankit Kaul;Mohanalingam Kathaperumal;Muhannad S. Bakir

The rapid advancement of artificial intelligence (AI) has been enabled by semiconductor-based electronics. However, the conventional methods of transistor scaling are not enough to meet the exponential demand for computing power driven by AI. This has led to a technological shift toward system-level scaling approaches, such as heterogeneous integration (HI). HI is becoming increasingly implemented in many AI accelerator products due to its potential to enhance overall system performance while also reducing electrical interconnect delays and energy consumption, which are critical for supporting data-intensive AI workloads. In this review, we introduce current and emerging HI technologies and their potential for high-performance systems. We then survey recent industrial and research progress in 3-D HI technologies that enable high bandwidth systems and finally present the emergence of glass core packaging for high-performance AI chip packages.


Citations: 0

Scaling Logic Area With Multitier Standard Cells

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-10-17 DOI: 10.1109/JXCDC.2024.3482464

Florian Freye;Christian Lanius;Hossein Hashemi Shadmehri;Diana Göhringer;Tobias Gemmeke

While the footprint of digital complementary metal-oxide–semiconductor (CMOS) circuits has continued to decrease over the years, physical limitations for further intralayer geometric scaling become apparent. To further increase the logic density, the international roadmap for devices and systems (IRDS) predicts a transition from a single layer of transistors per die to monolithically stacking transistors in multiple tiers starting in 2031. This raises the question of the extent to which these can be exploited in 3-D standard cells to improve logic density. In this work, we investigate the scaling potential of realizing standard cells employing two or three dedicated tiers. For this, specific multitier virtual physical design kits are derived based on the open ASAP7. A typical RISC-V implementation realized in a classic standard cell library is used to identify the subset of the most relevant standard cells. In accordance with the virtual physical design kit (PDK), 3-D derivatives of the single-tier standard cells are crafted and evaluated with respect to achievable logic density considering standard synthesis benchmarks and blocks on the architecture level.


Volume 10, pp. 82-88. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10720813

Citations: 0

Energy-/Carbon-Aware Evaluation and Optimization of 3-D IC Architecture With Digital Compute-in-Memory Designs

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-10-11 DOI: 10.1109/JXCDC.2024.3479100

Hyung Joon Byun;Udit Gupta;Jae-Sun Seo

Several 2-D architectures have been presented for energy-efficient artificial intelligence (AI) inference, including systolic arrays and compute-in-memory (CIM) arrays. To increase energy efficiency within a constrained area, 3-D technologies have been actively investigated; they can shorten the data path or enlarge the activation buffer, enabling higher energy efficiency. Several works have reported 3-D architectures using non-CIM designs, but 3-D architectures with CIM macros have not been well studied. In this article, we investigate digital CIM (DCIM) macros and various 3-D architectures to identify opportunities for higher energy efficiency than 2-D structures. We also investigate the carbon footprint of 3-D architectures. We have built in-house simulators that calculate energy and area from high-level hardware descriptions and DNN workloads, and integrated them with a carbon estimation tool to analyze the embodied carbon of various hardware designs. We have investigated different types of 3-D DCIM architectures and dataflows, which show average energy savings of 42.5% compared with 2-D systolic arrays. We have also analyzed the tradeoff between performance and carbon footprint, along with the associated optimization opportunities.
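
To make the reported metrics concrete, here is a minimal sketch of how an average energy saving across workloads can be computed, and how embodied carbon might be combined with operational carbon. The in-house simulators and the carbon estimation tool are not described in the abstract, so the function names, the workload names, all numeric values, and the operational-carbon term (energy times grid carbon intensity, a common accounting convention) are assumptions for illustration only.

```python
from typing import Dict


def energy_savings_percent(baseline_energy: float, new_energy: float) -> float:
    """Percentage of energy saved by the new design relative to the baseline."""
    return (baseline_energy - new_energy) / baseline_energy * 100.0


def average_savings(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Mean of the per-workload savings percentages."""
    savings = [energy_savings_percent(baseline[w], candidate[w]) for w in baseline]
    return sum(savings) / len(savings)


def total_carbon(embodied_kg: float, energy_kwh: float, grid_kg_per_kwh: float) -> float:
    """Embodied carbon plus operational carbon (energy use times grid carbon intensity)."""
    return embodied_kg + energy_kwh * grid_kg_per_kwh


# Hypothetical per-workload energies (arbitrary units) for a 2-D baseline
# and a 3-D candidate; these are not results from the paper.
energy_2d = {"workload_a": 10.0, "workload_b": 20.0}
energy_3d = {"workload_a": 6.0, "workload_b": 10.0}
mean_saving = average_savings(energy_2d, energy_3d)

# Hypothetical carbon inputs: a design with more embodied carbon can still
# come out ahead if it uses less energy over its lifetime.
carbon_2d = total_carbon(embodied_kg=5.0, energy_kwh=100.0, grid_kg_per_kwh=0.4)
carbon_3d = total_carbon(embodied_kg=8.0, energy_kwh=60.0, grid_kg_per_kwh=0.4)
print(mean_saving, carbon_2d, carbon_3d)
```

Averaging per-workload percentages weights every workload equally; averaging total energy instead would weight workloads by their size, so the two can give different headline numbers.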

Hyung Joon Byun; Udit Gupta; Jae-Sun Seo, "Energy-/Carbon-Aware Evaluation and Optimization of 3-D IC Architecture With Digital Compute-in-Memory Designs," IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, vol. 10, pp. 98-106. Publication date: 2024-10-11. DOI: https://doi.org/10.1109/JXCDC.2024.3479100. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10714410

Citations: 0

Accuracy Improvement With Weight Mapping Strategy and Output Transformation for STT-MRAM-Based Computing-in-Memory

IF 2 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Pub Date : 2024-10-11 DOI: 10.1109/JXCDC.2024.3478360

Xianggao Wang;Na Wei;Shifan Gao;Wenhao Wu;Yi Zhao

This work presents an analog computing-in-memory (CiM) macro built with spin-transfer torque magnetic random access memory (STT-MRAM) in 28-nm CMOS technology. The adopted CiM bitcell uses a differential scheme and balances the input resistance to minimize nonideal factors during multiply-accumulate (MAC) operations. Specialized peripheral circuits were designed for the current-scheme CiM architecture. More importantly, two strategies for improving accuracy are proposed: 1) mapping the most significant bit (MSB) to the far side of the MRAM array and 2) applying a linear transformation to the outputs based on the reference columns. Circuit-level simulation verified the functionality and performance improvement of the CiM macro on the MNIST and CIFAR-10 datasets, with accuracy losses of 3% and 5%, respectively, compared with the benchmark. The 640-GOPS (8 bit) throughput, 34.6-TOPS/mm² area compactness, and 83.3-TOPS/W energy efficiency demonstrate the advantages of STT-MRAM CiM in the coming AI era.
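
The abstract names an output linear transformation based on reference columns but gives no details. One plausible reading, sketched below under that assumption, is that the reference columns hold known patterns, so their ideal MAC results are known; a gain and offset fitted on those columns can then correct the outputs of the data columns. The function names `fit_linear` and `correct_outputs` and all numeric values are hypothetical, and the paper's actual circuit-level method may differ.

```python
from typing import List, Tuple


def fit_linear(measured: List[float], expected: List[float]) -> Tuple[float, float]:
    """Least-squares gain and offset such that expected ~= gain * measured + offset."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_e = sum(expected) / n
    # Variance of the measurements and their covariance with the ideal values.
    var_m = sum((m - mean_m) ** 2 for m in measured)
    cov = sum((m - mean_m) * (e - mean_e) for m, e in zip(measured, expected))
    gain = cov / var_m
    offset = mean_e - gain * mean_m
    return gain, offset


def correct_outputs(raw: List[float], gain: float, offset: float) -> List[float]:
    """Apply the fitted linear transformation to raw column outputs."""
    return [gain * r + offset for r in raw]


# Hypothetical reference columns: ideal MAC results and distorted readings.
# The readings were generated as 0.9 * ideal + 3.0 to mimic a gain and offset error.
ref_expected = [0.0, 32.0, 64.0, 96.0]
ref_measured = [3.0, 31.8, 60.6, 89.4]

gain, offset = fit_linear(ref_measured, ref_expected)

# A data column that read 46.2 under the same distortion maps back to about 48.
corrected = correct_outputs([46.2], gain, offset)
print(gain, offset, corrected)
```

A linear fit only removes gain and offset errors shared by the reference and data columns; nonlinear or column-specific errors would remain after this correction.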

Xianggao Wang; Na Wei; Shifan Gao; Wenhao Wu; Yi Zhao, "Accuracy Improvement With Weight Mapping Strategy and Output Transformation for STT-MRAM-Based Computing-in-Memory," IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, vol. 10, pp. 75-81. Publication date: 2024-10-11. DOI: https://doi.org/10.1109/JXCDC.2024.3478360. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10714384

Citations: 0