
Latest Publications from IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

AxOSpike: Spiking Neural Networks-Driven Approximate Operator Design
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3443000
Salim Ullah;Siva Satyendra Sahoo;Akash Kumar
Approximate computing (AxC) is being widely researched as a viable approach to deploying compute-intensive artificial intelligence (AI) applications on resource-constrained embedded systems. In general, AxC aims to provide disproportionate gains in system-level power-performance-area (PPA) by leveraging the implicit error tolerance of an application. One of the more widely used methods in AxC involves circuit pruning of the arithmetic operators used to process AI workloads. However, most related works adopt an application-agnostic approach to operator modeling for the design space exploration (DSE) of approximate operators (AxOs). To this end, we propose an application-driven approach to designing AxOs. Specifically, we use spiking neural network (SNN)-based inference to present an application-driven operator model, resulting in AxOs with better PPA-accuracy tradeoffs than traditional circuit pruning. Additionally, we present a novel FPGA-specific operator model to improve the quality of AxOs obtainable with circuit pruning. With the proposed methods, we report designs with up to 26.5% lower PDP×LUT at similar application-level accuracy, and a considerably better set of design points than related works, with up to 51% better Pareto-front hypervolume.
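To make the circuit-pruning baseline concrete, the sketch below mimics a pruned operator by truncating the low-order bits of an integer adder and measures the accuracy half of the PPA-accuracy tradeoff. This is a minimal illustration, not code from the paper; `approx_add` and the error metric are illustrative stand-ins for evaluating real pruned netlists.

```python
# Minimal sketch of application-agnostic AxO evaluation: a truncated
# ("pruned") integer adder and its mean relative error over random
# operands. Truncation is a stand-in for circuit pruning; the real
# AxOSpike flow evaluates operators inside SNN inference instead.
import random

def approx_add(a: int, b: int, pruned_bits: int) -> int:
    """Add two integers with the lowest `pruned_bits` bits zeroed out,
    mimicking an operator whose low-order carry logic was pruned."""
    mask = ~((1 << pruned_bits) - 1)
    return (a & mask) + (b & mask)

def mean_relative_error(pruned_bits: int, trials: int = 10_000) -> float:
    err = 0.0
    for _ in range(trials):
        a, b = random.randrange(1, 256), random.randrange(1, 256)
        err += abs((a + b) - approx_add(a, b, pruned_bits)) / (a + b)
    return err / trials

for k in range(5):
    # More pruned bits -> smaller circuit (lower PPA cost), larger error.
    print(f"pruned_bits={k}  mean_rel_error={mean_relative_error(k):.4f}")
```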
Citations: 0
VALO: A Versatile Anytime Framework for LiDAR-Based Object Detection Deep Neural Networks
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3443774
Ahmet Soyyigit;Shuochao Yao;Heechul Yun
This work addresses the challenge of adapting to dynamic deadline requirements in LiDAR object detection deep neural networks (DNNs). The computing latency of object detection is critically important for safe and efficient navigation. However, state-of-the-art LiDAR object detection DNNs often exhibit significant latency, hindering their real-time performance on resource-constrained edge platforms. The tradeoff between detection accuracy and latency should therefore be managed dynamically at runtime to achieve optimal results. In this article, we introduce the versatile anytime algorithm for LiDAR object detection (VALO), a novel data-centric approach that enables anytime computing of 3-D LiDAR object detection DNNs. VALO employs a deadline-aware scheduler to selectively process input regions, trading execution time against accuracy without architectural modifications. Additionally, it leverages efficient forecasting of past detection results to mitigate possible loss of accuracy due to partial processing of the input. Finally, it utilizes a novel input reduction technique within its detection heads to significantly accelerate execution without sacrificing accuracy. We implement VALO on the state-of-the-art 3-D LiDAR object detection networks CenterPoint and VoxelNext, and demonstrate its dynamic adaptability to a wide range of time constraints while achieving higher accuracy than the prior state of the art. Code is available at https://github.com/CSL-KU/VALO.
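As a rough illustration of deadline-aware region scheduling, the following sketch processes as many input regions as fit in a latency budget and defers the rest. All names (`schedule_regions`, `anytime_detect`) and the fixed per-region latency estimate are hypothetical, not taken from the VALO repository.

```python
# Minimal sketch of VALO-style anytime processing: select the subset of
# input regions that fits the deadline budget and carry the rest forward.
import time

def schedule_regions(regions, deadline_s, est_latency_per_region_s):
    """Return (regions to process now, regions deferred to later frames)."""
    budget = int(deadline_s / est_latency_per_region_s)
    return regions[:budget], regions[budget:]

def anytime_detect(regions, deadline_s, detect_fn, est_latency=0.01):
    start = time.monotonic()
    selected, deferred = schedule_regions(regions, deadline_s, est_latency)
    results = []
    for r in selected:
        if time.monotonic() - start > deadline_s:  # hard stop at deadline
            deferred.append(r)
            continue
        results.append(detect_fn(r))
    return results, deferred  # deferred regions rely on forecasting

# Toy usage: "detection" just tags the region id.
res, todo = anytime_detect(list(range(16)), deadline_s=0.05,
                           detect_fn=lambda r: {"region": r, "boxes": []})
print(len(res), "regions processed,", len(todo), "deferred")
```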
Citations: 0
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3443692
Peiyan Dong;Jinming Zhuang;Zhuoping Yang;Shixin Ji;Yanyu Li;Dongkuan Xu;Heng Huang;Jingtong Hu;Alex K. Jones;Yiyu Shi;Yanzhi Wang;Peipei Zhou
While vision transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (<1 […]) […] $13.1\times$ over computing solutions of Intel Xeon 8375C vCPU, Nvidia A10G, A100, Jetson AGX Orin GPUs, AMD ZCU102, and U250 FPGAs. The energy efficiency gains are $62.2\times$, $15.33\times$, $12.82\times$, $13.31\times$, $13.5\times$, and $21.9\times$, respectively.
Citations: 0
NOBtree: A NUMA-Optimized Tree Index for Nonvolatile Memory
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3438111
Zhaole Chu;Peiquan Jin;Yongping Luo;Xiaoliang Wang;Shouhong Wan
Nonvolatile memory (NVM) suffers from more serious nonuniform memory access (NUMA) effects than DRAM because of the lower bandwidth and higher latency. While numerous works have aimed at optimizing NVM indexes, only a few of them tried to address the NUMA impact. Existing approaches mainly rely on local NVM write buffers or DRAM-based read buffers to mitigate the cost of remote NVM access, which introduces memory overhead and causes performance degradation for lookup and scan operations. In this article, we present NOBtree, a new NUMA-optimized persistent tree index. The novelty of NOBtree is two-fold. First, NOBtree presents per-NUMA replication and an efficient node-migration mechanism to reduce remote NVM access. Second, NOBtree proposes a NUMA-aware NVM allocator to improve the insert performance and scalability. We conducted experiments on six workloads to evaluate the performance of NOBtree. The results show that NOBtree can effectively reduce the number of remote NVM accesses. Moreover, NOBtree outperforms existing persistent indexes, including TLBtree, Fast&Fair, ROART, and PACtree, by up to $3.23\times$ in throughput and $4.07\times$ in latency.
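A toy sketch of the per-NUMA replication idea for read-mostly index nodes follows: lookups read the replica local to the caller's NUMA node, while inserts update every replica. This is an illustrative Python model only; the actual NOBtree pins replicas to NVM on each NUMA node rather than using Python dicts.

```python
# Minimal sketch of per-NUMA replication: reads stay local, writes fan
# out to every replica. All names here are illustrative.
class ReplicatedInner:
    def __init__(self, num_numa_nodes: int):
        # One replica of the (read-mostly) inner-node index per NUMA node.
        self.replicas = [dict() for _ in range(num_numa_nodes)]

    def insert(self, key, leaf_id):
        for replica in self.replicas:      # writes hit every replica
            replica[key] = leaf_id

    def lookup(self, key, local_numa: int):
        return self.replicas[local_numa].get(key)  # read stays local

index = ReplicatedInner(num_numa_nodes=2)
index.insert("k1", leaf_id=42)
print(index.lookup("k1", local_numa=0), index.lookup("k1", local_numa=1))
```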
Citations: 0
Arch2End: Two-Stage Unified System-Level Modeling for Heterogeneous Intelligent Devices
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3443706
Weihong Liu;Zongwei Zhu;Boyu Li;Yi Xiong;Zirui Lian;Jiawei Geng;Xuehai Zhou
The surge in intelligent edge computing has propelled the adoption and expansion of distributed embedded systems (DESs). Numerous scheduling strategies have been introduced to improve DES throughput, such as latency-aware and group-based hierarchical scheduling. Effective device modeling can help in modular and plug-in scheduler design. For uniformity in scheduling interfaces, a unified device performance model is adopted, typically involving system-level modeling that incorporates both the hardware and software stacks, broadly divided into two categories. Fine-grained modeling methods based on hardware architecture analysis become very difficult when dealing with a large number of heterogeneous devices, mainly because much architecture information is closed-source and costly to analyse. Coarse-grained methods are based on limited architecture information or benchmark models, resulting in insufficient generalization to the complex inference performance of diverse deep neural networks (DNNs). Therefore, we introduce a two-stage system-level modeling method (Arch2End), combining limited architecture information with scalable benchmark models to achieve a unified performance representation. Stage one leverages public information to analyse architectures in a uniform abstraction and to design benchmark models for exploring device performance boundaries, ensuring uniformity. Stage two extracts critical device features from the end-to-end inference metrics of extensive simulation models, ensuring universality and enhancing characterization capacity. Compared to state-of-the-art methods, Arch2End achieves the lowest DNN latency prediction relative errors on NAS-Bench-201 (1.7%) and real-world DNNs (8.2%). It also showcases superior performance in intergroup-balanced device grouping strategies.
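A minimal sketch of the stage-two intuition, under the assumption that a device's end-to-end latencies on a small benchmark suite serve as its feature vector: a least-squares fit (standing in for Arch2End's learned model) predicts an unseen DNN's latency on a new device from its benchmark results alone. All numbers below are made up.

```python
# Minimal sketch: benchmark latencies as device features, plus a linear
# least-squares predictor for a target DNN's latency on a new device.
import numpy as np

# Rows: devices; columns: measured latencies (ms) on 3 benchmark models.
device_features = np.array([
    [12.0, 48.0, 5.1],    # e.g., an embedded GPU
    [ 3.5, 14.2, 1.6],    # e.g., a desktop GPU
    [25.0, 99.0, 11.0],   # e.g., a low-power CPU
    [ 7.9, 30.5, 3.4],
])
# Measured latency of the target DNN on each device (training signal).
target_latency = np.array([31.0, 9.1, 64.0, 20.2])

coef, *_ = np.linalg.lstsq(device_features, target_latency, rcond=None)
new_device = np.array([10.0, 41.0, 4.4])   # benchmark results only
print(f"predicted target-DNN latency: {new_device @ coef:.1f} ms")
```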
Citations: 0
ARTEMIS: A Mixed Analog-Stochastic In-DRAM Accelerator for Transformer Neural Networks
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3446719
Salma Afifi;Ishan Thakkar;Sudeep Pasricha
Transformers have emerged as a powerful tool for natural language processing (NLP) and computer vision. Through the attention mechanism, these models have exhibited remarkable performance gains when compared to conventional approaches like recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Nevertheless, transformers typically demand substantial execution time due to their extensive computations and large memory footprint. Processing in-memory (PIM) and near-memory computing (NMC) are promising solutions to accelerating transformers as they offer high-compute parallelism and memory bandwidth. However, designing PIM/NMC architectures to support the complex operations and massive amounts of data that need to be moved between layers in transformer neural networks remains a challenge. We propose ARTEMIS, a mixed analog-stochastic in-DRAM accelerator for transformer models. Through employing minimal changes to the conventional DRAM arrays, ARTEMIS efficiently alleviates the costs associated with transformer model execution by supporting stochastic computing for multiplications and temporal analog accumulations using a novel in-DRAM metal-on-metal capacitor. Our analysis indicates that ARTEMIS exhibits at least $3.0\times$ speedup and $1.8\times$ lower energy compared to GPU, TPU, CPU, and state-of-the-art PIM transformer hardware accelerators.
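The stochastic-computing multiplication that ARTEMIS maps onto DRAM can be sketched in a few lines: a value in [0, 1] is encoded as a random bitstream whose probability of a 1 equals the value, and a bitwise AND of two independent streams multiplies them. This is a software illustration of the principle, not the in-DRAM implementation.

```python
# Minimal sketch of unipolar stochastic-computing multiplication.
import random

def to_bitstream(x: float, length: int) -> list[int]:
    """Encode x in [0, 1] as a bitstream with P(bit = 1) = x."""
    return [1 if random.random() < x else 0 for _ in range(length)]

def sc_multiply(x: float, y: float, length: int = 4096) -> float:
    bx, by = to_bitstream(x, length), to_bitstream(y, length)
    # AND-ing two independent unipolar streams yields P(bx & by) = x * y.
    return sum(a & b for a, b in zip(bx, by)) / length

print(f"0.6 * 0.5 ~= {sc_multiply(0.6, 0.5):.3f}")  # close to 0.30
```

Longer bitstreams trade latency for accuracy, which is why stochastic computing pairs naturally with the massive bit-level parallelism of DRAM arrays.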
Citations: 0
Latent RAGE: Randomness Assessment Using Generative Entropy Models
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3449562
Kuheli Pratihar;Rajat Subhra Chakraborty;Debdeep Mukhopadhyay
NIST’s recent review of the widely employed Special Publication (SP) 800-22 randomness testing suite has underscored several shortcomings, particularly the absence of entropy source modeling and the necessity for large sequence lengths. Motivated by this revelation, we explore low-dimensional modeling of the entropy source in random number generators (RNGs) using a variational autoencoder (VAE). This low-dimensional modeling enables the separation of strong and weak entropy sources by magnifying the deterministic effects in the latter, which are otherwise difficult to detect with conventional testing. Bits from weak-entropy RNGs with bias, correlation, or deterministic patterns are more likely to lie on a low-dimensional manifold within a high-dimensional space, in contrast to strong-entropy RNGs, such as true RNGs (TRNGs) and pseudo-RNGs (PRNGs) with uniformly distributed bits. We exploit this insight to employ a generative AI-based noninterference test (GeNI) for the first time, achieving implementation-agnostic low-dimensional modeling of all types of entropy sources. GeNI’s generative aspect uses VAEs to produce synthetic bitstreams from the latent representation of RNGs, which are subjected to a deep learning (DL)-based noninterference (NI) test evaluating the masking ability of the synthetic bitstreams. The core principle of the NI test is that if the bitstream exhibits high-quality randomness, the masked data from the two sources should be indistinguishable. GeNI facilitates a comparative analysis of low-dimensional entropy source representations across various RNGs, adeptly identifying artificial randomness in specious RNGs with deterministic patterns that otherwise pass all NIST SP 800-22 tests. Notably, GeNI achieves this with $10\times$ lower sequence lengths and $16.5\times$ faster execution time compared to the NIST test suite.
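The noninterference principle can be illustrated with a deliberately simple stand-in for GeNI's DL distinguisher: XOR-mask fixed data with bits from two RNGs and compare the masked outputs' bit frequencies. A strong RNG hides the plaintext; a biased one leaks it. The 60% bias and the frequency statistic are illustrative choices only.

```python
# Minimal sketch of the masking idea behind a noninterference test: if
# the RNG is strong, masked data is indistinguishable from random.
import random

def mask(plaintext_bits, rng_bits):
    return [p ^ r for p, r in zip(plaintext_bits, rng_bits)]

def bit_frequency(bits):
    return sum(bits) / len(bits)

n = 100_000
plaintext = [1] * n                                # worst-case fixed data
strong = [random.getrandbits(1) for _ in range(n)]
weak = [1 if random.random() < 0.6 else 0 for _ in range(n)]  # 60% bias

print("strong mask freq:", round(bit_frequency(mask(plaintext, strong)), 3))
print("weak   mask freq:", round(bit_frequency(mask(plaintext, weak)), 3))
# ~0.5 for the strong RNG; ~0.4 for the biased one, so the two sources
# are distinguishable and the weak RNG fails the noninterference check.
```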
Citations: 0
ROI-HIT: Region of Interest-Driven High-Dimensional Microarchitecture Design Space Exploration
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3443006
Xuyang Zhao;Tianning Gao;Aidong Zhao;Zhaori Bi;Changhao Yan;Fan Yang;Sheng-Guo Wang;Dian Zhou;Xuan Zeng
Exploring the design space of RISC-V processors faces significant challenges due to the vastness of the high-dimensional design space and the associated expensive simulation costs. This work proposes a region of interest (ROI)-driven method, which focuses on the promising ROIs to reduce the over-exploration on the huge design space and improve the optimization efficiency. A tree structure based on self-organizing map (SOM) networks is proposed to partition the design space into ROIs. To reduce the high dimensionality of design space, a variable selection technique based on a sensitivity matrix is developed to prune unimportant design parameters and efficiently hit the optimum inside the ROIs. Moreover, an asynchronous parallel strategy is employed to further save the time taken by simulations. Experimental results demonstrate the superiority of our proposed method, achieving improvements of up to 43.82% in performance, 33.20% in power consumption, and 11.41% in area compared to state-of-the-art methods.
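A minimal sketch of sensitivity-matrix-based variable selection follows, with a cheap quadratic objective standing in for an expensive microarchitecture simulation: estimate each parameter's effect by finite differences, then keep only the most sensitive dimensions.

```python
# Minimal sketch of sensitivity-based pruning of design parameters.
import numpy as np

def sensitivity(objective, x0, step=1e-3):
    """One row of a finite-difference sensitivity matrix at point x0."""
    base = objective(x0)
    grads = np.zeros_like(x0)
    for i in range(len(x0)):
        x = x0.copy()
        x[i] += step
        grads[i] = abs(objective(x) - base) / step
    return grads

weights = np.array([5.0, 0.01, 2.0, 0.001])      # two dims barely matter
objective = lambda x: float(np.sum(weights * x ** 2))

s = sensitivity(objective, x0=np.ones(4))
keep = np.argsort(s)[::-1][:2]                   # keep the top-2 dimensions
print("sensitivities:", np.round(s, 3), "-> keep dims", sorted(keep.tolist()))
```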
Citations: 0
Runtime Monitoring of ML-Based Scheduling Algorithms Toward Robust Domain-Specific SoCs
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3445815
A. Alper Goksoy;Alish Kanani;Satrajit Chatterjee;Umit Ogras
Machine learning (ML) algorithms are being rapidly adopted to perform dynamic resource management tasks in heterogeneous systems-on-chip. For example, ML-based task schedulers can make quick, high-quality decisions at runtime. Like any ML model, these offline-trained policies depend critically on the representative power of the training data. Hence, their performance may diminish or even catastrophically fail under unknown workloads, especially new applications. This article proposes a novel framework that continuously monitors the system to detect unforeseen scenarios using a gradient-based generalization metric called coherence. The proposed framework accurately determines whether the current policy generalizes to new inputs. If not, it incrementally trains the ML scheduler to ensure the robustness of the task-scheduling decisions. The proposed framework is evaluated thoroughly with a domain-specific SoC and six real-world applications. It can detect whether the trained scheduler generalizes to the current workload with 88.75%–98.39% accuracy. Furthermore, it enables $1.1\times$ to $14\times$ faster execution time when the scheduler is incrementally trained. Finally, overhead analysis performed on an Nvidia Jetson Xavier NX board shows that the proposed framework can run as a real-time background task.
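As a rough proxy for a gradient-based coherence signal (not necessarily the exact metric used in the article), the sketch below scores a batch by the mean pairwise cosine similarity of per-sample loss gradients of a tiny linear "policy": aligned gradients suggest familiar data, while divergent gradients flag an unfamiliar workload.

```python
# Minimal sketch of a coherence-style generalization signal.
import numpy as np

def per_sample_grads(w, X, y):
    # Squared-error loss per sample: grad_i = 2 * (x_i . w - y_i) * x_i.
    residuals = X @ w - y
    return 2.0 * residuals[:, None] * X

def coherence(grads, eps=1e-12):
    """Mean pairwise cosine similarity of per-sample gradients."""
    unit = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T
    n = len(grads)
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))

rng = np.random.default_rng(0)
w_policy = rng.normal(size=4)
X = rng.normal(size=(64, 4))
y_seen = X @ (w_policy + 0.1 * rng.normal(size=4))  # close to the policy
y_new = rng.normal(size=64)                         # unfamiliar workload
print("coherence, familiar data:  ",
      round(coherence(per_sample_grads(w_policy, X, y_seen)), 3))
print("coherence, unfamiliar data:",
      round(coherence(per_sample_grads(w_policy, X, y_new)), 3))
# The familiar batch typically scores noticeably higher, which is the
# signal a runtime monitor would threshold to trigger retraining.
```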
Citations: 0
FDPUF: Frequency-Domain PUF for Robust Authentication of Edge Devices
IF 2.7, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-06. DOI: 10.1109/TCAD.2024.3447211
Shubhra Deb Paul;Aritra Dasgupta;Swarup Bhunia
Counterfeiting, overproduction, and cloning of integrated circuits (ICs) and associated hardware have emerged as major security concerns in the modern globalized microelectronics supply chain. One way to combat these issues effectively is to deploy hardware authentication techniques that utilize physical unclonable functions (PUFs). PUFs utilize intrinsic variations in hardware that occur during the manufacturing and fabrication process to generate device-specific fingerprints or immutable signatures that cannot be replicated by counterfeits and clones. However, unavoidable factors like environmental noise and harmonics can significantly deteriorate the quality of the PUF signature. Besides, conventional PUF solutions are generally not amenable to in-field authentication of hardware, which has emerged as a critical need for Internet of Things (IoT) edge devices to detect physical attacks on them. In this article, we introduce frequency-domain PUF or FDPUF, a novel PUF that analyzes time-domain current waveforms in the frequency domain to create high-quality authentication signatures that are suitable for in-field authentication. FDPUF decomposes electrical signals into their spectral coefficients, filters out unnecessary low-energy components, reconstructs the waveforms, and generates high-quality digital fingerprints for device authentication purposes. Compared to the existing authentication mechanisms, the higher quality of the signatures through the frequency-domain analysis makes the proposed FDPUF more suitable for protecting the integrity of the edge computing hardware. We perform experimental measurements on FPGA and analyze FDPUF properties using the National Institute of Standards and Technology test suite to demonstrate that the FDPUF provides better uniqueness and robustness than its time-domain counterpart while being attractive for in-field authentication.
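The FDPUF pipeline described above can be sketched end to end, assuming a current trace has already been captured: transform to the frequency domain, zero out low-energy spectral coefficients, reconstruct the waveform, and quantize it into signature bits. The toy trace, keep ratio, and median threshold are illustrative choices, not the paper's parameters.

```python
# Minimal sketch of an FDPUF-style fingerprint from a current trace.
import numpy as np

def fdpuf_fingerprint(trace: np.ndarray, keep_ratio: float = 0.1) -> np.ndarray:
    spectrum = np.fft.rfft(trace)
    energy = np.abs(spectrum)
    # Filter: zero out everything below the energy cutoff.
    cutoff = np.quantile(energy, 1.0 - keep_ratio)
    spectrum[energy < cutoff] = 0.0
    filtered = np.fft.irfft(spectrum, n=len(trace))  # reconstructed waveform
    # Quantize the reconstruction against its median into signature bits.
    return (filtered > np.median(filtered)).astype(np.uint8)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 1024, endpoint=False)
# Device-specific harmonics plus measurement noise stand in for a trace.
trace = np.sin(2 * np.pi * 50 * t) + 0.3 * np.sin(2 * np.pi * 120 * t)
trace += 0.05 * rng.normal(size=t.size)
print(fdpuf_fingerprint(trace)[:32])
```

Dropping low-energy coefficients is what lends the signature its noise robustness: small measurement noise spreads across many weak spectral bins, while the device-specific structure concentrates in the few strong ones that survive the filter.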
Citations: 0