A half-select (HS) disturb-free 11T (HF11T) static random access memory (SRAM) cell with low power, improved stability, and high speed is presented in this paper. The proposed SRAM cell supports bit-interleaving architectures, which enhances soft-error immunity. The proposed HF11T cell is compared with other state-of-the-art designs: a single-ended HS-free 11T (SEHF11T), a shared-pass-gate 11T (SPG11T), a data-dependent stack PMOS switching 10T (DSPS10T), a single-ended half-select robust 12T (HSR12T), and an 11T SRAM cell. It exhibits 4.85×/9.19× lower read delay (TRA) and write delay (TWA), respectively, than the other considered SRAM cells, and achieves 1.07×/1.02× better read and write stability, respectively. It shows maximum reductions of 1.68×, 4.58×, 94.72×, 9×, and 145× in leakage power, read power, write power, read power-delay product (PDP), and write PDP, respectively, relative to the considered cells. In addition, the proposed HF11T cell achieves a 10.14× higher Ion/Ioff ratio than the compared cells. These improvements come with a trade-off: 1.13× higher TRA compared to SPG11T. Simulations are performed in Cadence Virtuoso with 45 nm CMOS technology at a supply voltage (VDD) of 0.6 V.
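For readers unfamiliar with the reported figure of merit, the power-delay product is simply the product of access power and the corresponding access delay. Below is a minimal sketch of how such PDP figures and improvement ratios can be computed from simulated power/delay numbers; the values used are placeholders, not data from the paper.

```python
# Minimal sketch: computing power-delay product (PDP) and improvement ratios
# from simulated access power and delay. All numbers below are illustrative
# placeholders, not results reported in the paper.

def pdp(power_w, delay_s):
    """Power-delay product in joules: energy spent per access."""
    return power_w * delay_s

# Hypothetical read power (W) and read delay (s) for two cells.
proposed = {"read_power": 1.2e-6, "read_delay": 0.9e-9}
baseline = {"read_power": 3.0e-6, "read_delay": 2.1e-9}

pdp_proposed = pdp(proposed["read_power"], proposed["read_delay"])
pdp_baseline = pdp(baseline["read_power"], baseline["read_delay"])

# Improvement ratio, reported in the abstract as "N x better/less".
print(f"read PDP improvement: {pdp_baseline / pdp_proposed:.2f}x")
```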
{"title":"A Single Bitline Highly Stable, Low Power With High Speed Half-Select Disturb Free 11T SRAM Cell","authors":"Lokesh Soni, Neeta Pandey","doi":"10.1145/3653675","DOIUrl":"https://doi.org/10.1145/3653675","url":null,"abstract":"<p>A half-select disturb-free 11T (HF11T) static random access memory (SRAM) cell with low power, better stability and high speed is presented in this paper. The proposed SRAM cell works well with bit-interleaving design, which enhances soft-error immunity. A comparison of the proposed HF11T cell with other cutting-edge designs such as single-ended HS free 11T (SEHF11T), a shared-pass-gate 11T (SPG11T), data-dependent stack PMOS switching 10T (DSPS10T), a single-ended half-selected robust 12T (HSR12T), and 11T SRAM cells has been made. It exhibits 4.85 × /9.19 × less read delay (<i>T<sub>RA</sub></i>) and write delay (<i>T<sub>WA</sub></i>), respectively as compared to other considered SRAM cells. It achieves 1.07 × /1.02 × better read and write stability, respectively than the considered SRAM cells. It shows maximum reduction of 1.68 × /4.58 × /94.72 × /9 × /145 × leakage power, read power, write power consumption, read power delay product (PDP) and write PDP respectively, than the considered SRAM cells. In addition, the proposed HF11T cell achieves 10.14 × higher <i>I<sub>on</sub></i>/<i>I<sub>off</sub></i> ratio than the other compared cells. These improvements come with a trade-off, resulting in 1.13 × more <i>T<sub>RA</sub></i> compared to SPG11T. The simulation is performed with Cadence Virtuoso 45nm CMOS technology at supply voltage (<i>V<sub>DD</sub></i>) of 0.6 V.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"59 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiseung Kim, Hyunsei Lee, Mohsen Imani, Yeseong Kim
Hyperdimensional computing (HDC) is a computing paradigm inspired by the mechanisms of human memory, characterizing data through high-dimensional vector representations known as hypervectors. Recent advancements in HDC have explored its potential as a learning model, leveraging its straightforward arithmetic and high efficiency. Traditional HDC frameworks are hampered by two primary static elements: randomly generated encoders and fixed learning rates. These static components significantly limit model adaptability and accuracy. The static, randomly generated encoders, while ensuring high-dimensional representation, fail to adapt to evolving data relationships, thereby constraining the model's ability to accurately capture and learn from complex patterns. Similarly, the fixed learning rate does not account for the varying needs of the training process over time, hindering efficient convergence and optimal performance. This paper introduces TrainableHD, a novel HDC framework that enables dynamic training of the randomly generated encoder based on feedback from the learning data, thereby addressing the static nature of conventional HDC encoders. TrainableHD also enhances training performance by incorporating adaptive optimizer algorithms in learning the hypervectors. We further refine TrainableHD with effective quantization to enhance efficiency, allowing the inference phase to execute on low-precision accelerators. Our evaluations demonstrate that TrainableHD significantly improves HDC accuracy by up to 27.99% (averaging 7.02%) without additional computational costs during inference, achieving a performance level comparable to state-of-the-art deep learning models. Furthermore, TrainableHD is optimized for execution speed and energy efficiency. Compared to deep learning on a low-power GPU platform such as the NVIDIA Jetson Xavier, TrainableHD is 56.4 times faster and 73 times more energy efficient. This efficiency is further augmented through the use of Encoder Interval Training (EIT) and adaptive optimizer algorithms, enhancing the training process without compromising the model's accuracy.
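As a rough illustration of the idea of a trainable encoder (a minimal sketch under simplifying assumptions, not the TrainableHD implementation): a random projection matrix maps input features to a hypervector, class hypervectors are learned, and the encoder itself receives gradient-style updates instead of staying fixed. All names, dimensions, and learning rates below are illustrative.

```python
import numpy as np

# Minimal sketch of HDC with a *trainable* encoder (illustrative only; not the
# TrainableHD implementation). A random projection encodes features into a
# D-dimensional hypervector; class hypervectors and the encoder are both updated.

rng = np.random.default_rng(0)
F, D, C = 64, 4096, 10                              # feature dim, hypervector dim, #classes
W_enc = rng.standard_normal((F, D)) / np.sqrt(F)    # conventionally fixed; here trainable
class_hv = np.zeros((C, D))

def encode(x):
    return np.tanh(x @ W_enc)       # smooth nonlinearity so updates can flow to W_enc

def train_step(x, y, lr_model=0.05, lr_enc=0.005):
    global W_enc
    h = encode(x)
    scores = class_hv @ h
    pred = int(np.argmax(scores))
    if pred != y:
        # Classic HDC update: reinforce the correct class, penalize the wrong one.
        class_hv[y]    += lr_model * h
        class_hv[pred] -= lr_model * h
        # Encoder update: push the encoding toward the correct class hypervector.
        err = class_hv[y] - class_hv[pred]      # direction that fixes the mistake
        grad_h = err * (1.0 - h**2)             # back-propagate through tanh
        W_enc += lr_enc * np.outer(x, grad_h)
    return pred

x, y = rng.standard_normal(F), 3
train_step(x, y)
```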
{"title":"Advancing Hyperdimensional Computing Based on Trainable Encoding and Adaptive Training for Efficient and Accurate Learning","authors":"Jiseung Kim, Hyunsei Lee, Mohsen Imani, Yeseong Kim","doi":"10.1145/3665891","DOIUrl":"https://doi.org/10.1145/3665891","url":null,"abstract":"<p>Hyperdimensional computing (HDC) is a computing paradigm inspired by the mechanisms of human memory, characterizing data through high-dimensional vector representations, known as hypervectors. Recent advancements in HDC have explored its potential as a learning model, leveraging its straightforward arithmetic and high efficiency. The traditional HDC frameworks are hampered by two primary static elements: randomly generated encoders and fixed learning rates. These static components significantly limit model adaptability and accuracy. The static, randomly generated encoders, while ensuring high-dimensional representation, fail to adapt to evolving data relationships, thereby constraining the model’s ability to accurately capture and learn from complex patterns. Similarly, the fixed nature of the learning rate does not account for the varying needs of the training process over time, hindering efficient convergence and optimal performance. This paper introduces (mathsf {TrainableHD} ), a novel HDC framework that enables dynamic training of the randomly generated encoder depending on the feedback of the learning data, thereby addressing the static nature of conventional HDC encoders. (mathsf {TrainableHD} ) also enhances the training performance by incorporating adaptive optimizer algorithms in learning the hypervectors. We further refine (mathsf {TrainableHD} ) with effective quantization to enhance efficiency, allowing the execution of the inference phase in low-precision accelerators. Our evaluations demonstrate that (mathsf {TrainableHD} ) significantly improves HDC accuracy by up to 27.99% (averaging 7.02%) without additional computational costs during inference, achieving a performance level comparable to state-of-the-art deep learning models. Furthermore, (mathsf {TrainableHD} ) is optimized for execution speed and energy efficiency. Compared to deep learning on a low-power GPU platform like NVIDIA Jetson Xavier, (mathsf {TrainableHD} ) is 56.4 times faster and 73 times more energy efficient. This efficiency is further augmented through the use of Encoder Interval Training (EIT) and adaptive optimizer algorithms, enhancing the training process without compromising the model’s accuracy.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"21 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141253460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew Kahng, Joon Kyung Kim, Sean Kinzer, Sayak Kundu, Rohan Mahapatra, Susmita Dey Manasi, Sachin Sapatnekar, Zhiang Wang, Ziqing Zeng
Parameterizable machine learning (ML) accelerators are the product of recent breakthroughs in ML. To fully enable their design space exploration (DSE), we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It adopts a unified approach that combines power, performance, and area (PPA) analysis with frontend performance simulation, thereby achieving a realistic estimation of both backend PPA and system metrics such as runtime and energy. In addition, our framework includes a fully automated DSE technique, which optimizes backend and system metrics through an automated search of architectural and backend parameters. Experimental studies show that our approach consistently predicts backend PPA and system metrics with an average 7% or less prediction error for the ASIC implementation of two deep learning accelerator platforms, VTA and VeriGOOD-ML, in both a commercial 12 nm process and a research-oriented 45 nm process.
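The general pattern behind learning-based PPA prediction plus automated design space exploration can be sketched as a surrogate-model loop: train a predictor on a handful of fully evaluated design points, then search the parameter space using predictions in place of expensive backend runs. The sketch below is generic and not the paper's framework; the parameter names (pe_rows, pe_cols, buf_kb), the synthetic cost function, and the choice of a random-forest surrogate are all illustrative assumptions.

```python
import random
from sklearn.ensemble import RandomForestRegressor

# Generic sketch of surrogate-based design-space exploration (DSE):
# train a predictor on a few labelled (parameters -> cost) samples, then
# search the space using predictions instead of full physical-design runs.
# Illustrative only; not the paper's framework.

SPACE = {"pe_rows": [4, 8, 16], "pe_cols": [4, 8, 16], "buf_kb": [32, 64, 128]}

def sample_point():
    return {k: random.choice(v) for k, v in SPACE.items()}

def run_backend_flow(cfg):
    # Stand-in for a real synthesis/place-and-route + simulation flow that
    # returns an energy-delay-like cost. Purely synthetic for this sketch.
    area = cfg["pe_rows"] * cfg["pe_cols"] * 1.0 + cfg["buf_kb"] * 0.05
    runtime = 1e4 / (cfg["pe_rows"] * cfg["pe_cols"]) + 100.0 / cfg["buf_kb"]
    return area * runtime

def to_vec(cfg):
    return [cfg["pe_rows"], cfg["pe_cols"], cfg["buf_kb"]]

train = [sample_point() for _ in range(20)]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit([to_vec(c) for c in train], [run_backend_flow(c) for c in train])

candidates = [sample_point() for _ in range(500)]
best = min(candidates, key=lambda c: model.predict([to_vec(c)])[0])
print("predicted-best configuration:", best)
```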
{"title":"An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators","authors":"Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew Kahng, Joon Kyung Kim, Sean Kinzer, Sayak Kundu, Rohan Mahapatra, Susmita Dey Manasi, Sachin Sapatnekar, Zhiang Wang, Ziqing Zeng","doi":"10.1145/3664652","DOIUrl":"https://doi.org/10.1145/3664652","url":null,"abstract":"<p>Parameterizable machine learning (ML) accelerators are the product of recent breakthroughs in ML. To fully enable their design space exploration (DSE), we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It adopts a unified approach that combines power, performance, and area (PPA) analysis with frontend performance simulation, thereby achieving a realistic estimation of both backend PPA and system metrics such as runtime and energy. In addition, our framework includes a fully automated DSE technique, which optimizes backend and system metrics through an automated search of architectural and backend parameters. Experimental studies show that our approach consistently predicts backend PPA and system metrics with an average 7% or less prediction error for the ASIC implementation of two deep learning accelerator platforms, VTA and VeriGOOD-ML, in both a commercial 12 nm process and a research-oriented 45 nm process.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"2674 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140929019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chukwufumnanya Ogbogu, Biresh K. Joardar, Krishnendu Chakrabarty, Jana Doppa, Partha Pratim Pande
Graph Neural Networks (GNNs) have achieved remarkable accuracy in cognitive tasks such as predictive analytics on graph-structured data. Hence, they have become very popular in diverse real-world applications. However, GNN training with large real-world graph datasets in edge-computing scenarios is both memory- and compute-intensive. Traditional computing platforms such as CPUs and GPUs do not provide the energy efficiency and low latency required in edge intelligence applications due to their limited memory bandwidth. Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have been proposed as suitable candidates for accelerating AI applications at the edge, including GNN training. However, ReRAM-based PIM architectures suffer from low reliability due to their limited endurance, and from low performance when used for GNN training in real-world scenarios with large graphs. In this work, we propose a learning-for-data-pruning framework, which leverages a trained Binary Graph Classifier (BGC) to reduce the size of the input data graph by pruning subgraphs early in the training process, thereby accelerating GNN training on ReRAM-based architectures. The proposed light-weight BGC model reduces the amount of redundant information in input graphs to speed up the overall training process, improves the reliability of the ReRAM-based PIM accelerator, and reduces the overall training cost. This enables fast, energy-efficient, and reliable GNN training on ReRAM-based architectures. Our experimental results demonstrate that using this learning-for-data-pruning framework, we can accelerate GNN training and improve the reliability of ReRAM-based PIM architectures by up to 1.6×, and reduce the overall training cost by 100× compared to state-of-the-art data pruning techniques.
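Conceptually, the data-pruning step amounts to scoring candidate subgraphs with a light-weight binary classifier and dropping the ones judged uninformative before the expensive GNN training begins. The sketch below illustrates that flow with a stand-in scorer; the feature set, the logistic scorer, and the 0.5 threshold are illustrative assumptions, not the paper's BGC.

```python
import numpy as np

# Illustrative sketch of classifier-guided data pruning for GNN training:
# a cheap binary scorer rates each candidate subgraph, and low-scoring
# subgraphs are dropped before training. The features and the logistic
# scorer below are placeholders, not the paper's trained BGC.

rng = np.random.default_rng(1)

def subgraph_features(sub):
    """Cheap summary features of a subgraph: node count, edge count, avg degree."""
    n, e = sub["num_nodes"], sub["num_edges"]
    return np.array([n, e, 2.0 * e / max(n, 1)])

def bgc_score(sub, w, b):
    """Stand-in binary graph classifier: logistic score in [0, 1]."""
    z = subgraph_features(sub) @ w + b
    return 1.0 / (1.0 + np.exp(-z))

subgraphs = [{"num_nodes": int(rng.integers(10, 500)),
              "num_edges": int(rng.integers(20, 5000))} for _ in range(100)]

w, b = np.array([0.001, 0.0005, 0.05]), -1.0    # pretend-trained parameters
keep = [s for s in subgraphs if bgc_score(s, w, b) > 0.5]

print(f"kept {len(keep)}/{len(subgraphs)} subgraphs for GNN training")
```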
{"title":"Data Pruning-enabled High Performance and Reliable Graph Neural Network Training on ReRAM-based Processing-in-Memory Accelerators","authors":"Chukwufumnanya Ogbogu, Biresh K. Joardar, Krishnendu Chakrabarty, Jana Doppa, Partha Pratim Pande","doi":"10.1145/3656171","DOIUrl":"https://doi.org/10.1145/3656171","url":null,"abstract":"<p>Graph Neural Networks (GNNs) have achieved remarkable accuracy in cognitive tasks such as predictive analytics on graph-structured data. Hence, they have become very popular in diverse real-world applications. However, GNN training with large real-world graph datasets in edge-computing scenarios is both memory- and compute-intensive. Traditional computing platforms such as CPUs and GPUs do not provide the energy efficiency and low latency required in edge intelligence applications due to their limited memory bandwidth. Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have been proposed as suitable candidates for accelerating AI applications at the edge, including GNN training. However, ReRAM-based PIM architectures suffer from low reliability due to their limited endurance, and low performance when they are used for GNN training in real-world scenarios with large graphs. In this work, we propose a learning-for-data-pruning framework, which leverages a trained Binary Graph Classifier (BGC) to reduce the size of the input data graph by pruning subgraphs early in the training process to accelerate the GNN training process on ReRAM-based architectures. The proposed light-weight BGC model reduces the amount of redundant information in input graph(s) to speed up the overall training process, improves the reliability of the ReRAM-based PIM accelerator, and reduces the overall training cost. This enables fast, energy-efficient, and reliable GNN training on ReRAM-based architectures. Our experimental results demonstrate that using this learning for data pruning framework, we can accelerate GNN training and improve the reliability of ReRAM-based PIM architectures by up to 1.6 ×, and reduce the overall training cost by 100 × compared to state-of-the-art data pruning techniques.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"6 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rijoy Mukherjee, Archisman Ghosh, Rajat Subhra Chakraborty
Modern integrated circuit (IC) design incorporates proprietary computer-aided design (CAD) software and the integration of third-party hardware intellectual property (IP) cores. Subsequently, fabrication takes place in untrusted offshore foundries, which raises concerns regarding security and reliability. Hardware Trojans (HTs) are difficult-to-detect malicious modifications to an IC that constitute a major threat; if undetected prior to deployment, they can lead to catastrophic functional failures or the unauthorized leakage of confidential information. Apart from the risks posed by rogue human agents, recent studies have shown that high-level synthesis (HLS) CAD software can serve as a potent attack vector for inserting HTs. In this paper, we introduce a novel automated attack vector, which we term "HLS-IRT", that inserts HTs into the register-transfer-level (RTL) description of circuits generated during an HLS-based IC design flow by directly modifying the compiler-generated intermediate representation (IR) corresponding to the design. We demonstrate the attack on several hardware accelerators spanning different application domains, using a design and implementation flow based on the open-source Bambu HLS software and Xilinx FPGAs. Our results show that the resulting HTs are surreptitious and effective while incurring minimal design overhead. We also propose a novel detection scheme for HLS-IRT, since existing techniques are found to be inadequate for detecting the proposed HTs.
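As a very rough illustration of why the compiler-generated IR is a sensitive trust boundary (a generic sketch, not the HLS-IRT attack nor the paper's detection scheme): any silent change to the IR between compiler stages propagates into the generated RTL, so one simple sanity measure is to fingerprint the IR as it leaves each stage and compare against an independently recomputed fingerprint. The file names below are hypothetical.

```python
import hashlib
from pathlib import Path

# Generic sketch: fingerprint the compiler-generated IR at successive HLS
# stages so that an unexpected modification between stages is detectable.
# This illustrates the trust boundary only; it is not the paper's HLS-IRT
# attack or its proposed detection scheme. File names are hypothetical.

def ir_digest(path: Path) -> str:
    """SHA-256 over the textual IR, ignoring blank lines and comments."""
    lines = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith(";"):   # ';' starts a comment in LLVM-style IR
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()

def check_stage(before: Path, after: Path) -> bool:
    """Return True if the IR is unchanged across a stage expected to be a no-op."""
    return ir_digest(before) == ir_digest(after)

# Hypothetical usage: the same IR captured before and after a stage that
# should not alter it for this design.
# if not check_stage(Path("kernel.pre.ll"), Path("kernel.post.ll")):
#     raise RuntimeError("IR modified unexpectedly between HLS stages")
```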
{"title":"HLS-IRT: Hardware Trojan Insertion through Modification of Intermediate Representation During High-Level Synthesis","authors":"Rijoy Mukherjee, Archisman Ghosh, Rajat Subhra Chakraborty","doi":"10.1145/3663477","DOIUrl":"https://doi.org/10.1145/3663477","url":null,"abstract":"<p>Modern integrated circuit (IC) design incorporates the usage of proprietary computer-aided design (CAD) software and integration of third-party hardware intellectual property (IP) cores. Subsequently, the fabrication process for the design takes place in untrustworthy offshore foundries that raises concerns regarding security and reliability. Hardware Trojans (HTs) are difficult to detect malicious modifications to IC that constitute a major threat, which if undetected prior to deployment, can lead to catastrophic functional failures or the unauthorized leakage of confidential information. Apart from the risks posed by rogue human agents, recent studies have shown that high-level synthesis (HLS) CAD software can serve as a potent attack vector for inserting Hardware Trojans (HTs). In this paper, we introduce a novel automated attack vector, which we term “HLS-IRT”, by inserting HT in the register transfer logic (RTL) description of circuits generated during a HLS based IC design flow, by directly modifying the compiler-generated intermediate representation (IR) corresponding to the design. We demonstrate the attack using a design and implementation flow based on the open-source <i>Bambu</i> HLS software and <i>Xilinx</i> FPGA, on several hardware accelerators spanning different application domains. Our results show that the resulting HTs are surreptitious and effective, while incurring minimal design overhead. We also propose a novel detection scheme for HLS-IRT, since existing techniques are found to be inadequate to detect the proposed HTs.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"45 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-quality passive devices are becoming increasingly important for the development of mobile devices and telecommunications, but obtaining such devices through simulation and analysis of electromagnetic (EM) behavior is time-consuming. To address this challenge, artificial neural network (ANN) models have emerged as an effective tool for modeling EM behavior, with NeuroTF being a representative example. However, these models are limited by the specific form of the transfer function, leading to discontinuity issues and high sensitivities. Moreover, previous methods have overlooked the physical relationship between distributed parameters, resulting in unacceptable numeric errors in the conversion results. To overcome these limitations, we propose two different neural network architectures: DeepOTF and ComplexTF. DeepOTF is a data-driven deep operator network that automatically learns feasible transfer functions for different geometric parameters. ComplexTF utilizes complex-valued neural networks to fit feasible transfer functions for different geometric parameters in the complex domain while maintaining causality and passivity. Our approach also employs an Equations-constraint Learning scheme to ensure strict consistency of predictions and a dynamic weighting strategy to balance optimization objectives. Experimental results demonstrate that our framework outperforms baseline methods, achieving up to 1700× higher accuracy.
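For context on what a "transfer function" model of EM behavior looks like, here is a minimal sketch of the common pole-residue form; a learned model (such as the networks described above) would map geometric parameters to the poles and residues, whereas here the coefficients are just illustrative constants. This is a generic sketch, not DeepOTF or ComplexTF.

```python
import numpy as np

# Minimal sketch of a pole-residue transfer function, a common parametric form
# for fitting EM frequency responses: H(s) = sum_k r_k / (s - p_k) + d,
# evaluated at s = j*2*pi*f. Coefficients below are illustrative only.

def transfer_function(freq_hz, poles, residues, d=0.0):
    s = 1j * 2 * np.pi * np.asarray(freq_hz)
    H = np.full_like(s, d, dtype=complex)
    for p, r in zip(poles, residues):
        H += r / (s - p)
    return H

# Illustrative complex-conjugate pole/residue pair (stable: negative real parts),
# which keeps the corresponding time-domain response real-valued.
poles    = np.array([-1e9 + 2j * np.pi * 5e9, -1e9 - 2j * np.pi * 5e9])
residues = np.array([ 5e8 + 1j * 1e9,          5e8 - 1j * 1e9])

freqs = np.linspace(1e9, 10e9, 5)
H = transfer_function(freqs, poles, residues)
print(np.abs(H))    # magnitude response at the sampled frequencies
```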
{"title":"DeepOTF: Learning Equations-constrained Prediction for Electromagnetic Behavior","authors":"Peng Xu, Siyuan XU, Tinghuan Chen, Guojin Chen, Tsungyi Ho, Bei Yu","doi":"10.1145/3663476","DOIUrl":"https://doi.org/10.1145/3663476","url":null,"abstract":"<p>High-quality passive devices are becoming increasingly important for the development of mobile devices and telecommunications, but obtaining such devices through simulation and analysis of electromagnetic (EM) behavior is time-consuming. To address this challenge, artificial neural network (ANN) models have emerged as an effective tool for modeling EM behavior, with NeuroTF being a representative example. However, these models are limited by the specific form of the transfer function, leading to discontinuity issues and high sensitivities. Moreover, previous methods have overlooked the physical relationship between distributed parameters, resulting in unacceptable numeric errors in the conversion results. To overcome these limitations, we propose two different neural network architectures: DeepOTF and ComplexTF. DeepOTF is a data-driven deep operator network for automatically learning feasible transfer functions for different geometric parameters. ComplexTF utilizes complex-valued neural networks to fit feasible transfer functions for different geometric parameters in the complex domain while maintaining causality and passivity. Our approach also employs an Equations-constraint Learning scheme to ensure the strict consistency of predictions and a dynamic weighting strategy to balance optimization objectives. The experimental results demonstrate that our framework shows superior performance than baseline methods, achieving up to 1700 × higher accuracy. </p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"53 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fault attacks pose a potent threat to modern cryptographic implementations, particularly those used in physically approachable embedded devices in IoT environments. Information security in such resource-constrained devices is ensured using lightweight ciphers, where combinational circuit implementations of the SBox are preferable to look-up tables (LUTs) as they are more efficient in terms of area, power, and memory requirements. Most existing fault analysis techniques focus on fault injection in memory cells and registers. Recently, a novel fault model and analysis technique, namely Semi-Permanent Stuck-At (SPSA) fault analysis, has been proposed to evaluate the security of ciphers with combinational circuit implementations of the substitution-layer elements, the SBoxes. In this work, we propose optimized techniques to recover the key from such implementations of lightweight ciphers using a minimal number of ciphertexts. Based on the proposed techniques, a key recovery attack on Elephant AEAD, a finalist of the NIST lightweight cryptography (NIST-LWC) standardization process, is presented. The proposed key recovery attack is validated on two versions of the Elephant cipher. The proposed fault analysis approach recovered the secret key within 85–240 ciphertexts, calculated over 1000 attack instances. To the best of our knowledge, this is the first work on fault analysis attacks on the Elephant scheme. Furthermore, an optimized combinational circuit implementation of the Spongent SBox (the SBox used in the Elephant cipher) is proposed, having a smaller gate count than the optimized implementation reported in the literature. The proposed fault analysis techniques are validated on primary and optimized versions of the Spongent SBox through Verilog simulations. Further, we pinpoint SPSA hotspots in the SBox architecture of the lightweight GIFT cipher. We observe that the GIFT SBox exhibits resilience to the proposed SPSA fault analysis technique under the single-fault adversarial model. However, eight SPSA fault patterns reduce the nonlinearity of the SBox to zero, rendering it vulnerable to linear cryptanalysis. In conclusion, SPSA faults may adversely affect the cryptographic properties of an SBox, thereby leading to trivial key recovery. The GIFT cipher is used as an example to focus on two aspects: i) its SBox construction is resilient to the proposed SPSA analysis, motivating the characterization of such constructions for SPSA resilience, and ii) an SBox that is resilient to the proposed SPSA analysis may still exhibit vulnerabilities to other classical analysis techniques when subjected to SPSA faults. Our work reports new fault analysis vulnerabilities in combinational circuit implementations of cryptographic protocols.
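Since the argument hinges on the nonlinearity of a 4-bit SBox dropping to zero under certain fault patterns, a small worked check of how nonlinearity is computed may help. This is the standard Walsh-spectrum calculation; the GIFT SBox values below are taken from the public GIFT specification and should be verified against it.

```python
# Nonlinearity of a 4-bit SBox via its Walsh spectrum:
# NL(S) = 2^(n-1) - max_{a, b != 0} |W(a, b)| / 2, where
# W(a, b) = sum_x (-1)^(b.S(x) XOR a.x).
# A nonlinearity of 0 means some component function is affine, which is what
# makes a faulted SBox vulnerable to linear cryptanalysis.

GIFT_SBOX = [0x1, 0xA, 0x4, 0xC, 0x6, 0xF, 0x3, 0x9,
             0x2, 0xD, 0xB, 0x7, 0x5, 0x0, 0x8, 0xE]   # per the GIFT specification

def dot(a, b):
    """Parity of the bitwise AND, i.e. the GF(2) inner product."""
    return bin(a & b).count("1") & 1

def nonlinearity(sbox, n=4):
    worst = 0
    for a in range(2 ** n):
        for b in range(1, 2 ** n):          # b = 0 gives the trivial component
            walsh = sum((-1) ** (dot(b, sbox[x]) ^ dot(a, x)) for x in range(2 ** n))
            worst = max(worst, abs(walsh))
    return 2 ** (n - 1) - worst // 2

print(nonlinearity(GIFT_SBOX))   # expected 4 for an optimal 4-bit SBox; 0 means an affine component exists
```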
{"title":"Semi-Permanent Stuck-At Fault injection attacks on Elephant and GIFT lightweight ciphers","authors":"Priyanka Joshi, Bodhisatwa Mazumdar","doi":"10.1145/3662734","DOIUrl":"https://doi.org/10.1145/3662734","url":null,"abstract":"<p>Fault attacks pose a potent threat to modern cryptographic implementations, particularly those used in physically approachable embedded devices in IoT environments. Information security in such resource-constrained devices is ensured using lightweight ciphers, where combinational circuit implementations of SBox are preferable over look-up tables (LUT) as they are more efficient regarding area, power, and memory requirements. Most existing fault analysis techniques focus on fault injection in memory cells and registers. Recently, a novel fault model and analysis technique, namely <i>Semi-Permanent Stuck-At</i> (SPSA) fault analysis, has been proposed to evaluate the security of ciphers with combinational circuit implementation of <i>Substitution layer</i> elements, SBox. In this work, we propose optimized techniques to recover the key in a minimum number of ciphertexts in such implementations of lightweight ciphers. Based on the proposed techniques, a key recovery attack on the NIST lightweight cryptography (NIST-LWC) standardization process finalist, <monospace>Elephant</monospace> AEAD, has been proposed. The proposed key recovery attack is validated on two versions of <monospace>Elephant</monospace> cipher. The proposed fault analysis approach recovered the secret key within 85 − 240 ciphertexts, calculated over 1000 attack instances. To the best of our knowledge, this is the first work on fault analysis attacks on the <monospace>Elephant</monospace> scheme. Furthermore, an optimized combinational circuit implementation of <i>Spongent</i> SBox (SBox used in <monospace>Elephant</monospace> cipher) is proposed, having a smaller gate count than the optimized implementation reported in the literature. The proposed fault analysis techniques are validated on primary and optimized versions of <i>Spongent</i> SBox through Verilog simulations. Further, we pinpoint SPSA hotspots in the lightweight <monospace>GIFT</monospace> cipher SBox architecture. We observe that <monospace>GIFT</monospace> SBox exhibits resilience towards the proposed SPSA fault analysis technique under the single fault adversarial model. However, <i>eight</i> SPSA fault patterns reduce the nonlinearity of the SBox to zero, rendering it vulnerable to linear cryptanalysis. Conclusively, SPSA faults may adversely affect the cryptographic properties of an SBox, thereby leading to trivial key recovery. The <monospace>GIFT</monospace> cipher is used as an example to focus on two aspects: i) its SBox construction is resilient to the proposed SPSA analysis and therefore characterizing such constructions for SPSA resilience and, ii) an SBox even though resilient to the proposed SPSA analysis, may exhibit vulnerabilities towards other classical analysis techniques when subjected to SPSA faults. 
Our work reports new vulnerabilities in fault analysis in the combinational circuit implementations of cryptographic protocols.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"1 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140810874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent years have seen an increasing trend toward designing AI accelerators together with the rest of the system, including CPUs and the memory hierarchy. This trend calls for high-quality simulators or analytical models that enable such co-exploration. Currently, the majority of such exploration is supported by AI accelerator analytical models. However, such models usually overlook the non-trivial impact of shared-resource congestion, non-ideal hardware utilization, and non-zero CPU scheduler overhead, which can only be modeled by cycle-level simulators. At the same time, most simulators with full-stack toolchains are proprietary to corporations, and the few open-source simulators suffer from either weak compilers or limited modeling scope. This work resolves these issues by proposing a compilation and simulation flow to run arbitrary Caffe neural network models on the NVIDIA Deep Learning Accelerator (NVDLA) with gem5, a cycle-level simulator, and by adding further building blocks, including scratchpad allocation, multi-accelerator scheduling, tensor-level prefetching mechanisms, and a DMA-aided embedded buffer, to map workloads to multiple NVDLAs. The proposed framework has been tested and verified on a set of convolutional neural networks, showcasing its capability to model complex buffer management strategies, scheduling policies, and hardware architectures. As a case study of this framework, we demonstrate the importance of adopting different buffering strategies for activation and weight tensors in AI accelerators to achieve remarkable speedup.
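To make the buffering trade-off concrete (a back-of-the-envelope model, not the framework's simulator): with single buffering each tile's DMA load and computation serialize, while with double (ping-pong) buffering the next tile's load overlaps the current tile's computation. The per-tile latencies below are made-up numbers.

```python
# Back-of-the-envelope latency model contrasting single vs. double (ping-pong)
# buffering of tiles in an accelerator scratchpad. Per-tile latencies are
# made-up numbers; a cycle-level simulator like the proposed framework is what
# captures the real contention and scheduler effects.

def single_buffer_latency(load_cycles, compute_cycles, num_tiles):
    # Load and compute serialize for every tile.
    return num_tiles * (load_cycles + compute_cycles)

def double_buffer_latency(load_cycles, compute_cycles, num_tiles):
    # First load is exposed; afterwards load(i+1) overlaps compute(i).
    steady = (num_tiles - 1) * max(load_cycles, compute_cycles)
    return load_cycles + steady + compute_cycles

load, compute, tiles = 1200, 1500, 64
t_single = single_buffer_latency(load, compute, tiles)
t_double = double_buffer_latency(load, compute, tiles)
print(f"single: {t_single} cycles, double: {t_double} cycles, "
      f"speedup: {t_single / t_double:.2f}x")
```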
{"title":"gem5-NVDLA: A Simulation Framework for Compiling, Scheduling and Architecture Evaluation on AI System-on-Chips","authors":"Chengtao Lai, Wei Zhang","doi":"10.1145/3661997","DOIUrl":"https://doi.org/10.1145/3661997","url":null,"abstract":"<p>Recent years have seen an increasing trend in designing AI accelerators together with the rest of the system, including CPUs and memory hierarchy. This trend calls for high-quality simulators or analytical models that enable such kind of co-exploration. Currently, the majority of such exploration is supported by AI accelerator analytical models. But such models usually overlook the non-trivial impact of congestion of shared resources, non-ideal hardware utilization and non-zero CPU scheduler overhead, which could only be modeled by cycle-level simulators. However, most simulators with full-stack toolchains are proprietary to corporations, and the few open-source simulators are suffering from either weak compilers or limited space of modeling. This framework resolves these issues by proposing a compilation and simulation flow to run arbitrary Caffe neural network models on the NVIDIA Deep Learning Accelerator (NVDLA) with gem5, a cycle-level simulator, and by adding more building blocks including scratchpad allocation, multi-accelerator scheduling, tensor-level prefetching mechanisms and a DMA-aided embedded buffer to map workload to multiple NVDLAs. The proposed framework has been tested and verified on a set of convolution neural networks, showcasing the capability of modeling complex buffer management strategies, scheduling policies and hardware architectures. As a case study of this framework, we demonstrate the importance of adopting different buffering strategies for activation and weight tensors in AI accelerators to acquire remarkable speedup.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"31 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140810870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jian-De Li, Sying-Jyan Wang, Katherine Shu-Min Li, Tsung-Yi Ho
Paper-based digital microfluidic biochip (PB-DMFB) technology provides a promising solution for many biochemical applications. However, the PB-DMFB manufacturing process may suffer from security threats; for example, a Trojan insertion attack may alter the functionality of a PB-DMFB. To ensure correct functionality, we propose a watermarking scheme that hides information in the PB-DMFB layout, allowing users to check design integrity and authenticate the source of the PB-DMFB design. The proposed method therefore serves as a countermeasure against Trojan insertion attacks in addition to providing proof of authorship.
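As a generic illustration of layout watermarking (a hypothetical construction, not the paper's PB-DMFB scheme): a keyed hash of the design identity yields a bit string, each bit is embedded in an otherwise free layout decision with two functionally equivalent options, and verification recomputes the bits and compares them against the layout.

```python
import hashlib, hmac

# Generic sketch of watermarking via constrained design choices (illustrative
# only; this is not the paper's PB-DMFB watermarking scheme). A keyed hash of
# the design identity produces watermark bits, and each bit selects one of two
# functionally equivalent layout options.

def watermark_bits(design_id: str, key: bytes, n_bits: int):
    digest = hmac.new(key, design_id.encode(), hashlib.sha256).digest()
    bits = [(byte >> i) & 1 for byte in digest for i in range(8)]
    return bits[:n_bits]

def embed(free_choices, bits):
    """Each free choice is a pair of equivalent options; the bit picks one."""
    return [options[b] for options, b in zip(free_choices, bits)]

def verify(layout_choices, free_choices, design_id, key):
    expected = watermark_bits(design_id, key, len(free_choices))
    return all(layout_choices[i] == free_choices[i][b]
               for i, b in enumerate(expected))

# Hypothetical usage with abstract "option A"/"option B" layout decisions.
free = [("routeA", "routeB"), ("padL", "padR"), ("up", "down")]
key = b"author-secret-key"
layout = embed(free, watermark_bits("biochip-v1", key, len(free)))
print(verify(layout, free, "biochip-v1", key))   # True for an untampered layout
```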
{"title":"Enhanced Watermarking for Paper-Based Digital Microfluidic Biochips","authors":"Jian-De Li, Sying-Jyan Wang, Katherine Shu-Min Li, Tsung-Yi Ho","doi":"10.1145/3661309","DOIUrl":"https://doi.org/10.1145/3661309","url":null,"abstract":"<p>Paper-based digital microfluidic biochip (PB-DMFB) technology provides a promising solution to many biochemical applications. However, the PB-DMFB manufacturing process may suffer from potential security threats. For example, Trojan insertion attack may affect the functionality of PB-DMFBs. To ensure the correct functionality of PB-DMFBs, we propose a watermarking scheme to hide information in the PB-DMFB layout, which allows users to check design integrity and authenticate the source of the PB-DMFB design. As a result, the proposed method serves as a countermeasure against Trojan insertion attacks in addition to proof of authorship.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"1 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Davide Baroffio, Federico Reghenzani, William Fornaciari
Software-Implemented Hardware Fault Tolerance (SIHFT) is a modern approach for tackling random hardware faults in dependable systems using software-only solutions. This work extends ASPIS, an automatic compiler-based SIHFT hardening tool, with novel protection mechanisms and overhead-reduction techniques, and provides an extensive analysis of its compliance with the non-trivial workload of the open-source real-time operating system FreeRTOS. A thorough experimental fault-injection campaign on an STM32 board shows that the system achieves remarkably high tolerance to single-event upsets, and a comparison of the implemented SIHFT mechanisms summarises the trade-off between the overhead introduced and the detection capabilities of the various solutions.
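For readers unfamiliar with SIHFT-style hardening, the sketch below illustrates the classic duplication-with-comparison idea that such compiler passes automate; it is a conceptual sketch only, not ASPIS's actual mechanisms, and uses Python purely for readability (real hardening duplicates registers and instructions at the machine level).

```python
# Conceptual illustration of the duplication-with-comparison idea behind SIHFT
# (e.g., EDDI-style schemes) that hardening compiler passes insert
# automatically at the instruction level. Python is used only for readability;
# real hardening duplicates registers/variables so a single-event upset that
# corrupts one copy is caught when the two copies are compared.

class FaultDetected(Exception):
    pass

class DupVar:
    """A value kept in two copies that are updated and checked in lockstep."""
    def __init__(self, value):
        self.a = value          # primary copy
        self.b = value          # shadow copy

    def apply(self, fn):
        self.a = fn(self.a)
        self.b = fn(self.b)     # same operation repeated on the shadow copy

    def check(self):
        if self.a != self.b:    # a bit flip in either copy triggers detection
            raise FaultDetected(f"{self.a} != {self.b}")
        return self.a

x = DupVar(41)
x.apply(lambda v: v + 1)
print(x.check())                # 42; a corrupted copy would raise instead
```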
{"title":"Enhanced Compiler Technology for Software-based Hardware Fault Detection","authors":"Davide Baroffio, Federico Reghenzani, William Fornaciari","doi":"10.1145/3660524","DOIUrl":"https://doi.org/10.1145/3660524","url":null,"abstract":"<p>Software-Implemented Hardware Fault Tolerance (SIHFT) is a modern approach for tackling random hardware faults of dependable systems employing solely software solutions. This work extends an automatic compiler-based SIHFT hardening tool called ASPIS, enhancing it with novel protection mechanisms and overhead-reduction techniques, also providing an extensive analysis of its compliance with the non-trivial workload of the open-source Real-Time Operating System FreeRTOS. A thorough experimental fault-injection campaign on an STM32 board shows how the system achieves remarkably high tolerance to single-event upsets and a comparison between the SIHFT mechanisms implemented summarises the trade-off between the overhead introduced and the detection capabilities of the various solutions.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"211 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140634581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}