Aritra Bagchi, Dharamjeet, Ohm Rishabh, Manan Suri, Preeti Ranjan Panda
Non-volatile memories (NVMs), with their high storage density and ultra-low leakage power, offer promising potential for redesigning the memory hierarchy in next-generation Multi-Processor Systems-on-Chip (MPSoCs). However, the adoption of NVMs in cache designs introduces challenges such as NVM write overheads and limited NVM endurance. The shared NVM cache in an MPSoC experiences requests from different processor cores and responses from the off-chip memory when the requested data is not present in the cache. In addition, upon evictions of dirty data from higher-level caches, the shared NVM cache experiences another source of write operations, known as writebacks. These sources of write operations, writebacks and responses, further exacerbate contention for the shared bandwidth of the NVM cache and create significant performance bottlenecks. Uncontrolled write operations can also degrade the endurance of the NVM cache, posing a threat to cache lifetime and system reliability. Existing strategies often address either performance or cache endurance individually, leaving a gap for a holistic solution. This study introduces the Performance Optimization and Endurance Management (POEM) methodology, a novel approach that aggressively bypasses cache writebacks and responses to alleviate NVM cache contention. Contrary to existing bypass policies, which focus heavily on cache data reuse and pay inadequate attention to shared NVM cache contention, POEM's aggressive bypass significantly improves overall system performance, even at the expense of data reuse. POEM also employs effective wear leveling to enhance NVM cache endurance through careful redistribution of write operations across different cache lines. Across diverse workloads, POEM yields an average speedup of 34% over a naïve baseline and 28.8% over a state-of-the-art NVM cache bypass technique, while enhancing cache endurance by 15% over the baseline.
POEM also explores diverse design choices by exploiting a key policy parameter that assigns varying priorities to the two system-level objectives.
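The bypass trade-off described above can be illustrated with a toy decision rule. This is only a sketch: the abstract does not specify POEM's actual policy, so the queue-depth heuristic, the function and parameter names, and the way the priority parameter scales the threshold are all assumptions made here for illustration.

```python
# Hypothetical contention-aware bypass sketch; POEM's real policy is not
# given in the abstract, so all names and thresholds below are assumed.

def should_bypass(write_queue_depth: int, contention_threshold: int,
                  priority_weight: float) -> bool:
    """Decide whether an incoming writeback/response should bypass the NVM cache.

    priority_weight in [0, 1] mimics the policy parameter that trades off
    performance (aggressive bypass) against endurance and data reuse.
    """
    # Higher priority on performance lowers the bar to bypass.
    effective_threshold = contention_threshold * (1.0 - priority_weight)
    return write_queue_depth > effective_threshold

# Example: heavy contention (queue depth 12) with a performance-leaning weight
# diverts the write straight to off-chip memory.
print(should_bypass(write_queue_depth=12, contention_threshold=16,
                    priority_weight=0.5))  # True
```

With `priority_weight=0.0` the same queue depth would not trigger a bypass, mirroring how the policy parameter reassigns priority between the two system-level objectives.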
POEM: Performance Optimization and Endurance Management for Non-volatile Caches. ACM Transactions on Design Automation of Electronic Systems, published 2024-03-27. https://doi.org/10.1145/3653452
Bo Yang, Qi Xu, Hao Geng, Song Chen, Bei Yu, Yi Kang
In this paper, we focus on chip floorplanning, which aims to determine the location and orientation of circuit macros simultaneously, so that the chip area and wirelength are minimized. As the highest level of abstraction in hierarchical physical design, floorplanning bridges the gap between the system-level design and the physical synthesis, whose quality directly influences downstream placement and routing. To tackle chip floorplanning, we propose an end-to-end reinforcement learning (RL) methodology with a hindsight experience replay technique. An edge-aware graph attention network (EAGAT) is developed to effectively encode the macro and connection features of the netlist graph. Moreover, we build a hierarchical decoder architecture mainly consisting of transformer and attention pointer mechanism to output floorplan actions. Since the RL agent automatically extracts knowledge about the solution space, the previously learned policy can be quickly transferred to optimize new unseen netlists. Experimental results demonstrate that, compared with state-of-the-art floorplanners, the proposed end-to-end methodology significantly optimizes area and wirelength on public GSRC and MCNC benchmarks.
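The wirelength objective mentioned above is commonly approximated by half-perimeter wirelength (HPWL). The abstract does not state which wirelength model the floorplanner uses, so the sketch below shows the standard HPWL proxy with made-up net names and pin coordinates.

```python
# Standard half-perimeter wirelength (HPWL) proxy; net names and pin
# coordinates are invented for illustration.

def hpwl(nets: dict) -> float:
    """Sum, over all nets, of the half-perimeter of the net's pin bounding box."""
    total = 0.0
    for pins in nets.values():
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

nets = {
    "n1": [(0.0, 0.0), (4.0, 3.0)],              # bbox 4 x 3 -> 7
    "n2": [(1.0, 1.0), (2.0, 5.0), (3.0, 2.0)],  # bbox 2 x 4 -> 6
}
print(hpwl(nets))  # 13.0
```

An RL floorplanner can use a weighted sum of chip area and a metric like this as the (negative) reward for each placed macro.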
Floorplanning with Edge-Aware Graph Attention Network and Hindsight Experience Replay. ACM Transactions on Design Automation of Electronic Systems, published 2024-03-22. https://doi.org/10.1145/3653453
Chunlin Li, Kun Jiang, Yong Zhang, Lincheng Jiang, Youlong Luo, Shaohua Wan
The convergence of unmanned aerial vehicle (UAV)-aided mobile edge computing (MEC) networks and blockchain transforms the existing mobile networking paradigm. However, in the temporary hotspot scenario for intelligent connected vehicles (ICVs) in UAV-aided MEC networks, deploying blockchain-based services and applications in vehicles is generally impossible due to their high computational resource and storage requirements. One possible solution is to offload part or all of the computational tasks to MEC servers wherever possible. Unfortunately, due to the limited availability and high mobility of the vehicles, simple solutions that can support low-latency and high-reliability networking services for ICVs are still lacking. In this paper, we study the task offloading problem of minimizing total system latency and finding the optimal task offloading scheme, subject to constraints on the hover position coordinates of the UAV, the fixed bonuses, flexible transaction fees, transaction rates, mining difficulty, costs, and battery energy consumption of the UAV. The problem is confirmed to be a challenging integer linear programming problem, so we formulate it as a constrained Markov decision process (CMDP). Since Deep Reinforcement Learning (DRL) has proven effective for sequential decision-making problems in dynamic ICV environments, we propose a novel distributed DRL-based P-D3QN approach, combining a Prioritized Experience Replay (PER) strategy with the dueling double deep Q-network (D3QN) algorithm, to solve for the optimal task offloading policy effectively. Finally, experimental results show that, compared with the benchmark scheme, the P-D3QN algorithm reduces latency by about 26.24% and increases offloading utility by about 42.26%.
Deep Reinforcement Learning-Based Mining Task Offloading Scheme for Intelligent Connected Vehicles in UAV-Aided MEC. ACM Transactions on Design Automation of Electronic Systems, published 2024-03-20. https://doi.org/10.1145/3653451
Hongduo Liu, Yijian Qian, Youqiang Liang, Bin Zhang, Zhaohan Liu, Tao He, Wenqian Zhao, Jiangbo Lu, Bei Yu
In the digital era, the prevalence of low-quality images contrasts with the widespread use of high-definition displays, primarily due to low-resolution cameras and compression technologies. Image super-resolution (SR) techniques, particularly those leveraging deep learning, aim to enhance these images for high-definition presentation. However, real-time execution of deep neural network (DNN)-based SR methods at the edge poses challenges due to their high computational and storage requirements. To address this, field-programmable gate arrays (FPGAs) have emerged as a promising platform, offering flexibility, programmability, and adaptability to evolving models. Previous FPGA-based SR solutions have focused on reducing computational and memory costs through aggressive simplification techniques, often sacrificing the quality of the reconstructed images. This paper introduces a novel SR network specifically designed for edge applications, which maintains reconstruction performance while managing computation costs effectively. Additionally, we propose an architectural design that enables the real-time and end-to-end inference of the proposed SR network on embedded FPGAs. Our key contributions include a tailored SR algorithm optimized for embedded FPGAs, a DSP-enhanced design that achieves a significant four-fold speedup, a novel scalable cache strategy for handling large feature maps, optimization of DSP cascade consumption, and a constraint optimization approach for resource allocation. Experimental results demonstrate that our FPGA-specific accelerator surpasses existing solutions, delivering superior throughput, energy efficiency, and image quality.
A High-Performance Accelerator for Real-Time Super-Resolution on Edge FPGAs. ACM Transactions on Design Automation of Electronic Systems, published 2024-03-16. https://doi.org/10.1145/3652855
The Newcomb-Benford law, also known as Benford's law or the law of anomalous numbers, states that in many real-life numerical datasets, including physical and statistical ones, numbers with small leading digits occur disproportionately often. This irregularity observed in nature raises the question: is the arithmetic-logic unit, responsible for performing calculations in computers, optimal? Are there other architectures, not as regular as the commonly used Parallel Prefix Adders (PPAs), that can perform better, especially when operating on datasets that are not purely random but irregular? In this article, propagate-generate tree structures are compared across regular and irregular configurations. Various structures are examined: regular, irregular, with gray cells only, with both gray and black cells, and with higher-valency cells. Performance is evaluated in terms of energy consumption, using an extended power model of static CMOS gates. The model is based on changes of vectors, naturally taking spatio-temporal correlations into account. The energy parameters of the designed cells were calculated from electrical (Spice) simulations. Designs and simulations were done in the Cadence environment, and power dissipation calculations were performed in Matlab. The results clearly show that there are PPA structures that perform much better for specific types of numerical data; negligent design can lead to a more than twofold increase in power consumption. The novel PPA architectures described in this work might find practical applications in specialized adders dealing with numerical datasets, such as the sine functions commonly used in digital signal processing.
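Benford's law predicts the leading-digit probability P(d) = log10(1 + 1/d), so digit 1 appears about 30.1% of the time. As a quick self-contained check (not taken from the article), powers of 2 are a classic dataset that follows the law closely:

```python
# Compare the observed leading-digit frequencies of 2^1 .. 2^1000 against
# Benford's prediction P(d) = log10(1 + 1/d).
import math
from collections import Counter

def leading_digit(n: int) -> int:
    return int(str(n)[0])

data = [2 ** k for k in range(1, 1001)]
counts = Counter(leading_digit(n) for n in data)

for d in range(1, 10):
    observed = counts[d] / len(data)
    predicted = math.log10(1 + 1 / d)
    print(f"digit {d}: observed {observed:.3f}, Benford {predicted:.3f}")
```

Operands with such skewed digit distributions exercise an adder's carry network very differently from uniformly random data, which is why an irregular prefix tree can win on energy.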
Comparative Analysis of Dynamic Power Consumption of Parallel Prefix Adder. Ireneusz Brzozowski. ACM Transactions on Design Automation of Electronic Systems, published 2024-03-11. https://doi.org/10.1145/3651984
Md Moshiur Rahman, Jim Geist, Daniel Xing, Yuntao Liu, Ankur Srivastava, Travis Meade, Yier Jin, Swarup Bhunia
Due to the inclination towards a fab-less model of integrated circuit (IC) manufacturing, several untrusted entities get white-box access to proprietary intellectual property (IP) blocks from diverse vendors. These untrusted entities pose security-breach threats in the form of piracy, cloning, and reverse engineering, sometimes threatening national security. Hardware obfuscation is a prominent countermeasure against such issues: it prevents usage of the IP blocks without authorization from the IP owners. In finite state machine (FSM) transformation-based hardware obfuscation, the design's FSM is transformed to make it difficult for an attacker to reverse engineer the design. A secret key must be applied to make the FSM functional, thus preventing usage of the IP for unintended purposes. Although several hardware obfuscation techniques have been proposed, due to the inability to analyze the techniques from the attacker's standpoint, numerous vulnerabilities inherent to the obfuscation methods go undetected unless a true adversary discovers them. In this paper, we present a collaborative approach between two entities, one acting as an attacker or red team and another as a defender or blue team, the first systematic approach to replicate the real attacker-defender scenario in the hardware security domain, which in turn strengthens the FSM transformation-based obfuscation technique. The blue team transforms the underlying FSM of a gate-level netlist using state space obfuscation. The red team plays the role of an adversary or evaluator and tries to unlock the design by extracting the unlocking key or recovering the obfuscation circuitry. As the key outcome of this red team and blue team effort, a robust state space obfuscation methodology emerges, showing strong security promise.
Security Evaluation of State Space Obfuscation of Hardware IP through a Red Team – Blue Team Practice. ACM Transactions on Design Automation of Electronic Systems, published 2024-03-05. https://doi.org/10.1145/3640461
Testing for delay faults after chip manufacturing is critical to correct chip operation. Tests for delay faults are applied using scan chains that provide access to internal memory elements. As a result, a circuit may operate under non-functional operation conditions during test application. This may lead to overtesting. The extraction of broadside tests from functional test sequences ensures that the tests create functional operation conditions. When N functional test sequences of length L + 1 are available, the number of broadside tests that can be extracted is N · L. Depending on the source of the functional test sequences, the value of N · L may be large. In this case, it is important to select a subset of n ≤ N sequences, and consider only the first l ≤ L clock cycles of every sequence for the extraction of n · l ≪ N · L broadside tests. The two-dimensional N × L search space for broadside tests is the subject of this article. Using a static procedure that considers fixed values of n and l, the article demonstrates that, for the same value of n · l, different circuits benefit from different values of n and l. It also describes a dynamic procedure that matches the parameters n and l to the circuit. The discussion is supported by experimental results for transition faults in benchmark circuits.
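Since each broadside test is one consecutive pair of vectors in a functional sequence, restricting extraction to n sequences and the first l cycles yields n · l of the N · L candidate tests. A minimal sketch of this two-dimensional restriction, with toy string vectors standing in for scan states:

```python
# Each consecutive pair (s_i, s_{i+1}) within a functional test sequence of
# length L+1 yields one broadside test; N sequences give N*L tests in total.

def extract_broadside(sequences, n, l):
    """Take the first n sequences and, within each, the first l vector pairs."""
    tests = []
    for seq in sequences[:n]:
        for i in range(min(l, len(seq) - 1)):
            tests.append((seq[i], seq[i + 1]))
    return tests

# Toy example: N = 3 sequences of length L + 1 = 4, so the full space is
# N * L = 9 broadside tests.
seqs = [["a0", "a1", "a2", "a3"],
        ["b0", "b1", "b2", "b3"],
        ["c0", "c1", "c2", "c3"]]
subset = extract_broadside(seqs, n=2, l=2)
print(len(subset))  # 4 tests instead of 9
```

The static procedure in the article fixes n and l up front; the dynamic procedure instead adapts the two parameters to the circuit while keeping the product n · l bounded.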
2D Search Space for Extracting Broadside Tests from Functional Test Sequences. Irith Pomeranz. ACM Transactions on Design Automation of Electronic Systems, published 2024-03-02. https://doi.org/10.1145/3650207
Fault attacks are among the most powerful forms of cryptanalytic attack on embedded systems; they can corrupt a cipher's operations, leading to a breach of confidentiality and integrity. A single precisely injected fault during the execution of a cipher can be exploited to retrieve the secret key in a few milliseconds. Naïve countermeasures introduced into an implementation can lead to huge overheads, making them unusable in resource-constrained environments. On the other hand, optimized countermeasures require significant knowledge, not just about the attack, but also about (a) the cryptographic properties of the cipher, (b) the program structure, and (c) the underlying hardware architecture. This makes protection against fault attacks tedious and error-prone.
In this paper, we introduce the first automated compiler framework named FortiFix that can detect and patch fault exploitable regions in a block cipher implementation. The framework has two phases. The pre-compilation phase identifies regions in the source code of a block cipher that are vulnerable to fault attacks. The second phase is incorporated as transformation passes in the LLVM compiler to find exploitable instructions, quantify the impact of a fault on these instructions, and finally insert appropriate countermeasures based on user defined security requirements. As a proof of concept, we have evaluated two block cipher implementations AES-128 and CLEFIA-128 on three different hardware platforms such as MSP430 (16-bit), ARM (32-bit) and RISCV (32-bit).
{"title":"FortiFix: A Fault Attack Aware Compiler Framework for Crypto Implementations","authors":"Keerthi K, Chester Rebeiro","doi":"10.1145/3650029","DOIUrl":"https://doi.org/10.1145/3650029","url":null,"abstract":"<p>Fault attacks are one of the most powerful forms of cryptanalytic attack on embedded systems, that can corrupt cipher’s operations leading to a breach of confidentiality and integrity. A single precisely injected fault during the execution of a cipher can be exploited to retrieve the secret key in a few milliseconds. Naïve countermeasures introduced into implementation can lead to huge overheads, making them unusable in resource-constraint environments. On the other hand, optimized countermeasures requires significant knowledge, not just about the attack, but also on the (a) the cryptographic properties of the cipher, (b) the program structure, and (c) the underlying hardware architecture. This makes the protection against fault attacks tedious and error-prone. </p><p>In this paper, we introduce the first automated compiler framework named <span>FortiFix</span> that can detect and patch fault exploitable regions in a block cipher implementation. The framework has two phases. The <i>pre-compilation phase</i> identifies regions in the source code of a block cipher that are vulnerable to fault attacks. The second phase is incorporated as transformation passes in the LLVM compiler to find exploitable instructions, quantify the impact of a fault on these instructions, and finally insert appropriate countermeasures based on user defined security requirements. 
As a proof of concept, we have evaluated two block cipher implementations AES-128 and CLEFIA-128 on three different hardware platforms such as MSP430 (16-bit), ARM (32-bit) and RISCV (32-bit).</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"80 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140017391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Root-cause analysis for integrated systems has become increasingly challenging due to their growing complexity. To tackle these challenges, machine learning (ML) has been applied to enhance root-cause analysis. Nonetheless, ML-based root-cause analysis usually requires abundant training data with root causes labeled by human experts, which are difficult or even impossible to obtain. To overcome this drawback, a semi-supervised co-training method is proposed for root-cause-analysis in this paper, which only requires a small portion of labeled data. First, a random forest is trained with labeled data. Next, we propose a co-training technique to learn from unlabeled data with semi-supervised learning, which pre-labels a subset of these data automatically and then retrains each decision tree in the random forest. In addition, a robust framework is proposed to avoid over-fitting. We further apply initialization by clustering and feature selection to improve the diagnostic performance. With two case studies from industry, the proposed approach shows superior performance against other state-of-the-art methods by saving up to 67% of labeling efforts.
由于集成系统日益复杂,对其进行根本原因分析变得越来越具有挑战性。为了应对这些挑战,人们应用机器学习(ML)来加强根源分析。然而,基于 ML 的根本原因分析通常需要大量由人类专家标注根本原因的训练数据,而这些数据很难甚至不可能获得。为了克服这一缺点,本文提出了一种用于根源分析的半监督协同训练方法,它只需要一小部分标注数据。首先,使用标注数据训练随机森林。接下来,我们提出了一种通过半监督学习从无标签数据中学习的联合训练技术,该技术会自动预标记这些数据的一个子集,然后重新训练随机森林中的每一棵决策树。此外,我们还提出了一个稳健的框架,以避免过度拟合。我们进一步通过聚类和特征选择进行初始化,以提高诊断性能。通过对两个行业案例的研究,我们提出的方法与其他最先进的方法相比表现出更优越的性能,最多可节省 67% 的标注工作。
{"title":"Root-Cause Analysis with Semi-Supervised Co-Training for Integrated Systems","authors":"Renjian Pan, Xin Li, Krishnendu Chakrabarty","doi":"10.1145/3649313","DOIUrl":"https://doi.org/10.1145/3649313","url":null,"abstract":"<p>Root-cause analysis for integrated systems has become increasingly challenging due to their growing complexity. To tackle these challenges, machine learning (ML) has been applied to enhance root-cause analysis. Nonetheless, ML-based root-cause analysis usually requires abundant training data with root causes labeled by human experts, which are difficult or even impossible to obtain. To overcome this drawback, a semi-supervised co-training method is proposed for root-cause-analysis in this paper, which only requires a small portion of labeled data. First, a random forest is trained with labeled data. Next, we propose a co-training technique to learn from unlabeled data with semi-supervised learning, which pre-labels a subset of these data automatically and then retrains each decision tree in the random forest. In addition, a robust framework is proposed to avoid over-fitting. We further apply initialization by clustering and feature selection to improve the diagnostic performance. 
With two case studies from industry, the proposed approach shows superior performance against other state-of-the-art methods by saving up to 67% of labeling efforts.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"1 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140017388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prior hardware accelerator designs primarily focused on single-chip solutions for 10MB-class computer vision models. The GB-class transformer models for natural language processing (NLP) impose challenges on existing accelerator design due to the massive number of parameters and the diverse matrix multiplication (MatMul) workloads involved. This work proposes a heterogeneous 3D-based accelerator design for transformer models, which adopts an interposer substrate with multiple 3D memory/logic hybrid cubes optimized for accelerating different MatMul workloads. An approximate computing scheme is proposed to take advantage of heterogeneous computing paradigms of mixed-signal compute-in-memory (CIM) and digital tensor processing units (TPU). From the system-level evaluation results, 10 TOPS/W energy efficiency is achieved for the Bert and GPT2 model, which is about 2.6 × ∼ 3.1 × higher than the baseline with 7nm TPU and stacked FeFET memory.
{"title":"H3D-Transformer: A Heterogeneous 3D (H3D) Computing Platform for Transformer Model Acceleration on Edge Devices","authors":"Yandong Luo, Shimeng Yu","doi":"10.1145/3649219","DOIUrl":"https://doi.org/10.1145/3649219","url":null,"abstract":"<p>Prior hardware accelerator designs primarily focused on single-chip solutions for 10MB-class computer vision models. The GB-class transformer models for natural language processing (NLP) impose challenges on existing accelerator design due to the massive number of parameters and the diverse matrix multiplication (MatMul) workloads involved. This work proposes a heterogeneous 3D-based accelerator design for transformer models, which adopts an interposer substrate with multiple 3D memory/logic hybrid cubes optimized for accelerating different MatMul workloads. An approximate computing scheme is proposed to take advantage of heterogeneous computing paradigms of mixed-signal compute-in-memory (CIM) and digital tensor processing units (TPU). From the system-level evaluation results, 10 TOPS/W energy efficiency is achieved for the Bert and GPT2 model, which is about 2.6 × ∼ 3.1 × higher than the baseline with 7nm TPU and stacked FeFET memory.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"145 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139987992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}