Latest Publications: ACM Transactions on Embedded Computing Systems

Energy-efficient Communications for Improving Timely Progress of Intermittent Powered BLE Devices
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-11-09 | DOI: 10.1145/3626197
Chen-Tui Hung, Kai Xuan Lee, Yi-Zheng Liu, Ya-Shu Chen, Zhong-Han Chan
Battery-less devices offer potential solutions for maintaining sustainable Internet of Things (IoT) networks. However, limited energy harvesting capacity can lead to power failures, limiting the system’s quality of service (QoS). To improve timely task progress, we present ETIME, a scheduling framework that enables energy-efficient communication for intermittent-powered IoT devices. To maximize energy efficiency while meeting the timely requirements of intermittent systems, we first model the relationship between insufficient harvesting energy and task behavior time. We then propose a method for predicting response times for battery-less devices. Considering both delays from multiple task interference and insufficient system energy, we introduce a dynamic wake-up strategy to improve timely task progress. Additionally, to minimize power consumption from connection components, we propose a dynamic connection interval adjustment to provide energy-efficient communication. The proposed algorithms are implemented in a lightweight operating system on real devices. Experimental results show that our approach can significantly improve progress for timely applications while maintaining task progress.
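The dynamic connection interval adjustment is the most concrete mechanism in this abstract, so a small illustration may help. The sketch below is a hypothetical rendering, not the ETIME implementation: the function name `choose_connection_interval`, the energy model, and the event-length constant are our assumptions; only the interval bounds come from the BLE specification.

```python
# Hypothetical sketch of a dynamic BLE connection interval policy.
# Constants and names are illustrative assumptions, not the ETIME code.

BLE_MIN_INTERVAL_MS = 7.5     # BLE spec lower bound for connection interval
BLE_MAX_INTERVAL_MS = 4000.0  # BLE spec upper bound

def choose_connection_interval(harvest_rate_mw, radio_cost_mw, slack_ms):
    """Pick a connection interval (ms) that the harvester can sustain
    while still waking often enough to meet the remaining deadline slack."""
    # Longest interval the slack tolerates: wake at least twice before
    # the deadline so one missed connection event is survivable.
    deadline_bound = max(BLE_MIN_INTERVAL_MS, slack_ms / 2)
    # Shortest interval the energy budget sustains: duty-cycle the radio
    # so its average power does not exceed the harvesting rate.
    event_length_ms = 3.0  # assumed duration of one connection event
    energy_bound = event_length_ms * radio_cost_mw / max(harvest_rate_mw, 1e-6)
    return min(max(energy_bound, BLE_MIN_INTERVAL_MS), deadline_bound,
               BLE_MAX_INTERVAL_MS)

# Example: weak harvesting (0.5 mW) against a 12 mW radio, 800 ms of slack.
print(choose_connection_interval(0.5, 12.0, 800.0))  # -> 72.0
```

The trade-off the paper navigates sits between exactly these two bounds: widening the interval saves radio energy, narrowing it improves timely progress.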
Citations: 0
A Comprehensive Model for Efficient Design Space Exploration of Imprecise Computational Blocks
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-11-09 | DOI: 10.1145/3625555
Mohammad Haji Seyed Javadi, Mohsen Faryabi, Hamid Reza Mahdiani
After almost a decade of research, the development of more efficient imprecise computational blocks remains a major concern in the imprecise computing domain. Many imprecise components of different types have been introduced, differing mainly in the precision-cost-performance trade-offs they offer. In this paper, a novel comprehensive model for imprecise components is introduced that covers a wide range of precision-cost-performance trade-offs across different component types. The model helps to find a suitable imprecise component for any desired error criterion. Its most significant advantage is therefore that it can be directly exploited for design space exploration over different imprecise components, extracting those with the desired precision-cost-performance trade-off for any specific application. To demonstrate the model's efficiency, two novel families of Lowest-cost Imprecise Adders (LIAs) and Lowest-cost Imprecise Multipliers (LIMs) are introduced, systematically extracted by exploring the design space the model provides. A wide range of simulation and synthesis results demonstrates the comparable efficiency of the extracted LIA/LIM structures with respect to the most efficient existing human-made imprecise components, both individually and in a Multiply-Accumulate application.
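To make the error-criterion-driven selection concrete, here is a small hypothetical sketch: it exhaustively evaluates the mean error distance (MED) of simple lower-bit-truncated adders and returns the most aggressive (assumed cheapest) configuration that still meets an error bound. The truncation family and cost ordering are illustrative stand-ins for the paper's LIA design space, not the authors' structures.

```python
# Hypothetical design-space exploration over truncated adders. The
# truncation family and cost model stand in for the paper's LIA space.
from itertools import product

def truncated_add(a, b, k, width=8):
    """Add two unsigned `width`-bit values, forcing the k lowest
    result bits to zero (a crude imprecise adder)."""
    mask = ((1 << width) - 1) ^ ((1 << k) - 1)
    return ((a + b) & ((1 << width) - 1)) & mask

def mean_error_distance(k, width=8):
    """Exhaustive MED of the k-bit-truncated adder over all inputs."""
    n, total = 1 << width, 0
    for a, b in product(range(n), repeat=2):
        exact = (a + b) & (n - 1)
        total += abs(exact - truncated_add(a, b, k, width))
    return total / (n * n)

def cheapest_config(med_bound, width=8):
    """Most truncation (assumed lowest hardware cost) within the bound."""
    for k in range(width, -1, -1):  # more truncated bits = cheaper
        if mean_error_distance(k, width) <= med_bound:
            return k
    return None

print(cheapest_config(med_bound=4.0))  # -> 3 (truncate the 3 LSBs)
```

The paper's model plays the role of `mean_error_distance` here, but analytically and for arbitrary error criteria, which is what makes exploration over large design spaces tractable.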
Citations: 0
Online Distributed Schedule Randomization to Mitigate Timing Attacks in Industrial Control Systems
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-11-09 | DOI: 10.1145/3624584
Ankita Samaddar, Arvind Easwaran
Industrial control systems (ICSs) consist of a large number of control applications that are associated with periodic real-time flows with hard deadlines. To facilitate large-scale integration, remote control, and coordination, wireless sensor and actuator networks form the main communication framework in most ICSs. Among the existing wireless sensor and actuator network protocols, WirelessHART is the most suitable protocol for real-time applications in ICSs. The communications in a WirelessHART network are time-division multiple access based. To satisfy the hard deadlines of the real-time flows, the schedule in a WirelessHART network is pre-computed. The same schedule is repeated over every hyperperiod (i.e., lowest common multiple of the periods of the flows). However, a malicious attacker can exploit the repetitive behavior of the flow schedules to launch timing attacks (e.g., selective jamming attacks). To mitigate timing attacks, we propose an online distributed schedule randomization strategy that randomizes the time-slots in the schedules at each network device without violating the flow deadlines, while ensuring the closed-loop control stability. To increase the extent of randomization in the schedules further, and to reduce the energy consumption of the system, we incorporate a period adaptation strategy that adjusts the transmission periods of the flows depending on the stability of the control loops at runtime. We use Kullback-Leibler divergence and prediction probability of slots as two metrics to evaluate the performance of our proposed strategy. We compare our strategy with an offline centralized schedule randomization strategy. Experimental results show that the schedules generated by our strategy are 10% to 15% more diverse and 5% to 10% less predictable on average compared to the offline strategy when the numbers of base schedules and keys vary between 4 and 6 and between 12 and 32, respectively, under all slot utilizations (number of occupied slots in a hyperperiod). On incorporating period adaptation, the divergence in the schedules reduces at each period increase, with 46% less power consumption on average.
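The Kullback-Leibler metric for schedule diversity can be illustrated with a short sketch: treat the slots a flow occupies across hyperperiods as an empirical distribution and measure divergence between a randomized schedule and a static baseline. The slot encoding and smoothing are our assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: KL divergence between two schedules' empirical
# slot-usage distributions. Slot encoding and smoothing are assumptions.
import math
from collections import Counter

def slot_distribution(occupied_slots, num_slots):
    """Empirical distribution over hyperperiod slots, with add-one
    (Laplace) smoothing so no slot has zero probability."""
    counts = Counter(occupied_slots)
    total = len(occupied_slots) + num_slots
    return [(counts.get(s, 0) + 1) / total for s in range(num_slots)]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

baseline   = [0, 0, 0, 0, 0, 0, 0, 0]  # flow always scheduled in slot 0
randomized = [0, 3, 5, 1, 7, 2, 0, 6]  # slot varies each hyperperiod
p = slot_distribution(randomized, num_slots=8)
q = slot_distribution(baseline, num_slots=8)
print(round(kl_divergence(p, q), 3))  # larger = harder to predict
```

A selective jammer that must guess which slot to jam faces exactly this distribution, which is why divergence and slot prediction probability are natural evaluation metrics.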
Citations: 0
Design and Analysis of High Performance Heterogeneous Block-based Approximate Adders
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-11-09 | DOI: 10.1145/3625686
Ebrahim Farahmand, Ali Mahani, Muhammad Abdullah Hanif, Muhammad Shafique
Approximate computing is an emerging paradigm to improve the power and performance efficiency of error-resilient applications. As adders are one of the key components in almost all processing systems, a significant amount of research has been carried out toward designing approximate adders that can offer better efficiency than conventional designs; however, at the cost of some accuracy loss. In this article, we highlight a new class of energy-efficient approximate adders, namely, Heterogeneous Block-based Approximate Adders (HBAAs), and propose a generic configurable adder model that can be configured to represent a particular HBAA configuration. An HBAA, in general, is composed of heterogeneous sub-adder blocks of equal length, where each sub-adder can be an approximate sub-adder and have a different configuration. The sub-adders are mainly approximated through inexact logic and carry truncation. Compared to the existing design space, HBAAs provide additional design points that fall on the Pareto front and offer a better quality-efficiency trade-off in certain scenarios. Furthermore, to enable efficient design space exploration based on user-defined constraints, we propose an analytical model to efficiently evaluate the Probability Mass Function (PMF) of approximation error and other error metrics, such as Mean Error Distance (MED), Normalized Mean Error Distance (NMED), and Error Rate (ER) of HBAAs. The results show that HBAA configurations can provide around 15% reduction in area and up to 17% reduction in energy compared to state-of-the-art approximate adders.
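The block composition behind HBAAs can be modeled in a few lines. The sketch below is an illustrative behavioral model under our own configuration encoding, not the paper's hardware: an n-bit adder is split into equal-length sub-adder blocks, and the carry into any block marked approximate is simply dropped.

```python
# Illustrative behavioral model of a heterogeneous block-based
# approximate adder: equal-length blocks, each exact or carry-truncating.
# The configuration encoding is an assumption, not the paper's.

def hbaa_add(a, b, block_bits, truncate_carry):
    """Add unsigned integers block by block, LSB first.
    truncate_carry[i] = True drops the carry *into* block i."""
    mask = (1 << block_bits) - 1
    result, carry, shift = 0, 0, 0
    for trunc in truncate_carry:
        if trunc:
            carry = 0  # approximate block: cut the carry chain here
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift
        carry = s >> block_bits
        shift += block_bits
    return result

# 16-bit adder as four 4-bit blocks; the carries into the two middle
# blocks are truncated, the outer ones kept exact.
cfg = [False, True, True, False]
a, b = 0x3C7F, 0x12A1
print(hex(hbaa_add(a, b, 4, cfg)), "vs exact", hex((a + b) & 0xFFFF))
```

Sweeping over such configurations is precisely the design space the paper's analytical PMF model evaluates without resorting to simulation.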
Citations: 2
Enabling Binary Neural Network Training on the Edge
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-11-09 | DOI: 10.1145/3626100
Erwei Wang, James J. Davis, Daniele Moro, Piotr Zielinski, Jia Jie Lim, Claudionor Coelho, Satrajit Chatterjee, Peter Y. K. Cheung, George A. Constantinides
The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this article, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions while inducing little to no accuracy loss vs Courbariaux & Bengio’s standard approach. These decreases are primarily enabled through the retention of activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees memory requirement reductions of 3–5×, while reaching similar test accuracy (± 2 pp) in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of binarized ResNet-18, achieving a 3.78× memory reduction. Our work is open-source, and includes the Raspberry Pi-targeted prototype we used to verify our modeled memory decreases and capture the associated energy drops. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency, and safeguarding end-user privacy.
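Binarized activations that still admit back-propagation are conventionally realized with a sign function whose gradient is approximated by a straight-through estimator (STE). The NumPy sketch below shows that general BNN mechanism; it is not the authors' open-source code.

```python
# Minimal sketch of binarized activations with a straight-through
# estimator: forward uses sign(x); backward passes gradients through
# where |x| <= 1 (hard-tanh derivative). General BNN technique, not
# the paper's released implementation.
import numpy as np

def binarize_forward(x):
    """Forward: activations become {-1, +1}, storable in 1 bit each."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_backward(x, grad_out):
    """STE backward: approximate d(sign)/dx by 1 inside the clipping
    region |x| <= 1 and 0 outside."""
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-1.7, -0.3, 0.0, 0.8, 2.4])
print(binarize_forward(x))                    # [-1. -1.  1.  1.  1.]
print(binarize_backward(x, np.ones_like(x)))  # [ 0.  1.  1.  1.  0.]
```

The paper's key observation is that such backward operations are robust enough to quantization that high-precision activation copies need not be stored; the sketch shows only the standard STE mechanism they build on.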
Citations: 0
Special Issue: “AI Acceleration on FPGAs”
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-11-09 | DOI: 10.1145/3626323
Yun (Eric) Liang, Wei Zhang, Stephen Neuendorffer, Wayne Luk
Editorial introduction to the special issue by guest editors Yun (Eric) Liang (Peking University), Wei Zhang (The Hong Kong University of Science and Technology), Stephen Neuendorffer (Xilinx, San Jose, CA), and Wayne Luk (Imperial College London). ACM Transactions on Embedded Computing Systems, Volume 22, Issue 6, Article No. 89, pp. 1–3. Published: 09 November 2023.
{"title":"Special Issue: “AI Acceleration on FPGAs”","authors":"Yun (Eric) Liang, Wei Zhang, Stephen Neuendorffer, Wayne Luk","doi":"10.1145/3626323","DOIUrl":"https://doi.org/10.1145/3626323","url":null,"abstract":"introduction Share on Special Issue: “AI Acceleration on FPGAs” Authors: Yun (Eric) Liang Peking University, Peking, People's Republic of China Peking University, Peking, People's Republic of China 0000-0002-9076-7998Search about this author , Wei Zhang The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China 0000-0002-7622-6714Search about this author , Stephen Neuendorffer Xilinx, San Jose, CA Xilinx, San Jose, CA 0000-0003-2956-8428Search about this author , Wayne Luk Imperial College London, London, UK Imperial College London, London, UK 0000-0002-6750-927XSearch about this author Authors Info & Claims ACM Transactions on Embedded Computing SystemsVolume 22Issue 6Article No.: 89pp 1–3https://doi.org/10.1145/3626323Published:09 November 2023Publication History 0citation0DownloadsMetricsTotal Citations0Total Downloads0Last 12 Months0Last 6 weeks0 Get Citation AlertsNew Citation Alert added!This alert has been successfully added and will be sent to:You will be notified whenever a record that you have chosen has been cited.To manage your alert preferences, click on the button below.Manage my AlertsNew Citation Alert!Please log in to your account Save to BinderSave to BinderCreate a New BinderNameCancelCreateExport CitationPublisher SiteGet Access","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":" 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135242664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dynamic Thermal Management of 3D Memory through Rotating Low Power States and Partial Channel Closure
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-11-09 | DOI: 10.1145/3624581
Lokesh Siddhu, Aritra Bagchi, Rajesh Kedia, Isaar Ahmad, Shailja Pandey, Preeti Ranjan Panda
Modern high-performance and high-bandwidth three-dimensional (3D) memories are characterized by frequent heating. Prior art suggests turning off hot channels and migrating data to the background DDR memory, incurring significant performance and energy overheads. We propose three Dynamic Thermal Management (DTM) approaches for 3D memories, reducing these overheads. The first approach, Rotating-channel Low-power-state-based DTM (RL-DTM) , minimizes the energy overheads by avoiding data migration. RL-DTM places 3D memory channels into low power states instead of turning them off. Since data accesses are disallowed during low power state, RL-DTM balances each channel’s low-power-state duration. The second approach, Masked rotating-channel Low-power-state-based DTM (ML-DTM) , is a fine-grained policy that minimizes the energy-delay product (EDP) and improves the performance of RL-DTM by considering the channel access rate. The third strategy, Partial channel closure and ML-DTM , minimizes performance overheads of existing channel-level turn-off-based policies by closing a channel only partially and integrating ML-DTM, reducing the number of channels being turned off. We evaluate the proposed DTM policies using various mixes of SPEC benchmarks and multi-threaded workloads and observe them to significantly improve performance, energy, and EDP over state-of-the-art approaches for different 3D memory architectures.
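The first policy, RL-DTM, rotates channels through a low power state while balancing each channel's cumulative low-power duration. A minimal hypothetical sketch of that balancing rule follows; the temperature threshold, epoch granularity, and tie-breaking are our assumptions.

```python
# Hypothetical sketch of rotating-channel low-power-state DTM: each
# epoch, the hot channel with the least accumulated low-power time is
# placed in a low power state (no data migration; accesses merely
# stall). Thresholds and tie-breaking are illustrative assumptions.

def rl_dtm_step(temps, low_power_epochs, threshold_c=85.0):
    """Return the channel to place in a low power state this epoch,
    or None if no channel is hot. Ties go to the channel with the
    least low-power time so durations stay balanced across channels."""
    hot = [ch for ch, t in enumerate(temps) if t >= threshold_c]
    if not hot:
        return None
    victim = min(hot, key=lambda ch: (low_power_epochs[ch], -temps[ch]))
    low_power_epochs[victim] += 1
    return victim

temps = [88.0, 91.0, 79.0, 86.0]  # per-channel temperatures (deg C)
spent = [0, 0, 0, 0]              # epochs each channel spent low power
for epoch in range(4):
    print(epoch, "->", rl_dtm_step(temps, spent), spent)
```

Balancing matters because a channel in a low power state cannot serve accesses, so rotating the penalty keeps no single channel's data unreachable for long.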
Citations: 0
SG-Float: Achieving Memory Access and Computing Power Reduction Using Self-Gating Float in CNNs
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-11-09 | DOI: 10.1145/3624582
Jun-Shen Wu, Tsen-Wei Hsu, Ren-Shuo Liu
Convolutional neural networks (CNNs) are essential for advancing the field of artificial intelligence. However, since these networks are highly demanding in terms of memory and computation, implementing CNNs can be challenging. To make CNNs more accessible to energy-constrained devices, researchers are exploring new algorithmic techniques and hardware designs that can reduce memory and computation requirements. In this work, we present self-gating float (SG-Float), an algorithm-hardware co-design of a novel binary number format, which can significantly reduce memory access and computing power requirements in CNNs. SG-Float is a self-gating format that uses the exponent to self-gate the mantissa to zero, exploiting the floating-point characteristic that the exponent determines a value's magnitude, together with the error tolerance of CNNs. SG-Float represents relatively small values using only the exponent, which increases the proportion of ineffective mantissas, corresponding to reducing mantissa multiplications of floating-point numbers. To minimize the accuracy loss caused by the approximation error introduced by SG-Float, we propose a fine-tuning process to determine the exponent thresholds of SG-Float and reclaim the accuracy loss. We also develop a hardware optimization technique, called the SG-Float buffering strategy, to best match SG-Float with CNN accelerators and further reduce memory access. We apply the SG-Float buffering strategy to vector-vector multiplication processing elements (PEs), which NVDLA adopts, in TSMC 40nm technology. Our evaluation results demonstrate that SG-Float can achieve up to 35% reduction in memory access power and up to 54% reduction in computing power compared with AdaptivFloat, a state-of-the-art format, with negligible power and area overhead. Additionally, we show that SG-Float can be combined with neural network pruning methods to further reduce memory access and mantissa multiplications in pruned CNN models. Overall, our work shows that SG-Float is a promising solution to the problem of CNN memory access and computing power.
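The gating rule itself — an exponent below a threshold forces the stored mantissa to zero — is easy to illustrate. The sketch below is our reading of that rule applied to Python floats; the threshold value and field handling are assumptions, not the paper's NVDLA-style PE datapath.

```python
# Illustrative sketch of the SG-Float gating rule: values whose
# exponent falls below a threshold keep only their power-of-two
# magnitude (mantissa gated to zero). Threshold and field handling
# are assumptions, not the paper's hardware datapath.
import math

def sg_float(x, exp_threshold=-4):
    """Gate the mantissa of small-magnitude values to zero."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent
    if exponent < exp_threshold:
        # Small value: exponent only, i.e., an implicit leading 1 and
        # no stored fraction bits (0.5 * 2**exponent in frexp's
        # mantissa convention).
        return math.copysign(math.ldexp(0.5, exponent), x)
    return x  # large value: keep full precision

for v in [0.7, 0.01, -0.003, 3.14]:
    print(v, "->", sg_float(v))
```

Multiplying by a gated value then needs no mantissa multiplier at all, just an exponent addition and a shift, which suggests why ineffective mantissas save both multiplier work and memory traffic.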
Citations: 0
PArtNNer: Platform-agnostic Adaptive Edge-Cloud DNN Partitioning for minimizing End-to-End Latency
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-10-27 | DOI: 10.1145/3630266
Soumendu Kumar Ghosh, Arnab Raha, Vijay Raghunathan, Anand Raghunathan
The last decade has seen the emergence of Deep Neural Networks (DNNs) as the de facto algorithm for various computer vision applications. In intelligent edge devices, sensor data streams acquired by the device are processed by a DNN application running on either the edge device itself or in the cloud. However, ‘edge-only’ and ‘cloud-only’ execution of State-of-the-Art DNNs may not meet an application’s latency requirements due to the limited compute, memory, and energy resources in edge devices, dynamically varying bandwidth of edge-cloud connectivity networks, and temporal variations in the computational load of cloud servers. This work investigates distributed (partitioned) inference across edge devices (mobile/end device) and cloud servers to minimize end-to-end DNN inference latency. We study the impact of temporally varying operating conditions and the underlying compute and communication architecture on the decision of whether to run the inference solely on the edge, entirely in the cloud, or by partitioning the DNN model execution among the two. Leveraging the insights gained from this study and the wide variation in the capabilities of various edge platforms that run DNN inference, we propose PArtNNer, a platform-agnostic adaptive DNN partitioning algorithm that finds the optimal partitioning point in DNNs to minimize inference latency. PArtNNer can adapt to dynamic variations in communication bandwidth and cloud server load without requiring pre-characterization of underlying platforms. Experimental results for six image classification and object detection DNNs on a set of five commercial off-the-shelf compute platforms and three communication standards indicate that PArtNNer results in 10.2× and 3.2× (on average) and up to 21.1× and 6.7× improvements in end-to-end inference latency compared to execution of the DNN entirely on the edge device or entirely on a cloud server, respectively. Compared to pre-characterization-based partitioning approaches, PArtNNer converges to the optimal partitioning point 17.6× faster.
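Stripped to its core, the partitioning decision is a one-dimensional search over layer boundaries: run layers 1..k on the edge, ship layer k's activations, finish in the cloud. The sketch below shows that search under made-up per-layer profiles; the latency model is illustrative, and the real PArtNNer adapts online without such pre-characterization.

```python
# Illustrative partition-point search for edge-cloud DNN inference.
# Profiles and bandwidth are made-up numbers; PArtNNer itself adapts
# online without pre-characterizing the platform.

def best_partition(edge_ms, cloud_ms, act_kb, bandwidth_kbps):
    """Return (k, latency_ms) minimizing end-to-end latency, where k
    layers run on the edge (k=0: cloud-only, k=len: edge-only)."""
    n = len(edge_ms)
    best = None
    for k in range(n + 1):
        # act_kb[k] = data shipped after k layers; edge-only ships
        # nothing (the result is consumed on-device in this toy model).
        transfer = 0.0 if k == n else act_kb[k] * 8 * 1000 / bandwidth_kbps
        latency = sum(edge_ms[:k]) + transfer + sum(cloud_ms[k:])
        if best is None or latency < best[1]:
            best = (k, latency)
    return best

edge_ms  = [4.0, 9.0, 9.0, 6.0]            # per-layer latency on the edge
cloud_ms = [0.5, 1.0, 1.0, 0.8]            # per-layer latency in the cloud
act_kb   = [200.0, 150.0, 40.0, 8.0, 4.0]  # activation sizes at each cut
print(best_partition(edge_ms, cloud_ms, act_kb, bandwidth_kbps=50_000))
# -> (2, 21.2): split after layer 2 beats edge-only (28.0), cloud-only (35.3)
```

Because the transfer term swings with bandwidth and the cloud term with server load, the optimal k moves at runtime, which is the situation PArtNNer's adaptive algorithm is built to track.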
Citations: 0
A Hierarchical Classification Method for High-Accuracy Instruction Disassembly with Near-Field EM Measurements
CAS Tier 3, Computer Science | Q2 Computer Science, Hardware & Architecture | Pub Date: 2023-10-25 | DOI: 10.1145/3629167
Vishnuvardhan V. Iyer, Aditya Thimmaiah, Michael Orshansky, Andreas Gerstlauer, Ali E. Yilmaz
Electromagnetic (EM) fields have been extensively studied as potent side-channel tools for testing the security of hardware implementations. In this work, a low-cost side-channel disassembler that uses fine-grained EM signals to predict a program's execution trace with high accuracy is proposed. Unlike conventional side-channel disassemblers, the proposed disassembler does not require extensive randomized instantiations of instructions to profile them, instead relying on leakage-model-informed sub-sampling of potential architectural states resulting from instruction execution, which is further augmented by using a structured hierarchical approach. The proposed disassembler consists of two phases: (i) In the feature-selection phase, signals are collected with a relatively small EM probe, performing high-resolution scans near the chip surface, as profiling codes are executed. The measured signals from the numerous probe configurations are compiled into a hierarchical database by storing the min-max envelopes of the probed EM fields and differential signals derived from them, a novel dimension that increases the potency of the analysis. The envelope-to-envelope distances are evaluated throughout the hierarchy to identify optimal measurement configurations that maximize the distance between each pair of instruction classes. (ii) In the classification phase, signals measured for unknown instructions using optimal measurement configurations identified in the first phase are compared to the envelopes stored in the database to perform binary classification with majority voting, identifying candidate instruction classes at each hierarchical stage. Both phases of the disassembler rely on a 4-stage hierarchical grouping of instructions by their length, size, operands, and functions. The proposed disassembler is shown to recover ~97–99% of instructions from several test and application benchmark programs executed on the AT89S51 microcontroller.
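The envelope matching at the heart of the classification phase is simple to sketch: keep a per-sample min-max envelope per instruction class from profiling traces, then charge an unknown trace for every sample that escapes each envelope and pick the best fit. The code below is a schematic reconstruction under assumed trace shapes and synthetic data, not the authors' measurement tooling; the paper additionally votes across many probe configurations.

```python
# Schematic sketch of min-max envelope classification for EM traces.
# Trace shapes and data are synthetic assumptions, not measurements.
import numpy as np

def build_envelope(profiling_traces):
    """(num_traces, num_samples) array -> per-sample (lo, hi) envelope."""
    t = np.asarray(profiling_traces)
    return t.min(axis=0), t.max(axis=0)

def envelope_distance(trace, lo, hi):
    """Total amount by which the trace escapes the [lo, hi] envelope."""
    below = np.clip(lo - trace, 0.0, None)
    above = np.clip(trace - hi, 0.0, None)
    return float(np.sum(below + above))

def classify(trace, envelopes):
    """Pick the instruction class whose envelope the trace fits best."""
    return min(envelopes, key=lambda c: envelope_distance(trace, *envelopes[c]))

rng = np.random.default_rng(0)
envelopes = {
    "MOV": build_envelope(1.0 + 0.1 * rng.standard_normal((50, 64))),
    "ADD": build_envelope(1.5 + 0.1 * rng.standard_normal((50, 64))),
}
unknown = 1.5 + 0.1 * rng.standard_normal(64)
print(classify(unknown, envelopes))  # -> ADD
```

The hierarchical part then repeats this comparison stage by stage — length, size, operands, function — so each stage only has to separate a few candidate classes.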
Citations: 1