Energy-efficient Communications for Improving Timely Progress of Intermittent Powered BLE Devices
Chen-Tui Hung, Kai Xuan Lee, Yi-Zheng Liu, Ya-Shu Chen, Zhong-Han Chan
ACM Transactions on Embedded Computing Systems, 9 November 2023. https://doi.org/10.1145/3626197

Battery-less devices offer potential solutions for maintaining sustainable Internet of Things (IoT) networks. However, limited energy harvesting capacity can lead to power failures, limiting the system's quality of service (QoS). To improve timely task progress, we present ETIME, a scheduling framework that enables energy-efficient communication for intermittently powered IoT devices. To maximize energy efficiency while meeting the timeliness requirements of intermittent systems, we first model the relationship between insufficient harvested energy and task timing behavior. We then propose a method for predicting the response times of battery-less devices. Considering both the delays caused by interference among multiple tasks and insufficient system energy, we introduce a dynamic wake-up strategy to improve timely task progress. Additionally, to minimize the power consumed by connection components, we propose a dynamic connection-interval adjustment to provide energy-efficient communication. The proposed algorithms are implemented in a lightweight operating system on real devices. Experimental results show that our approach significantly improves the progress of timely applications while maintaining overall task progress.
A Comprehensive Model for Efficient Design Space Exploration of Imprecise Computational Blocks
Mohammad Haji Seyed Javadi, Mohsen Faryabi, Hamid Reza Mahdiani
ACM Transactions on Embedded Computing Systems, 9 November 2023. https://doi.org/10.1145/3625555

After almost a decade of research, the development of more efficient imprecise computational blocks remains a major concern in the imprecise computing domain. Many imprecise components of different types have been introduced, and their main difference lies in the precision-cost-performance trade-offs they offer. In this paper, a novel comprehensive model for imprecise components is introduced that can be exploited to cover a wide range of precision-cost-performance trade-offs for different types of imprecise components. The model helps find a suitable imprecise component for any desired error criterion. The most significant advantage of the proposed model is therefore that it can be used directly for design space exploration of different imprecise components, extracting those with the desired precision-cost-performance trade-off for a specific application. To demonstrate the efficiency of the proposed model, two novel families of Lowest-cost Imprecise Adders (LIAs) and Lowest-cost Imprecise Multipliers (LIMs) are introduced, extracted systematically by exploring the design space provided by the proposed model. A wide range of simulation and synthesis results shows that the systematically extracted LIA/LIM structures are comparable in efficiency to the most efficient existing human-made imprecise components, both individually and in a Multiply-Accumulate application.
Online Distributed Schedule Randomization to Mitigate Timing Attacks in Industrial Control Systems
Ankita Samaddar, Arvind Easwaran
ACM Transactions on Embedded Computing Systems, 9 November 2023. https://doi.org/10.1145/3624584

Industrial control systems (ICSs) consist of a large number of control applications that are associated with periodic real-time flows with hard deadlines. To facilitate large-scale integration, remote control, and coordination, wireless sensor and actuator networks form the main communication framework in most ICSs. Among the existing wireless sensor and actuator network protocols, WirelessHART is the most suitable protocol for real-time applications in ICSs. Communication in a WirelessHART network is based on time-division multiple access. To satisfy the hard deadlines of the real-time flows, the schedule in a WirelessHART network is pre-computed, and the same schedule is repeated over every hyperperiod (i.e., the least common multiple of the periods of the flows). However, a malicious attacker can exploit the repetitive behavior of the flow schedules to launch timing attacks (e.g., selective jamming attacks). To mitigate timing attacks, we propose an online distributed schedule randomization strategy that randomizes the time slots in the schedules at each network device without violating the flow deadlines, while ensuring closed-loop control stability. To further increase the extent of randomization in the schedules and to reduce the energy consumption of the system, we incorporate a period adaptation strategy that adjusts the transmission periods of the flows at runtime depending on the stability of the control loops. We use the Kullback-Leibler divergence and the prediction probability of slots as two metrics to evaluate the performance of our proposed strategy, and we compare our strategy with an offline centralized schedule randomization strategy. Experimental results show that the schedules generated by our strategy are 10% to 15% more diverse and 5% to 10% less predictable on average than those of the offline strategy when the number of base schedules and keys varies between 4 and 6 and between 12 and 32, respectively, across all slot utilizations (the number of occupied slots in a hyperperiod). When period adaptation is incorporated, the divergence in the schedules decreases with each period increase, with 46% less power consumption on average.
Design and Analysis of High Performance Heterogeneous Block-based Approximate Adders
Ebrahim Farahmand, Ali Mahani, Muhammad Abdullah Hanif, Muhammad Shafique
ACM Transactions on Embedded Computing Systems, 9 November 2023. https://doi.org/10.1145/3625686

Approximate computing is an emerging paradigm to improve the power and performance efficiency of error-resilient applications. As adders are one of the key components in almost all processing systems, a significant amount of research has been carried out toward designing approximate adders that can offer better efficiency than conventional designs; however, at the cost of some accuracy loss. In this article, we highlight a new class of energy-efficient approximate adders, namely, Heterogeneous Block-based Approximate Adders (HBAAs), and propose a generic configurable adder model that can be configured to represent a particular HBAA configuration. An HBAA, in general, is composed of heterogeneous sub-adder blocks of equal length, where each sub-adder can be an approximate sub-adder and have a different configuration. The sub-adders are mainly approximated through inexact logic and carry truncation. Compared to the existing design space, HBAAs provide additional design points that fall on the Pareto front and offer a better quality-efficiency tradeoff in certain scenarios. Furthermore, to enable efficient design space exploration based on user-defined constraints, we propose an analytical model to efficiently evaluate the Probability Mass Function (PMF) of approximation error and other error metrics, such as Mean Error Distance (MED), Normalized Mean Error Distance (NMED), and Error Rate (ER) of HBAAs. The results show that HBAA configurations can provide around 15% reduction in area and up to 17% reduction in energy compared to state-of-the-art approximate adders.
Enabling Binary Neural Network Training on the Edge
Erwei Wang, James J. Davis, Daniele Moro, Piotr Zielinski, Jia Jie Lim, Claudionor Coelho, Satrajit Chatterjee, Peter Y. K. Cheung, George A. Constantinides
ACM Transactions on Embedded Computing Systems, 9 November 2023. https://doi.org/10.1145/3626100

The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this article, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions while inducing little to no accuracy loss versus Courbariaux and Bengio's standard approach. These decreases are primarily enabled through the retention of activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees memory requirement reductions of 3–5×, while reaching similar test accuracy (±2 pp) in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of binarized ResNet-18, achieving a 3.78× memory reduction. Our work is open source and includes the Raspberry Pi-targeted prototype we used to verify our modeled memory decreases and capture the associated energy drops. Such savings will allow unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency, and safeguarding end-user privacy.
Special Issue: "AI Acceleration on FPGAs"
Yun (Eric) Liang (Peking University), Wei Zhang (The Hong Kong University of Science and Technology), Stephen Neuendorffer (Xilinx), Wayne Luk (Imperial College London)
ACM Transactions on Embedded Computing Systems, Volume 22, Issue 6, Article 89, pp. 1–3, 9 November 2023. https://doi.org/10.1145/3626323

Introduction to the special issue.
Dynamic Thermal Management of 3D Memory through Rotating Low Power States and Partial Channel Closure
Lokesh Siddhu, Aritra Bagchi, Rajesh Kedia, Isaar Ahmad, Shailja Pandey, Preeti Ranjan Panda
ACM Transactions on Embedded Computing Systems, 9 November 2023. https://doi.org/10.1145/3624581

Modern high-performance and high-bandwidth three-dimensional (3D) memories are characterized by frequent heating. Prior art suggests turning off hot channels and migrating data to the background DDR memory, incurring significant performance and energy overheads. We propose three Dynamic Thermal Management (DTM) approaches for 3D memories that reduce these overheads. The first approach, Rotating-channel Low-power-state-based DTM (RL-DTM), minimizes energy overheads by avoiding data migration. RL-DTM places 3D memory channels into low power states instead of turning them off. Since data accesses are disallowed during a low power state, RL-DTM balances each channel's low-power-state duration. The second approach, Masked rotating-channel Low-power-state-based DTM (ML-DTM), is a fine-grained policy that minimizes the energy-delay product (EDP) and improves the performance of RL-DTM by considering the channel access rate. The third strategy, Partial channel closure and ML-DTM, minimizes the performance overheads of existing channel-level turn-off-based policies by closing a channel only partially and integrating ML-DTM, reducing the number of channels being turned off. We evaluate the proposed DTM policies using various mixes of SPEC benchmarks and multi-threaded workloads and observe that they significantly improve performance, energy, and EDP over state-of-the-art approaches for different 3D memory architectures.
SG-Float: Achieving Memory Access and Computing Power Reduction Using Self-Gating Float in CNNs
Jun-Shen Wu, Tsen-Wei Hsu, Ren-Shuo Liu
ACM Transactions on Embedded Computing Systems, 9 November 2023. https://doi.org/10.1145/3624582

Convolutional neural networks (CNNs) are essential for advancing the field of artificial intelligence. However, since these networks are highly demanding in terms of memory and computation, implementing CNNs can be challenging. To make CNNs more accessible to energy-constrained devices, researchers are exploring new algorithmic techniques and hardware designs that can reduce memory and computation requirements. In this work, we present self-gating float (SG-Float), an algorithm-hardware co-design of a novel binary number format, which can significantly reduce memory access and computing power requirements in CNNs. SG-Float is a self-gating format that uses the exponent to self-gate the mantissa to zero, exploiting the fact that the exponent determines the magnitude of a floating-point value, together with the error tolerance of CNNs. SG-Float represents relatively small values using only the exponent, which increases the proportion of ineffective mantissas and thus reduces the number of mantissa multiplications. To minimize the accuracy loss caused by the approximation error introduced by SG-Float, we propose a fine-tuning process that determines the exponent thresholds of SG-Float and reclaims the accuracy loss. We also develop a hardware optimization technique, called the SG-Float buffering strategy, to best match SG-Float with CNN accelerators and further reduce memory access. We apply the SG-Float buffering strategy to the vector-vector multiplication processing elements (PEs) adopted by NVDLA, in TSMC 40 nm technology. Our evaluation results demonstrate that SG-Float can achieve up to 35% reduction in memory access power and up to 54% reduction in computing power compared with AdaptivFloat, a state-of-the-art format, with negligible power and area overhead. Additionally, we show that SG-Float can be combined with neural network pruning methods to further reduce memory access and mantissa multiplications in pruned CNN models. Overall, our work shows that SG-Float is a promising solution to the problem of CNN memory access and computing power.
PArtNNer: Platform-agnostic Adaptive Edge-Cloud DNN Partitioning for minimizing End-to-End Latency
Soumendu Kumar Ghosh, Arnab Raha, Vijay Raghunathan, Anand Raghunathan
ACM Transactions on Embedded Computing Systems, 27 October 2023. https://doi.org/10.1145/3630266

The last decade has seen the emergence of Deep Neural Networks (DNNs) as the de facto algorithm for various computer vision applications. In intelligent edge devices, sensor data streams acquired by the device are processed by a DNN application running on either the edge device itself or in the cloud. However, "edge-only" and "cloud-only" execution of state-of-the-art DNNs may not meet an application's latency requirements due to the limited compute, memory, and energy resources in edge devices, the dynamically varying bandwidth of edge-cloud connectivity networks, and temporal variations in the computational load of cloud servers. This work investigates distributed (partitioned) inference across edge devices (mobile/end devices) and cloud servers to minimize end-to-end DNN inference latency. We study the impact of temporally varying operating conditions and the underlying compute and communication architecture on the decision of whether to run the inference solely on the edge, entirely in the cloud, or by partitioning the DNN model execution between the two. Leveraging the insights gained from this study and the wide variation in the capabilities of various edge platforms that run DNN inference, we propose PArtNNer, a platform-agnostic adaptive DNN partitioning algorithm that finds the optimal partitioning point in DNNs to minimize inference latency. PArtNNer can adapt to dynamic variations in communication bandwidth and cloud server load without requiring pre-characterization of underlying platforms. Experimental results for six image classification and object detection DNNs on a set of five commercial off-the-shelf compute platforms and three communication standards indicate that PArtNNer results in 10.2× and 3.2× (on average) and up to 21.1× and 6.7× improvements in end-to-end inference latency compared to execution of the DNN entirely on the edge device or entirely on a cloud server, respectively. Compared to pre-characterization-based partitioning approaches, PArtNNer converges to the optimal partitioning point 17.6× faster.
A Hierarchical Classification Method for High-Accuracy Instruction Disassembly with Near-Field EM Measurements
Vishnuvardhan V. Iyer, Aditya Thimmaiah, Michael Orshansky, Andreas Gerstlauer, Ali E. Yilmaz
ACM Transactions on Embedded Computing Systems, 25 October 2023. https://doi.org/10.1145/3629167

Electromagnetic (EM) fields have been extensively studied as potent side-channel tools for testing the security of hardware implementations. In this work, a low-cost side-channel disassembler that uses fine-grained EM signals to predict a program's execution trace with high accuracy is proposed. Unlike conventional side-channel disassemblers, the proposed disassembler does not require extensive randomized instantiations of instructions to profile them, instead relying on leakage-model-informed sub-sampling of potential architectural states resulting from instruction execution, which is further augmented by using a structured hierarchical approach. The proposed disassembler consists of two phases: (i) In the feature-selection phase, signals are collected with a relatively small EM probe, performing high-resolution scans near the chip surface, as profiling codes are executed. The measured signals from the numerous probe configurations are compiled into a hierarchical database by storing the min-max envelopes of the probed EM fields and differential signals derived from them, a novel dimension that increases the potency of the analysis. The envelope-to-envelope distances are evaluated throughout the hierarchy to identify optimal measurement configurations that maximize the distance between each pair of instruction classes. (ii) In the classification phase, signals measured for unknown instructions using optimal measurement configurations identified in the first phase are compared to the envelopes stored in the database to perform binary classification with majority voting, identifying candidate instruction classes at each hierarchical stage. Both phases of the disassembler rely on a 4-stage hierarchical grouping of instructions by their length, size, operands, and functions. The proposed disassembler is shown to recover ∼97-99% of instructions from several test and application benchmark programs executed on the AT89S51 microcontroller.