Yungang Pan, Rouhollah Mahfouzi, Soheil Samii, Petru Eles, Zebo Peng
The fifth-generation (5G) technology standard in telecommunications is expected to support ultra-reliable low-latency communication to enable real-time applications such as industrial automation and control. 5G configured grant (CG) scheduling features a pre-allocated, periodicity-based scheduling approach, which reduces control signaling time and guarantees service quality. Although this enables 5G to support hard real-time periodic traffic, synthesizing the schedule efficiently and achieving high resource efficiency while serving multiple communications remains an open problem. In this work, we study the trade-off between scheduling flexibility and control overhead when performing CG scheduling. To address the CG scheduling problem, we first formulate it using satisfiability modulo theories (SMT) so that an SMT solver can be used to generate optimal solutions. To enhance scalability, we propose two heuristic approaches. The first, Co1, serves as the baseline and follows the basic idea of the 5G CG scheduling scheme, minimizing the control overhead. The second, CoU, enables increased scheduling flexibility while accounting for the involved control overhead. The effectiveness and scalability of the proposed techniques, and the superiority of CoU over Co1, have been evaluated using a large number of generated benchmarks as well as a realistic case study for industrial automation.
"Multi-Traffic Resource Optimization for Real-Time Applications with 5G Configured Grant Scheduling." Yungang Pan, Rouhollah Mahfouzi, Soheil Samii, Petru Eles, Zebo Peng. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3664621. Published 2024-05-28.
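As a minimal illustration of the SMT formulation mentioned in the abstract above (not the authors' actual model), the sketch below uses the Z3 Python bindings to assign a start offset to each periodic traffic flow so that, assuming unit-length transmissions and a single shared resource, no two transmissions collide within the hyperperiod. The flow set and periods are hypothetical.

```python
# Hedged sketch: periodic configured-grant slot assignment posed as an SMT problem.
# Assumes the z3-solver package; the flows, periods, and single-channel model are
# illustrative, not the paper's formulation.
from math import lcm
from z3 import Int, Solver, And, sat

flows = {"f1": 4, "f2": 6, "f3": 8}          # flow -> period (in slots), hypothetical
hyper = lcm(*flows.values())                 # the schedule repeats every hyperperiod

s = Solver()
offset = {f: Int(f"off_{f}") for f in flows}

# Each flow gets one unit-length slot per period, starting at its offset.
for f, p in flows.items():
    s.add(And(offset[f] >= 0, offset[f] < p))

# No two flows may ever occupy the same slot within the hyperperiod.
for fa, pa in flows.items():
    for fb, pb in flows.items():
        if fa < fb:
            for ia in range(hyper // pa):
                for ib in range(hyper // pb):
                    s.add(offset[fa] + ia * pa != offset[fb] + ib * pb)

if s.check() == sat:
    m = s.model()
    print({f: m[offset[f]].as_long() for f in flows})
else:
    print("no collision-free configured-grant assignment exists")
```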
A Wireless Sensor Network (WSN) comprises an ad-hoc network of sensor-laden nodes used to monitor a region, typically outdoors and often not easily accessible. Despite exceptions, many WSN deployments continue to grapple with the limitation of finite battery energy. It is therefore imperative that the energy of a WSN be conserved and its lifetime prolonged. An important direction of work to this end is transmitting data between nodes in a manner that expends minimum energy. One approach is cluster-based routing, wherein the nodes of a WSN are organised into clusters and data from each node is transmitted through a representative node called a cluster-head. Forming optimal clusters and choosing an optimal cluster-head is an NP-hard problem. Significant work has been done on mechanisms that form clusters and choose cluster-heads so as to keep the transmission overhead to a minimum. In this paper, an approach is proposed to create clusters and identify cluster-heads that are near optimal. The approach involves two-stage clustering, with the clustering algorithm for each stage chosen through an exhaustive search. Furthermore, unlike existing approaches that choose a cluster-head solely on the basis of the residual energy of nodes, the proposed approach utilises three factors in addition to residual energy, namely the distance of a node from the cluster centroid, the distance of a node from the final destination (base station), and the connectivity of the node. The approach is shown to be effective and economical through extensive validation via simulations and a real-world prototypical implementation.
"Dynamic Cluster Head Selection in WSN." Rupendra Pratap Singh Hada, Abhishek Srivastava. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3665867. Published 2024-05-25.
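A compact way to picture the multi-factor cluster-head choice described above is a weighted per-node score combining the four factors named in the abstract; the weights, field names, and node data below are hypothetical and not the paper's calibration.

```python
# Hedged sketch: pick a cluster-head from residual energy plus the three additional
# factors named in the abstract. Weights and node values are illustrative only and
# assume inputs are on comparable, pre-normalised scales.
from dataclasses import dataclass
from math import dist

@dataclass
class Node:
    pos: tuple               # (x, y) coordinates
    residual_energy: float   # remaining energy
    connectivity: int        # number of reachable neighbours

def choose_cluster_head(cluster, centroid, base_station,
                        w_energy=0.4, w_centroid=0.2, w_bs=0.2, w_conn=0.2):
    def score(n):
        return (w_energy * n.residual_energy
                - w_centroid * dist(n.pos, centroid)     # prefer nodes near the cluster centre
                - w_bs * dist(n.pos, base_station)       # prefer nodes near the base station
                + w_conn * n.connectivity)               # prefer well-connected nodes
    return max(cluster, key=score)

cluster = [Node((0, 0), 5.0, 3), Node((1, 1), 4.5, 6), Node((5, 5), 5.2, 1)]
print(choose_cluster_head(cluster, centroid=(1, 1), base_station=(10, 0)))
```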
Pavitra Bhade, Joseph Paturel, Olivier Sentieys, Sharad Sinha
Cache Side-Channel Attacks (CSCA) have been haunting most processor architectures for decades. Existing approaches to mitigating such attacks have certain drawbacks, namely software mishandling, performance overhead, low throughput due to false alarms, etc. Hence, “mitigation only when detected” should be the approach to minimize the effects of such drawbacks. We propose a novel methodology for fine-grained detection of timing-based CSCA using a hardware-based detection module.
We discuss the design, implementation, and use of our proposed detection module in processor architectures. Our approach successfully detects attacks that flush secret victim information from cache memory, such as Flush+Reload, Flush+Flush, Prime+Probe, Evict+Probe, and Prime+Abort, commonly known as cache timing attacks. Detection is timely, with minimal performance overhead. The parameterizable number of counters used in our module allows detection of multiple attacks on multiple sensitive locations simultaneously. The fine-grained nature ensures negligible false alarms, severely reducing the need for any unnecessary mitigation. The proposed work is evaluated by synthesizing the entire detection algorithm as an attack detection block, Edge-CaSCADe, in a RISC-V processor as a target example. The detection results are checked under different workload conditions with respect to the number of attackers, the number of victims running RSA-, AES-, and ECC-based encryption schemes such as ECIES, and on benchmark applications like MiBench and Embench. More than 98% detection accuracy within 2% of the beginning of an attack can be achieved with negligible false alarms. The detection module has an area overhead of 0.9% to 2% and a power overhead of 1% to 2.1% for the targeted RISC-V processor core without cache, for 1 to 5 counters, respectively. The detection module does not affect the processor critical path and hence has no impact on its maximum operating frequency.
"Lightweight Hardware-Based Cache Side-Channel Attack Detection for Edge Devices (Edge-CaSCADe)." Pavitra Bhade, Joseph Paturel, Olivier Sentieys, Sharad Sinha. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3663673. Published 2024-05-11.
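The counter-based idea can be pictured in software as follows: each counter watches one sensitive cache line and raises an alarm when evictions of that line within a short observation window exceed a threshold. The window length, threshold, and addresses below are hypothetical; Edge-CaSCADe itself implements the equivalent logic in hardware.

```python
# Hedged software model of per-address eviction counting, loosely mirroring one
# hardware counter per monitored (sensitive) cache line. Thresholds are illustrative.
from collections import deque

class EvictionCounter:
    def __init__(self, monitored_line, window_cycles=1000, threshold=20):
        self.line = monitored_line
        self.window = window_cycles
        self.threshold = threshold
        self.events = deque()          # cycle numbers of recent evictions

    def on_eviction(self, line, cycle):
        if line != self.line:
            return False
        self.events.append(cycle)
        # Drop evictions that fell outside the observation window.
        while self.events and self.events[0] <= cycle - self.window:
            self.events.popleft()
        # An unusually dense burst of evictions of a secret-dependent line is
        # treated as a Flush/Prime-style attack indication.
        return len(self.events) >= self.threshold

counter = EvictionCounter(monitored_line=0xdeadbeef >> 6)
alarms = [counter.on_eviction(0xdeadbeef >> 6, c) for c in range(0, 2000, 40)]
print(any(alarms))
```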
Function layout, also known as function reordering or function placement, is one of the most effective profile-guided compiler optimizations. By reordering functions in a binary, compilers can improve the performance of large-scale applications or reduce the compressed size of mobile applications. Although the technique has been extensively studied in the context of large-scale binaries, no study has thoroughly investigated function layout algorithms on mobile applications.
In this paper, we develop the first principled solution for optimizing function layouts in the mobile space. To this end, we identify two key optimization goals: reducing the compressed code size and improving the cold start-up time of a mobile application. We then propose a formal model for the layout problem, whose objective closely matches our goals, and a novel algorithm for optimizing the layout. The method is inspired by the classic balanced graph partitioning problem. We have carefully engineered and implemented the algorithm in an open-source compiler, LLVM. An extensive evaluation of the new method on large commercial mobile applications demonstrates improvements in start-up time and compressed size compared to the state-of-the-art approach.
"Reordering Functions in Mobiles Apps for Reduced Size and Faster Start-Up." Ellis Hoag, Kyungwoo Lee, Julián Mestre, Sergey Pupyrev, YongKang Zhu. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3660635. Published 2024-04-20.
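To give a flavour of layout by balanced partitioning (a toy stand-in, not LLVM's or the paper's algorithm), the sketch below recursively bisects a set of functions so that functions that co-occur in the same profiled start-up traces stay on the same side, then concatenates the halves; the trace data is hypothetical.

```python
# Hedged sketch: order functions by recursive, roughly balanced bisection driven by
# co-occurrence in profiled traces. Illustrative only; not the paper's algorithm.
def layout(functions, traces, depth=0):
    if len(functions) <= 2 or depth > 16:
        return sorted(functions)
    left, right = set(), set()
    for f in sorted(functions):
        aff_left = sum(1 for t in traces if f in t and left & t)
        aff_right = sum(1 for t in traces if f in t and right & t)
        # Greedy split: prefer the side sharing more traces with f; break ties toward
        # the smaller side to keep the partition balanced.
        if aff_left > aff_right or (aff_left == aff_right and len(left) <= len(right)):
            left.add(f)
        else:
            right.add(f)
    return layout(left, traces, depth + 1) + layout(right, traces, depth + 1)

traces = [{"main", "init_ui", "load_cfg"}, {"main", "init_ui"}, {"render", "draw"}]
print(layout({"main", "init_ui", "load_cfg", "render", "draw"}, traces))
```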
One primary objective of drone simulation is to evaluate diverse drone configurations and contexts aligned with specific user objectives. The initial challenge for simulator designers involves managing the heterogeneity of drone components, encompassing both software and hardware systems, as well as the drone’s behavior. To facilitate the integration of these diverse models, the Functional Mock-Up Interface (FMI) for Co-Simulation proposes a generic data-oriented interface. However, an additional challenge lies in simplifying the configuration of co-simulation, necessitating an approach to guide the modeling of parametric features and operational conditions such as failures or environment changes.
The paper addresses this challenge by introducing CARES, a Model-Driven Engineering (MDE) and component-based approach for designing drone simulators, integrating the Functional Mock-Up Interface (FMI) for Co-Simulation. The proposed models incorporate concepts from Component-Based Software Engineering (CBSE) and FMI. The NAVIDRO architectural style is presented for designing and configuring drone co-simulation. CARES utilizes a code generator to produce structural glue code (Java or C++), facilitating the integration of FMI-based domain-specific code. The approach is evaluated through the development of a simulator for navigation functions in an Autonomous Underwater Vehicle (AUV), demonstrating its effectiveness in assessing various AUV configurations and contexts.
"NAVIDRO, a CARES architectural style for configuring drone co-simulation." Loic Salmon, Pierre-Yves Pillain, Goulven Guillou, Jean-Philippe Babau. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3651889. Published 2024-03-17.
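The FMI-for-Co-Simulation loop that such a simulator ultimately drives can be summarized as: at each macro step, propagate each unit's outputs to connected inputs, then advance every unit by one step. The sketch below is a generic, simplified master with stand-in "FMU" objects in Python; CARES itself generates Java or C++ glue code against the real FMI API, so all names here are assumptions.

```python
# Hedged sketch of a fixed-step FMI-style co-simulation master loop with stand-in
# FMU objects. Class, port, and connection names are illustrative only.
class StubFMU:
    def __init__(self, name):
        self.name, self.inputs, self.outputs = name, {}, {}

    def do_step(self, t, h):
        # A real FMU would integrate its model from t to t + h here.
        self.outputs["y"] = self.inputs.get("u", 0.0) + h

def run_master(fmus, connections, t_end=1.0, h=0.1):
    t = 0.0
    while t < t_end:
        # Propagate outputs to connected inputs, then step every FMU.
        for (src, out_var), (dst, in_var) in connections.items():
            fmus[dst].inputs[in_var] = fmus[src].outputs.get(out_var, 0.0)
        for fmu in fmus.values():
            fmu.do_step(t, h)
        t += h

fmus = {"sensor": StubFMU("sensor"), "navigation": StubFMU("navigation")}
run_master(fmus, {("sensor", "y"): ("navigation", "u")})
print(fmus["navigation"].outputs)
```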
As the Internet of Things (IoT) increasingly incorporates AI technology, it has become a trend to deploy neural network algorithms at the edge and make IoT devices more intelligent than ever. Moreover, IoT devices based on energy-harvesting technology offer the advantages of a green, low-carbon economy, convenient maintenance, and a theoretically infinite lifetime. However, the harvested energy is often unstable, resulting in low performance because a fixed load cannot sufficiently utilize the harvested energy. To address this problem, recent works on ReRAM-based convolutional neural network (CNN) accelerators under harvested energy have proposed hardware/software optimizations. However, those works have overlooked the mismatch between the power requirements of different CNN layers and the variation of the harvested power.
Motivated by the above observation, this paper proposes a novel strategy, called REC, that retimes the convolutional layers of CNN inferences to improve the performance and energy efficiency of energy-harvesting ReRAM-based accelerators. Specifically, at the offline stage, REC defines different power levels to fit the power requirements of different convolutional layers. At runtime, instead of executing the convolutional layers of an inference sequentially one by one, REC retimes the execution timeframe of different convolutional layers so as to accommodate different CNN layers to the changing power inputs. Furthermore, REC provides a parallel strategy to fully utilize very high power inputs. A case study is presented to show that REC is effective in improving the real-time completion of periodic critical inferences, because REC provides an opportunity for critical inferences to preempt the process window with a high power supply. Our experimental results show that the proposed REC scheme achieves an average performance improvement of 6.1x (up to 16.5x) compared to the traditional strategy without the REC idea. The case study results show that the REC scheme can significantly improve the success rate of real-time completion of periodic critical inferences.
"REC: REtime Convolutional layers to fully exploit harvested energy for ReRAM-based CNN accelerators." Kunyu Zhou, Keni Qiu. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3652593. Published 2024-03-15.
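The retiming idea can be illustrated with a simple runtime policy: given the currently harvested power, pick a pending convolutional layer (possibly from another inference) whose offline-profiled power requirement fits, rather than insisting on the next layer of the current inference. The power figures, layer names, and selection rule below are hypothetical, not the paper's scheduler.

```python
# Hedged sketch of REC-style layer retiming: at each timeframe, run a pending layer
# whose offline-profiled power requirement fits the harvested power, respecting layer
# order within each inference. All numbers are illustrative only.

# (power requirement, inference id, layer index) -- hypothetical offline profile
pending = [(30, "infA", 0), (80, "infA", 1), (25, "infB", 0), (60, "infB", 1)]
harvest_trace = [35, 90, 20, 70, 100]   # harvested power per timeframe, hypothetical

schedule = []
for t, power in enumerate(harvest_trace):
    # A layer is runnable only if all earlier layers of its inference have finished.
    feasible = [
        (req, inf, idx) for (req, inf, idx) in pending
        if req <= power and all(not (i == inf and j < idx) for (_, i, j) in pending)
    ]
    if not feasible:
        schedule.append((t, None))       # too little power: stay idle this timeframe
        continue
    # Pick the most power-hungry feasible layer so high-power inputs are not wasted.
    job = max(feasible)
    pending.remove(job)
    schedule.append((t, job))

print(schedule)
```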
Remote IoT devices face significant security risks due to their inherent physical vulnerability. An adversarial actor with sufficient capability can monitor the devices or exfiltrate data to access sensitive information. Remotely deployed devices such as sensors need enhanced resilience against memory leakage if they perform privileged tasks. To increase the security and trust of these devices, we present a novel framework implementing a privacy homomorphism that creates sensor data directly in an encoded format. The sensor data is permuted at the time of creation in a manner that appears random to an observer. A separate secure server in communication with the device provides the information needed for the device to perform processing on the encoded data, but not to decode the result. The device transmits the encoded results to the secure server, which retains the ability to interpret them. In this paper we show how this framework works for an image sensor calculating differences between a stream of images, with initial results showing an overhead as low as 266% in terms of throughput when compared to computing on standard unencoded numbers such as two's complement. We further show a 5,000x speedup over a recent homomorphic encryption ASIC.
"Implementing Privacy Homomorphism with Random Encoding and Computation Controlled by a Remote Secure Server." Kevin Hutto, Vincent Mooney. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3651617. Published 2024-03-08.
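One way to picture the encode-at-creation idea: the sensor applies a secret permutation (known only to the secure server) to each image's pixels as it is captured, computes frame differences directly in the permuted domain, and ships only permuted results; the server inverse-permutes to interpret them. The sketch below is a toy model of that flow under these assumptions, not the paper's hardware design.

```python
# Hedged toy model of the encode-at-creation flow: pixels are permuted with a
# server-chosen secret permutation at capture time, the device differences two
# permuted frames without knowing the true pixel order, and the server undoes the
# permutation to interpret the result. Sizes and values are illustrative.
import random

N = 16                                   # pixels per (tiny) frame
secret_perm = list(range(N))
random.Random(42).shuffle(secret_perm)   # known to the secure server only

def encode(frame):                       # runs inside the sensor at capture time
    return [frame[secret_perm[i]] for i in range(N)]

def device_diff(enc_a, enc_b):           # device computes on encoded data only
    return [a - b for a, b in zip(enc_a, enc_b)]

def server_decode(enc):                  # the server inverts the permutation
    out = [0] * N
    for i, p in enumerate(secret_perm):
        out[p] = enc[i]
    return out

frame1 = [random.randint(0, 255) for _ in range(N)]
frame2 = [random.randint(0, 255) for _ in range(N)]
diff_encoded = device_diff(encode(frame1), encode(frame2))
# Decoding the encoded difference recovers the plain frame difference.
assert server_decode(diff_encoded) == [a - b for a, b in zip(frame1, frame2)]
```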
Yueting Li, Xueyan Wang, He Zhang, Biao Pan, Keni Qiu, Wang Kang, Jun Wang, Weisheng Zhao
Convolutional Neural Networks (CNNs) have significantly impacted embedded system applications across various domains. However, this exacerbates the real-time processing and hardware resource constraints of embedded systems. To tackle these issues, we propose a spin-transfer torque magnetic random-access memory (STT-MRAM)-based near memory computing (NMC) design for embedded systems. We optimize this design from three aspects: (1) a fast-pipelined STT-MRAM readout scheme provides higher memory bandwidth for the NMC design, enhancing real-time processing capability with a non-trivial area overhead; (2) a direct index compression format, in conjunction with a digital sparse matrix-vector multiplication (SpMV) accelerator, supports the various matrices of practical applications, alleviating computing resource requirements; (3) custom NMC instructions and a stream converter for NMC systems dynamically adjust available hardware resources for better utilization. Experimental results demonstrate that the memory bandwidth of STT-MRAM reaches 26.7 GB/s. The energy consumption and latency improvements of the digital SpMV accelerator are up to 64x and 1120x, respectively, across matrices with sparsity spanning from 10% to 99.8%. Single-precision and double-precision element transmission increases by up to 8x and 9.6x, respectively. Furthermore, our design achieves up to 15.9x higher throughput than state-of-the-art designs.
"Toward Energy Efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems." Yueting Li, Xueyan Wang, He Zhang, Biao Pan, Keni Qiu, Wang Kang, Jun Wang, Weisheng Zhao. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3650729. Published 2024-03-07.
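To make the SpMV-with-compressed-index part concrete, the sketch below performs sparse matrix-vector multiplication from a CSR-style compressed representation. CSR is used here as a familiar stand-in and is in the same spirit as, though not identical to, the paper's direct index compression format.

```python
# Hedged sketch: SpMV over a CSR-compressed sparse matrix, illustrating the kind of
# computation the digital SpMV accelerator performs. CSR is an illustrative stand-in
# for the paper's own direct index compression format.
def spmv_csr(values, col_idx, row_ptr, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Only the nonzeros of this row (and their column indices) are stored.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# 3x3 matrix [[2,0,0],[0,0,3],[4,5,0]] stored in CSR form
values = [2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 0, 1]
row_ptr = [0, 1, 2, 4]
print(spmv_csr(values, col_idx, row_ptr, x=[1.0, 2.0, 3.0]))   # -> [2.0, 9.0, 14.0]
```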
Linwei Niu, Danda B. Rawat, Dakai Zhu, Jonathan Musselwhite, Zonghua Gu, Qingxu Deng
Fault tolerance, energy management, and quality of service (QoS) are essential aspects of the design of real-time embedded systems. In this work, we focus on exploring methods that can simultaneously address these three critical issues under standby-sparing. The standby-sparing mechanism adopts a dual-processor architecture in which each processor dynamically plays the role of backup for the other. In this way it can provide fault tolerance against both permanent and transient faults. Due to its duplicate execution of the real-time jobs/tasks, the energy consumption of a standby-sparing system can be quite high. To reduce energy under standby-sparing, we propose three novel scheduling schemes: the first is for (1, 1)-constrained tasks, and the second and third (which can be combined into an integrated approach to maximize the overall energy reduction) are for general (m, k)-constrained tasks, which require that among any k consecutive jobs of a task, no more than (k − m) may miss their deadlines. Through extensive evaluations and performance analysis, our results demonstrate that compared with existing research, the proposed techniques can reduce energy by up to 11% for (1, 1)-constrained tasks and 25% for general (m, k)-constrained tasks while assuring (m, k)-constraints and fault tolerance as well as providing better user-perceived QoS levels under standby-sparing.
"Energy Management for Fault-Tolerant (m,k)-Constrained Real-Time Systems that Use Standby-Sparing." Linwei Niu, Danda B. Rawat, Dakai Zhu, Jonathan Musselwhite, Zonghua Gu, Qingxu Deng. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3648365. Published 2024-02-21.
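The (m, k)-constraint in the abstract, that among any k consecutive jobs of a task at most (k − m) may miss their deadlines, can be checked with a simple sliding window, as in the hedged sketch below; the job outcome history is hypothetical.

```python
# Hedged sketch: verify an (m, k)-firm constraint over a job outcome history.
# True in `met` means the job met its deadline; the data is illustrative only.
from collections import deque

def satisfies_mk(met, m, k):
    window = deque(maxlen=k)
    for hit in met:
        window.append(hit)
        # Among any k consecutive jobs, at least m must meet their deadlines,
        # i.e. at most (k - m) misses are tolerated.
        if len(window) == k and sum(window) < m:
            return False
    return True

met = [True, True, False, True, True, False, True, True]
print(satisfies_mk(met, m=2, k=3))   # at most 1 miss in any 3 consecutive jobs -> True
```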
Dogan Ulus, Thomas Ferrère, Eugene Asarin, Dejan Nickovic, Oded Maler
The rise of machine learning and cloud technologies has led to a remarkable influx of data within modern cyber-physical systems. However, extracting meaningful information from this data has become a significant challenge due to its volume and complexity. Timed pattern matching has emerged as a powerful specification-based runtime verification and temporal data analysis technique to address this challenge.
In this paper, we provide a comprehensive tutorial on timed pattern matching that ranges from the underlying algebra and pattern specification languages to performance analyses and practical case studies. Analogous to textual pattern matching, timed pattern matching is the task of finding all time periods within temporal behaviors of cyber-physical systems that match a predefined pattern. We originally introduced and solved several variants of the problem under the name of match sets, a notion that has evolved into the concept of timed relations over the past decade. Here we first formalize and present the algebra of timed relations as a standalone mathematical tool to solve the pattern matching problem for timed pattern specifications. In particular, we show how to use the algebra of timed relations to solve the pattern matching problem for timed regular expressions and metric compass logic in a unified manner. We experimentally demonstrate that our timed pattern matching approach performs and scales well in practice. We further provide in-depth insights into the similarities and fundamental differences between monitoring and matching problems as well as between regular expressions and temporal logic formulas. Finally, we illustrate the practical application of timed pattern matching through two case studies, which show how to extract structured information from temporal datasets obtained via simulations or real-world observations. These results and examples show that timed pattern matching is a rigorous and efficient technique for developing and analyzing cyber-physical systems.
"Elements of Timed Pattern Matching." Dogan Ulus, Thomas Ferrère, Eugene Asarin, Dejan Nickovic, Oded Maler. ACM Transactions on Embedded Computing Systems. DOI: 10.1145/3645114. Published 2024-02-10.
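As a minimal illustration of the match-set idea, the sketch below computes, over a piecewise-constant boolean signal, the time periods during which the signal stays true for at least a minimum duration, one of the simplest duration-constrained patterns expressible with timed regular expressions. The signal and duration bound are hypothetical, and the maximal-segment representation is a simplification of the paper's two-dimensional timed relations.

```python
# Hedged sketch: match set of the pattern "p holds throughout, with duration >= d_min"
# over a piecewise-constant boolean signal. A toy instance of timed pattern matching,
# not the paper's general algorithm.
def matches(segments, d_min):
    """segments: list of (start, end, value) with contiguous, increasing times."""
    match_set = []
    for start, end, value in segments:
        if value and end - start >= d_min:
            # Every sub-interval [t, t'] of [start, end] with t' - t >= d_min matches;
            # the maximal segment is reported here as a compact representative.
            match_set.append((start, end))
    return match_set

signal = [(0, 2, False), (2, 7, True), (7, 8, False), (8, 9, True)]
print(matches(signal, d_min=3))   # -> [(2, 7)]
```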