A Self-Sustained CPS Design for Reliable Wildfire Monitoring
Yigit Tuncel, Toygun Basaklar, Dina Carpenter-Graffy, Umit Ogras
Continuous monitoring of areas near the electric grid is critical for the prevention and early detection of devastating wildfires. Existing wildfire monitoring systems are intermittent and oblivious to local ambient risk factors, resulting in poor wildfire awareness. Ambient sensor suites deployed near the gridlines can increase the monitoring granularity and detection accuracy. However, these sensors must address two challenging and competing objectives at the same time. First, they must remain powered for years without manual maintenance due to their remote locations. Second, they must provide and transmit reliable information if and when a wildfire starts. The first objective requires aggressive energy savings and ambient energy harvesting, while the second requires the continuous operation of a range of sensors. To the best of our knowledge, this paper presents the first self-sustained cyber-physical system that dynamically co-optimizes the wildfire detection accuracy and active time of sensors. The proposed approach employs reinforcement learning to train a policy that controls the sensor operations as a function of the environment (i.e., current sensor readings), harvested energy, and battery level. The proposed cyber-physical system is evaluated extensively using real-life temperature, wind, and solar energy harvesting datasets and an open-source wildfire simulator. In long-term (5-year) evaluations, the proposed framework achieves 89% uptime, which is 46% higher than a carefully tuned heuristic approach. At the same time, it averages a 2-minute initial response time, which is at least 2.5× faster than the same heuristic approach. Furthermore, the policy network consumes 0.6 mJ per day on the TI CC2652R microcontroller using TensorFlow Lite for Micro, which is negligible compared to the daily energy consumption of the sensor suite.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3608100
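The control loop described in this abstract maps a state vector (current sensor readings, harvested energy, battery level) to a sensor-operation decision through a small policy network. Below is a minimal sketch of that inference step; the feature scaling, network sizes, and the three duty-cycle modes are illustrative assumptions, not the paper's trained policy.

```python
import numpy as np

# Hypothetical state layout: normalized sensor readings plus harvested
# power and battery level, as described in the abstract.
def make_state(temperature_c, wind_ms, harvested_mw, battery_frac):
    return np.array([temperature_c / 50.0, wind_ms / 30.0,
                     harvested_mw / 100.0, battery_frac])

def policy(state, w1, b1, w2, b2):
    # Tiny two-layer policy network; the real policy is trained with RL.
    h = np.tanh(state @ w1 + b1)
    logits = h @ w2 + b2
    return int(np.argmax(logits))  # index of a sensing duty-cycle mode

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 3)), np.zeros(3)  # 3 assumed modes: off/low/full
mode = policy(make_state(35.0, 12.0, 20.0, 0.8), w1, b1, w2, b2)
print("selected sensing mode:", mode)
```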
DASS: Differentiable Architecture Search for Sparse Neural Networks
Hamid Mousavi, Mohammad Loni, Mina Alibeigi, Masoud Daneshtalab
The deployment of Deep Neural Networks (DNNs) on edge devices is hindered by the substantial gap between performance requirements and available computational power. While recent research has made significant strides in developing pruning methods that build sparse networks to reduce the computing overhead of DNNs, considerable accuracy loss remains, especially at high pruning ratios. We find that the architectures designed for dense networks by differentiable architecture search methods are ineffective when pruning mechanisms are applied to them. The main reason is that current methods do not support sparse architectures in their search space and use a search objective that is tailored to dense networks and does not account for sparsity. This paper proposes a new method, DASS, to search for sparsity-friendly neural architectures by adding two new sparse operations to the search space and modifying the search objective. Specifically, we propose two novel parametric SparseConv and SparseLinear operations that expand the search space to include sparse operations; because they are sparse parametric versions of the convolution and linear operations, they make the search space flexible. The proposed search objective lets us train the architecture based on the sparsity of the search-space operations. Quantitative analyses demonstrate that architectures found through DASS outperform those used in state-of-the-art sparse networks on the CIFAR-10 and ImageNet datasets. In terms of performance and hardware effectiveness, DASS increases the accuracy of the sparse version of MobileNet-v2 from 73.44% to 81.35% (a +7.91% improvement) with a 3.87× faster inference time.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609385
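As a rough, hypothetical sketch of the general idea behind a parametric sparse operation (not the paper's exact SparseConv parameterization), a convolution can carry a learnable score tensor whose thresholded mask prunes individual weights; the class name, scoring scheme, and threshold below are assumptions, written against standard PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv2d(nn.Module):
    """Illustrative sparse convolution: a dense kernel gated by a
    learnable per-weight score; scores below a threshold are pruned
    at forward time."""
    def __init__(self, in_ch, out_ch, k, threshold=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.score = nn.Parameter(torch.rand(out_ch, in_ch, k, k))
        self.threshold = threshold

    def forward(self, x):
        # Hard mask here for clarity; training would typically use a
        # straight-through estimator to keep the mask differentiable.
        mask = (self.score > self.threshold).float()
        return F.conv2d(x, self.weight * mask, padding="same")

x = torch.randn(1, 3, 32, 32)
print(SparseConv2d(3, 16, 3)(x).shape)  # torch.Size([1, 16, 32, 32])
```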
Optimal Synthesis of Robust IDK Classifier Cascades
Sanjoy Baruah, Alan Burns, Robert Ian Davis
An IDK classifier is a computing component that categorizes inputs into one of a number of classes if it is able to do so with the required level of confidence; otherwise, it returns "I Don't Know" (IDK). IDK classifier cascades have been proposed as a way of balancing the needs for fast response and high accuracy in classification-based machine perception. Efficient algorithms for the synthesis of IDK classifier cascades have been derived; however, the responsiveness of these cascades is highly dependent on the accuracy of predictions regarding the run-time behavior of the classifiers from which they are built, and accurate predictions of such run-time behavior are difficult to obtain for many of the classifiers used for perception. By applying the algorithms-using-predictions framework, we propose efficient algorithms for the synthesis of IDK classifier cascades that are robust to inaccurate predictions in the following sense: the IDK classifier cascades synthesized by our algorithms have short expected execution durations when the predictions are accurate, and these expected durations increase only within specified bounds when the predictions are inaccurate.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609129
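The runtime behavior of a cascade (as opposed to its synthesis, which is the paper's subject) is simple to illustrate: classifiers run in order until one is confident enough. The following toy sketch makes that execution semantics concrete; the interface, confidence threshold, and timing values are invented for illustration, and the duration accounting is simplified.

```python
def run_cascade(classifiers, x, required_confidence=0.9):
    """Run an IDK cascade: invoke each classifier in order and return
    the first sufficiently confident prediction, falling through on
    "I Don't Know". `classifiers` is a list of (classify, duration)
    pairs, where classify(x) returns (label, confidence)."""
    total_time = 0.0
    for classify, duration in classifiers:
        total_time += duration
        label, confidence = classify(x)
        if confidence >= required_confidence:
            return label, total_time
    # In a full cascade the last stage is typically deterministic and
    # never returns IDK; this fallback keeps the sketch self-contained.
    return "IDK", total_time

# Toy stages: a fast, weak model and a slow, strong one.
fast = (lambda x: ("cat", 0.95 if x > 0.5 else 0.4), 1.0)
slow = (lambda x: ("cat", 0.99), 10.0)
print(run_cascade([fast, slow], 0.7))  # ('cat', 1.0): fast stage suffices
print(run_cascade([fast, slow], 0.2))  # ('cat', 11.0): falls through
```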
IOSR: Improving I/O Efficiency for Memory Swapping on Mobile Devices Via Scheduling and Reshaping
Wentong Li, Liang Shi, Hang Li, Changlong Li, Edwin Hsing-Mean Sha
Mobile systems and applications are becoming increasingly feature-rich and powerful, and they constantly suffer from memory pressure, especially on devices equipped with limited DRAM. Swapping inactive DRAM pages to the storage device is a promising solution for extending the physical memory. However, existing mobile devices usually adopt flash memory as the storage device, and swapping DRAM pages to flash memory may introduce significant performance overhead. In this paper, we first conduct an in-depth analysis of the I/O characteristics of flash-based memory swapping, including the I/O interference and the swap I/O randomness in the swap subsystem. We then propose IOSR, an I/O efficiency optimization framework for memory swapping, to enhance the performance of flash-based memory swapping on mobile devices. IOSR consists of two methods: swap I/O scheduling (SIOS) and swap I/O pattern reshaping (SIOR). SIOS schedules the swap I/O to reduce interference with the I/Os of other processes. SIOR reshapes the swap I/O pattern with process-oriented swap slot allocation and adaptive-granularity swap read-ahead. IOSR is implemented on a Google Pixel 4. Experimental results show that IOSR reduces the application switching time by 31.7% and improves the swap-in bandwidth by 35.5% on average compared to the state of the art.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3607923
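The interference-reduction goal of SIOS can be pictured as a two-level dispatch in which regular process I/O takes priority over swap I/O. This is a minimal user-space sketch of that idea, not the kernel mechanism the paper implements; the class, priority encoding, and request strings are all assumptions.

```python
import heapq

# Two assumed priority classes: lower value = dispatched first.
APP, SWAP = 0, 1

class IoScheduler:
    """Toy two-level I/O dispatch: app I/O preempts queued swap I/O,
    approximating SIOS's goal of reducing interference."""
    def __init__(self):
        self._q, self._seq = [], 0

    def submit(self, kind, request):
        heapq.heappush(self._q, (kind, self._seq, request))
        self._seq += 1  # FIFO order within the same priority class

    def dispatch(self):
        return heapq.heappop(self._q)[2] if self._q else None

sched = IoScheduler()
sched.submit(SWAP, "swap-in page 0x1a2b")
sched.submit(APP, "read app.db block 7")
print(sched.dispatch())  # app I/O is served first
print(sched.dispatch())  # then the queued swap I/O
```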
CIM: A Novel Clustering-based Energy-Efficient Data Imputation Method for Human Activity Recognition
Dina Hussein, Ganapati Bhat
Human activity recognition (HAR) is an important component in a number of health applications, including rehabilitation, Parkinson's disease, daily activity monitoring, and fitness monitoring. State-of-the-art HAR approaches use multiple sensors on the body to accurately identify activities at runtime, and they typically assume that data from all sensors are available for runtime activity recognition. However, data from one or more sensors may be unavailable due to malfunction, energy constraints, or communication challenges between the sensors. Missing data can lead to significant degradation in accuracy, thus affecting the quality of service to users. A common approach for handling missing data is to train classifiers or sensor-data recovery algorithms for each combination of missing sensors. However, this results in significant memory and energy overhead on resource-constrained wearable devices. In strong contrast to prior approaches, this paper presents a clustering-based approach (CIM) to impute missing data at runtime. We first define a set of possible clusters and representative data patterns for each sensor in HAR. Then, we create and store a mapping between clusters across sensors. At runtime, when data from a sensor are missing, we utilize the stored mapping table to obtain the most likely cluster for the missing sensor. The representative window for the identified cluster is then used as the imputation for activity classification. We also provide a method to obtain imputation-aware activity prediction sets to handle uncertainty in the data when using imputation. Experiments on three HAR datasets show that, with one missing sensor and single activity labels, CIM achieves accuracy within 10% of a baseline with no missing data. The accuracy gap drops to less than 1% with imputation-aware classification. Measurements on a low-power processor show that CIM achieves close to 100% energy savings compared to state-of-the-art generative approaches.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609111
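The runtime lookup the abstract describes (cross-sensor cluster mapping, then a representative window as the imputed input) is cheap enough to show in a few lines. The sensor names, table contents, and window shapes below are invented for illustration; the real mapping and representatives are learned offline from HAR data.

```python
import numpy as np

# Assumed mapping learned offline: P(accelerometer cluster | gyro cluster).
mapping = {
    0: np.array([0.7, 0.2, 0.1]),
    1: np.array([0.1, 0.8, 0.1]),
}
# Assumed representative window (50 samples) per accelerometer cluster.
acc_representatives = {0: np.zeros(50), 1: np.ones(50), 2: np.full(50, 2.0)}

def impute_accelerometer(gyro_cluster):
    # Look up the most likely cluster for the missing sensor, then use
    # that cluster's representative window as the imputed input.
    acc_cluster = int(np.argmax(mapping[gyro_cluster]))
    return acc_representatives[acc_cluster]

window = impute_accelerometer(gyro_cluster=1)
print(window[:5])  # fed to the activity classifier in place of real data
```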
MaGNAS: A Mapping-Aware Graph Neural Architecture Search Framework for Heterogeneous MPSoC Deployment
Mohanad Odema, Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Mohammad Abdullah Al Faruque
Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity to model structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by recent advancements in heterogeneous multi-processor systems-on-chip (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet, contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. In particular, we develop MaGNAS, a mapping-aware graph neural architecture search framework. MaGNAS defines a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify the optimal GNN and mapping pairings that yield the best performance trade-offs. Designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four state-of-the-art vision datasets using both (i) a real hardware SoC platform (NVIDIA Xavier AGX) and (ii) a performance/cost-model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS provides a 1.57× latency speedup and is 3.38× more energy-efficient for several vision datasets executed on the Xavier MPSoC compared to the GPU-only deployment, while sustaining an average accuracy reduction of only 0.11% from the baseline.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609386
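To make the "two-tier evolutionary search" concrete, here is a toy loop in which an outer tier evolves architecture genes and an inner tier evolves the per-layer PE mapping for each candidate. Everything here is an assumption for illustration: the operation set, PE names, cost numbers, and the fitness function are stand-ins for the measured latency/energy/accuracy objectives MaGNAS actually optimizes.

```python
import random
random.seed(0)

OPS, PES, LAYERS = ["gcn", "sage", "skip"], ["gpu", "dla"], 4

def fitness(arch, mapping):
    # Invented proxies: heavier ops help accuracy but cost latency,
    # scaled by which PE each layer is mapped to.
    acc = {"gcn": 1.0, "sage": 0.8, "skip": 0.1}
    lat = {"gcn": 3.0, "sage": 2.0, "skip": 0.5}
    slow = {"gpu": 1.0, "dla": 1.6}
    return (sum(acc[o] for o in arch)
            - 0.2 * sum(lat[o] * slow[p] for o, p in zip(arch, mapping)))

def mutate(genes, choices):
    i = random.randrange(len(genes))
    return genes[:i] + [random.choice(choices)] + genes[i + 1:]

best, arch = None, [random.choice(OPS) for _ in range(LAYERS)]
for _ in range(50):                                  # outer tier: architecture
    mapping = [random.choice(PES) for _ in range(LAYERS)]
    for _ in range(20):                              # inner tier: mapping
        cand = mutate(mapping, PES)
        if fitness(arch, cand) > fitness(arch, mapping):
            mapping = cand
    score = fitness(arch, mapping)
    if best is None or score > best[0]:
        best = (score, arch, mapping)
    arch = mutate(best[1], OPS)                      # evolve from the elite
print(best)
```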
Let Coarse-Grained Resources Be Shared: Mapping Entire Neural Networks on FPGAs
Tzung-Han Juang, Christof Schlaak, Christophe Dubach
Traditional High-Level Synthesis (HLS) provides rapid prototyping of hardware accelerators without coding in Hardware Description Languages (HDLs). However, such an approach does not adequately support allocating large applications, such as entire deep neural networks, on a single Field Programmable Gate Array (FPGA) device; it leads to designs that are inefficient or that do not fit into FPGAs due to resource constraints. This work proposes to shrink generated designs via coarse-grained resource control based on function sharing in functional Intermediate Representations (IRs). The proposed compiler passes and rewrite system aim to produce valid design points and remove redundant hardware. Such optimizations make fitting entire neural networks on FPGAs feasible and produce performance competitive with running specialized kernels for each layer.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609109
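In the spirit of function sharing, the sketch below rewrites a toy IR so that repeated uses of the same function are routed through a single, time-multiplexed hardware instance. The IR encoding, node names, and rewrite are all invented for illustration and deliberately ignore scheduling and multiplexer insertion, which a real rewrite system must handle.

```python
# Toy IR: (op, function, destination, source) tuples.
ir = [
    ("call", "conv3x3", "layer1_out", "layer1_in"),
    ("call", "conv3x3", "layer2_out", "layer1_out"),
    ("call", "pool2x2", "layer3_out", "layer2_out"),
]

def share_functions(ir):
    """Collapse structurally identical function nodes into one shared
    hardware instance; later call sites reuse the first allocation."""
    instances, rewritten = {}, []
    for op, fn, dst, src in ir:
        inst = instances.setdefault(fn, f"{fn}_shared")
        rewritten.append((op, inst, dst, src))
    return rewritten, len(ir) - len(instances)

new_ir, saved = share_functions(ir)
print(new_ir)
print(f"hardware instances saved: {saved}")  # 1: the second conv3x3
```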
DTRL: Decision Tree-based Multi-Objective Reinforcement Learning for Runtime Task Scheduling in Domain-Specific System-on-Chips
Toygun Basaklar, A. Alper Goksoy, Anish Krishnakumar, Suat Gumussoy, Umit Y. Ogras
Domain-specific systems-on-chip (DSSoCs) combine general-purpose processors and specialized hardware accelerators to improve performance and energy efficiency for a specific domain. Optimally allocating tasks to processing elements (PEs) with minimal runtime overhead is crucial to achieving this potential. However, this problem remains challenging, as prior approaches suffer from non-optimal scheduling decisions or significant runtime overheads. Moreover, existing techniques focus on a single optimization objective, such as maximizing performance. This work proposes DTRL, a decision-tree-based multi-objective reinforcement learning technique for runtime task scheduling in DSSoCs. DTRL trains a single global differentiable decision tree (DDT) policy that covers the entire objective space quantified by a preference vector. Our extensive experimental evaluations using our novel reinforcement learning environment demonstrate that DTRL captures the trade-off between execution time and power consumption, thereby generating a Pareto set of solutions using a single policy. Furthermore, comparison with state-of-the-art heuristic-, optimization-, and machine-learning-based schedulers shows that DTRL achieves up to 9× higher performance and up to a 3.08× reduction in energy consumption. The trained DDT policy achieves 120 ns inference latency on a Xilinx Zynq ZCU102 FPGA at 1.2 GHz, resulting in negligible runtime overhead. Evaluation on the same hardware shows that DTRL achieves up to 16% higher performance than a state-of-the-art heuristic scheduler.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609108
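A differentiable decision tree conditioned on a preference vector can be sketched very compactly: the preference is appended to the state, and soft (sigmoid) routing blends leaf action distributions. The depth-1 tree, feature sizes, and four PEs below are illustrative assumptions, not the trained DTRL policy.

```python
import numpy as np

def ddt_policy(state, preference, w, b, leaf_logits):
    # Append the objective preference to the state, route softly
    # between two leaves, and pick the PE with the highest blended logit.
    x = np.concatenate([state, preference])
    p_left = 1.0 / (1.0 + np.exp(-(w @ x + b)))       # soft routing
    logits = p_left * leaf_logits[0] + (1 - p_left) * leaf_logits[1]
    return int(np.argmax(logits))                      # chosen PE index

rng = np.random.default_rng(1)
state = rng.random(5)                  # e.g., task/PE-availability features
preference = np.array([0.7, 0.3])      # weight on performance vs. power
w, b = rng.normal(size=7), 0.0
leaf_logits = rng.normal(size=(2, 4))  # 4 candidate processing elements
print("schedule task on PE", ddt_policy(state, preference, w, b, leaf_logits))
```

Because the routing is a sigmoid rather than a hard split, the whole policy is differentiable end to end, which is what allows it to be trained with gradient-based RL and later executed with hard splits at negligible cost.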
Energy-efficient Personalized Federated Search with Graph for Edge Computing
Zhao Yang, Qingshuang Sun
Federated Learning (FL) is a popular method for privacy-preserving machine learning on edge devices. However, the heterogeneity of edge devices, including differences in system architecture, data, and co-running applications, can significantly impact the energy efficiency of FL. To address these issues, we propose an energy-efficient personalized federated search framework with three key components. First, we search for partial models with high inference efficiency to reduce training energy consumption and the occurrence of stragglers in each round. Second, we build lightweight search controllers that control model sampling and respond to runtime variances, mitigating new straggler issues caused by co-running applications. Finally, we design an adaptive search update strategy based on graph aggregation to improve the convergence of personalized training. Our framework reduces the energy consumption of the training process by lowering the training overhead of each round and speeding up the training convergence rate. Experimental results show that our approach achieves up to a 5.02% accuracy improvement and a 3.45× improvement in energy efficiency.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609435
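One round of federated training with per-client partial models can be sketched as follows: each client trains only a sampled subset of the global weights (sized to its capacity), and the server aggregates updates per coordinate, weighted by participation. The masking scheme, capacities, and aggregation rule are assumptions for illustration, not the paper's controllers or graph-based update strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
global_w = np.zeros(8)  # toy global model

def sample_mask(capacity):
    # Weaker clients train smaller submodels to stay energy-efficient.
    return rng.random(global_w.shape) < capacity

def local_update(w, mask):
    grad = rng.normal(size=w.shape)   # stand-in for a real local gradient
    return (w - 0.1 * grad) * mask    # only the sampled slice is trained

def aggregate(updates):
    # Per-coordinate average over the clients that actually trained it.
    num = sum(u for u, m in updates)
    den = sum(m for u, m in updates) + 1e-9
    return num / den

updates = []
for capacity in (0.3, 0.6, 1.0):      # three heterogeneous clients
    mask = sample_mask(capacity)
    updates.append((local_update(global_w, mask), mask.astype(float)))
global_w = aggregate(updates)
print(global_w)
```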
WARM-tree: Making Quadtrees Write-efficient and Space-economic on Persistent Memories
Shin-Ting Wu, Liang-Chi Chen, Po-Chun Huang, Yuan-Hao Chang, Chien-Chung Ho, Wei-Kuan Shih
Recently, the value of data has been widely recognized, which highlights the significance of data-centric computing in diversified application scenarios. In many cases, the data are multidimensional, and the management of multidimensional data often confronts greater challenges in supporting efficient data access operations and guaranteeing space utilization. On the other hand, although many index data structures have been proposed for multidimensional data management, their designs are not fully optimized for modern nonvolatile memories, in particular byte-addressable persistent memories. As a result, they might undergo serious access performance degradation or fail to guarantee space utilization. This observation motivates the redesigning of index data structures for multidimensional point data on modern persistent memories, such as phase-change memory. In this work, we present the WARM-tree, a Write-Amplification-Reducing Multidimensional tree for multidimensional point data. In our evaluation studies, compared to the bucket PR quadtree and the R*-tree, the WARM-tree can provide any worst-case space utilization guarantee of the form \(\frac{m-1}{m}\) (\(m \in \mathbb{Z}^{+}\)) and effectively reduces the write traffic of key insertions by up to 48.10% and 85.86%, respectively, at the price of degraded average space utilization and prolonged latency of query operations. This suggests that the WARM-tree is a promising multidimensional index structure for insert-intensive workloads.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3608033
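A short worked example makes the \(\frac{m-1}{m}\) bound tangible: the parameter \(m\) tunes the worst-case space utilization guarantee, with larger \(m\) giving a tighter floor (the interpretation of \(m\) as a tunable structure parameter is our reading of the abstract).

```python
# Worst-case space-utilization floor (m - 1) / m for a few choices of m.
for m in (2, 4, 8, 16):
    print(f"m = {m:2d}: guaranteed utilization >= {(m - 1) / m:.1%}")
# m =  2: guaranteed utilization >= 50.0%
# m =  4: guaranteed utilization >= 75.0%
# m =  8: guaranteed utilization >= 87.5%
# m = 16: guaranteed utilization >= 93.8%
```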