A Self-Sustained CPS Design for Reliable Wildfire Monitoring
Yigit Tuncel, Toygun Basaklar, Dina Carpenter-Graffy, Umit Ogras
Continuous monitoring of areas near the electric grid is critical for the prevention and early detection of devastating wildfires. Existing wildfire monitoring systems are intermittent and oblivious to local ambient risk factors, resulting in poor wildfire awareness. Ambient sensor suites deployed near the gridlines can increase the monitoring granularity and detection accuracy. However, these sensors must address two challenging and competing objectives at the same time. First, they must remain powered for years without manual maintenance due to their remote locations. Second, they must provide and transmit reliable information if and when a wildfire starts. The first objective requires aggressive energy savings and ambient energy harvesting, while the second requires the continuous operation of a range of sensors. To the best of our knowledge, this paper presents the first self-sustained cyber-physical system that dynamically co-optimizes the wildfire detection accuracy and active time of sensors. The proposed approach employs reinforcement learning to train a policy that controls the sensor operations as a function of the environment (i.e., current sensor readings), harvested energy, and battery level. The proposed cyber-physical system is evaluated extensively using real-life temperature, wind, and solar energy harvesting datasets and an open-source wildfire simulator. In long-term (5-year) evaluations, the proposed framework achieves 89% uptime, which is 46% higher than a carefully tuned heuristic approach. At the same time, it averages a 2-minute initial response time, which is at least 2.5× faster than the same heuristic approach. Furthermore, the policy network consumes 0.6 mJ per day on the TI CC2652R microcontroller using TensorFlow Lite for Micro, which is negligible compared to the daily energy consumption of the sensor suite.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3608100
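The control loop described in this abstract maps a state vector (current sensor readings, harvested energy, battery level) to a sensor-operation decision through a small policy network. Below is a minimal sketch of that inference step; the feature scaling, network sizes, and the three duty-cycle modes are illustrative assumptions, not the paper's trained policy.

```python
import numpy as np

# Hypothetical state layout: normalized sensor readings plus harvested
# power and battery level, as described in the abstract.
def make_state(temperature_c, wind_ms, harvested_mw, battery_frac):
    return np.array([temperature_c / 50.0, wind_ms / 30.0,
                     harvested_mw / 100.0, battery_frac])

def policy(state, w1, b1, w2, b2):
    # Tiny two-layer policy network; the real policy is trained with RL.
    h = np.tanh(state @ w1 + b1)
    logits = h @ w2 + b2
    return int(np.argmax(logits))  # index of a sensing duty-cycle mode

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 3)), np.zeros(3)  # 3 assumed modes: off/low/full
mode = policy(make_state(35.0, 12.0, 20.0, 0.8), w1, b1, w2, b2)
print("selected sensing mode:", mode)
```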
DASS: Differentiable Architecture Search for Sparse Neural Networks
Hamid Mousavi, Mohammad Loni, Mina Alibeigi, Masoud Daneshtalab
The deployment of Deep Neural Networks (DNNs) on edge devices is hindered by the substantial gap between performance requirements and available computational power. While recent research has made significant strides in developing pruning methods that build sparse networks to reduce the computing overhead of DNNs, considerable accuracy loss remains, especially at high pruning ratios. We find that the architectures designed for dense networks by differentiable architecture search methods are ineffective when pruning mechanisms are applied to them. The main reason is that current methods do not support sparse architectures in their search space and use a search objective that is tailored to dense networks and does not account for sparsity. This paper proposes a new method, DASS, to search for sparsity-friendly neural architectures by adding two new sparse operations to the search space and modifying the search objective. Specifically, we propose two novel parametric SparseConv and SparseLinear operations that expand the search space to include sparse operations; because they are sparse parametric versions of the convolution and linear operations, they make the search space flexible. The proposed search objective lets us train the architecture based on the sparsity of the search-space operations. Quantitative analyses demonstrate that architectures found through DASS outperform those used in state-of-the-art sparse networks on the CIFAR-10 and ImageNet datasets. In terms of performance and hardware effectiveness, DASS increases the accuracy of the sparse version of MobileNet-v2 from 73.44% to 81.35% (a +7.91% improvement) with a 3.87× faster inference time.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609385
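As a rough, hypothetical sketch of the general idea behind a parametric sparse operation (not the paper's exact SparseConv parameterization), a convolution can carry a learnable score tensor whose thresholded mask prunes individual weights; the class name, scoring scheme, and threshold below are assumptions, written against standard PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv2d(nn.Module):
    """Illustrative sparse convolution: a dense kernel gated by a
    learnable per-weight score; scores below a threshold are pruned
    at forward time."""
    def __init__(self, in_ch, out_ch, k, threshold=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.score = nn.Parameter(torch.rand(out_ch, in_ch, k, k))
        self.threshold = threshold

    def forward(self, x):
        # Hard mask here for clarity; training would typically use a
        # straight-through estimator to keep the mask differentiable.
        mask = (self.score > self.threshold).float()
        return F.conv2d(x, self.weight * mask, padding="same")

x = torch.randn(1, 3, 32, 32)
print(SparseConv2d(3, 16, 3)(x).shape)  # torch.Size([1, 16, 32, 32])
```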
Optimal Synthesis of Robust IDK Classifier Cascades
Sanjoy Baruah, Alan Burns, Robert Ian Davis
An IDK classifier is a computing component that categorizes inputs into one of a number of classes if it is able to do so with the required level of confidence; otherwise, it returns "I Don't Know" (IDK). IDK classifier cascades have been proposed as a way of balancing the needs for fast response and high accuracy in classification-based machine perception. Efficient algorithms for the synthesis of IDK classifier cascades have been derived; however, the responsiveness of these cascades is highly dependent on the accuracy of predictions regarding the run-time behavior of the classifiers from which they are built, and accurate predictions of such run-time behavior are difficult to obtain for many of the classifiers used for perception. By applying the algorithms-using-predictions framework, we propose efficient algorithms for the synthesis of IDK classifier cascades that are robust to inaccurate predictions in the following sense: the IDK classifier cascades synthesized by our algorithms have short expected execution durations when the predictions are accurate, and these expected durations increase only within specified bounds when the predictions are inaccurate.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609129
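The runtime behavior of a cascade (as opposed to its synthesis, which is the paper's subject) is simple to illustrate: classifiers run in order until one is confident enough. The following toy sketch makes that execution semantics concrete; the interface, confidence threshold, and timing values are invented for illustration, and the duration accounting is simplified.

```python
def run_cascade(classifiers, x, required_confidence=0.9):
    """Run an IDK cascade: invoke each classifier in order and return
    the first sufficiently confident prediction, falling through on
    "I Don't Know". `classifiers` is a list of (classify, duration)
    pairs, where classify(x) returns (label, confidence)."""
    total_time = 0.0
    for classify, duration in classifiers:
        total_time += duration
        label, confidence = classify(x)
        if confidence >= required_confidence:
            return label, total_time
    # In a full cascade the last stage is typically deterministic and
    # never returns IDK; this fallback keeps the sketch self-contained.
    return "IDK", total_time

# Toy stages: a fast, weak model and a slow, strong one.
fast = (lambda x: ("cat", 0.95 if x > 0.5 else 0.4), 1.0)
slow = (lambda x: ("cat", 0.99), 10.0)
print(run_cascade([fast, slow], 0.7))  # ('cat', 1.0): fast stage suffices
print(run_cascade([fast, slow], 0.2))  # ('cat', 11.0): falls through
```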
IOSR: Improving I/O Efficiency for Memory Swapping on Mobile Devices Via Scheduling and Reshaping
Wentong Li, Liang Shi, Hang Li, Changlong Li, Edwin Hsing-Mean Sha
Mobile systems and applications are becoming increasingly feature-rich and powerful, and they constantly suffer from memory pressure, especially on devices equipped with limited DRAM. Swapping inactive DRAM pages to the storage device is a promising solution for extending the physical memory. However, existing mobile devices usually adopt flash memory as the storage device, and swapping DRAM pages to flash memory may introduce significant performance overhead. In this paper, we first conduct an in-depth analysis of the I/O characteristics of flash-based memory swapping, including the I/O interference and the swap I/O randomness in the swap subsystem. We then propose IOSR, an I/O efficiency optimization framework for memory swapping, to enhance the performance of flash-based memory swapping on mobile devices. IOSR consists of two methods: swap I/O scheduling (SIOS) and swap I/O pattern reshaping (SIOR). SIOS schedules the swap I/O to reduce interference with the I/Os of other processes. SIOR reshapes the swap I/O pattern with process-oriented swap slot allocation and adaptive-granularity swap read-ahead. IOSR is implemented on a Google Pixel 4. Experimental results show that IOSR reduces the application switching time by 31.7% and improves the swap-in bandwidth by 35.5% on average compared to the state of the art.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3607923
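The interference-reduction goal of SIOS can be pictured as a two-level dispatch in which regular process I/O takes priority over swap I/O. This is a minimal user-space sketch of that idea, not the kernel mechanism the paper implements; the class, priority encoding, and request strings are all assumptions.

```python
import heapq

# Two assumed priority classes: lower value = dispatched first.
APP, SWAP = 0, 1

class IoScheduler:
    """Toy two-level I/O dispatch: app I/O preempts queued swap I/O,
    approximating SIOS's goal of reducing interference."""
    def __init__(self):
        self._q, self._seq = [], 0

    def submit(self, kind, request):
        heapq.heappush(self._q, (kind, self._seq, request))
        self._seq += 1  # FIFO order within the same priority class

    def dispatch(self):
        return heapq.heappop(self._q)[2] if self._q else None

sched = IoScheduler()
sched.submit(SWAP, "swap-in page 0x1a2b")
sched.submit(APP, "read app.db block 7")
print(sched.dispatch())  # app I/O is served first
print(sched.dispatch())  # then the queued swap I/O
```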
CIM: A Novel Clustering-based Energy-Efficient Data Imputation Method for Human Activity Recognition
Dina Hussein, Ganapati Bhat
Human activity recognition (HAR) is an important component in a number of health applications, including rehabilitation, Parkinson's disease, daily activity monitoring, and fitness monitoring. State-of-the-art HAR approaches use multiple sensors on the body to accurately identify activities at runtime, and they typically assume that data from all sensors are available for runtime activity recognition. However, data from one or more sensors may be unavailable due to malfunction, energy constraints, or communication challenges between the sensors. Missing data can lead to significant degradation in accuracy, thus affecting the quality of service to users. A common approach for handling missing data is to train classifiers or sensor-data recovery algorithms for each combination of missing sensors. However, this results in significant memory and energy overhead on resource-constrained wearable devices. In strong contrast to prior approaches, this paper presents a clustering-based approach (CIM) to impute missing data at runtime. We first define a set of possible clusters and representative data patterns for each sensor in HAR. Then, we create and store a mapping between clusters across sensors. At runtime, when data from a sensor are missing, we utilize the stored mapping table to obtain the most likely cluster for the missing sensor. The representative window for the identified cluster is then used as the imputation for activity classification. We also provide a method to obtain imputation-aware activity prediction sets to handle uncertainty in the data when using imputation. Experiments on three HAR datasets show that, with one missing sensor and single activity labels, CIM achieves accuracy within 10% of a baseline with no missing data. The accuracy gap drops to less than 1% with imputation-aware classification. Measurements on a low-power processor show that CIM achieves close to 100% energy savings compared to state-of-the-art generative approaches.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609111
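The runtime lookup the abstract describes (cross-sensor cluster mapping, then a representative window as the imputed input) is cheap enough to show in a few lines. The sensor names, table contents, and window shapes below are invented for illustration; the real mapping and representatives are learned offline from HAR data.

```python
import numpy as np

# Assumed mapping learned offline: P(accelerometer cluster | gyro cluster).
mapping = {
    0: np.array([0.7, 0.2, 0.1]),
    1: np.array([0.1, 0.8, 0.1]),
}
# Assumed representative window (50 samples) per accelerometer cluster.
acc_representatives = {0: np.zeros(50), 1: np.ones(50), 2: np.full(50, 2.0)}

def impute_accelerometer(gyro_cluster):
    # Look up the most likely cluster for the missing sensor, then use
    # that cluster's representative window as the imputed input.
    acc_cluster = int(np.argmax(mapping[gyro_cluster]))
    return acc_representatives[acc_cluster]

window = impute_accelerometer(gyro_cluster=1)
print(window[:5])  # fed to the activity classifier in place of real data
```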
MaGNAS: A Mapping-Aware Graph Neural Architecture Search Framework for Heterogeneous MPSoC Deployment
Mohanad Odema, Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Mohammad Abdullah Al Faruque
Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity to model structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by recent advancements in heterogeneous multi-processor systems-on-chip (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet, contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. In particular, we develop MaGNAS, a mapping-aware graph neural architecture search framework. MaGNAS defines a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify the optimal GNN and mapping pairings that yield the best performance trade-offs. Designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four state-of-the-art vision datasets using both (i) a real hardware SoC platform (NVIDIA Xavier AGX) and (ii) a performance/cost-model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS provides a 1.57× latency speedup and is 3.38× more energy-efficient for several vision datasets executed on the Xavier MPSoC compared to the GPU-only deployment, while sustaining an average accuracy reduction of only 0.11% from the baseline.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609386
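To make the "two-tier evolutionary search" concrete, here is a toy loop in which an outer tier evolves architecture genes and an inner tier evolves the per-layer PE mapping for each candidate. Everything here is an assumption for illustration: the operation set, PE names, cost numbers, and the fitness function are stand-ins for the measured latency/energy/accuracy objectives MaGNAS actually optimizes.

```python
import random
random.seed(0)

OPS, PES, LAYERS = ["gcn", "sage", "skip"], ["gpu", "dla"], 4

def fitness(arch, mapping):
    # Invented proxies: heavier ops help accuracy but cost latency,
    # scaled by which PE each layer is mapped to.
    acc = {"gcn": 1.0, "sage": 0.8, "skip": 0.1}
    lat = {"gcn": 3.0, "sage": 2.0, "skip": 0.5}
    slow = {"gpu": 1.0, "dla": 1.6}
    return (sum(acc[o] for o in arch)
            - 0.2 * sum(lat[o] * slow[p] for o, p in zip(arch, mapping)))

def mutate(genes, choices):
    i = random.randrange(len(genes))
    return genes[:i] + [random.choice(choices)] + genes[i + 1:]

best, arch = None, [random.choice(OPS) for _ in range(LAYERS)]
for _ in range(50):                                  # outer tier: architecture
    mapping = [random.choice(PES) for _ in range(LAYERS)]
    for _ in range(20):                              # inner tier: mapping
        cand = mutate(mapping, PES)
        if fitness(arch, cand) > fitness(arch, mapping):
            mapping = cand
    score = fitness(arch, mapping)
    if best is None or score > best[0]:
        best = (score, arch, mapping)
    arch = mutate(best[1], OPS)                      # evolve from the elite
print(best)
```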
Let Coarse-Grained Resources Be Shared: Mapping Entire Neural Networks on FPGAs
Tzung-Han Juang, Christof Schlaak, Christophe Dubach
Traditional High-Level Synthesis (HLS) provides rapid prototyping of hardware accelerators without coding in Hardware Description Languages (HDLs). However, such an approach does not adequately support allocating large applications, such as entire deep neural networks, on a single Field Programmable Gate Array (FPGA) device; it leads to designs that are inefficient or that do not fit into FPGAs due to resource constraints. This work proposes to shrink generated designs via coarse-grained resource control based on function sharing in functional Intermediate Representations (IRs). The proposed compiler passes and rewrite system aim to produce valid design points and remove redundant hardware. Such optimizations make fitting entire neural networks on FPGAs feasible and produce performance competitive with running specialized kernels for each layer.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609109
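In the spirit of function sharing, the sketch below rewrites a toy IR so that repeated uses of the same function are routed through a single, time-multiplexed hardware instance. The IR encoding, node names, and rewrite are all invented for illustration and deliberately ignore scheduling and multiplexer insertion, which a real rewrite system must handle.

```python
# Toy IR: (op, function, destination, source) tuples.
ir = [
    ("call", "conv3x3", "layer1_out", "layer1_in"),
    ("call", "conv3x3", "layer2_out", "layer1_out"),
    ("call", "pool2x2", "layer3_out", "layer2_out"),
]

def share_functions(ir):
    """Collapse structurally identical function nodes into one shared
    hardware instance; later call sites reuse the first allocation."""
    instances, rewritten = {}, []
    for op, fn, dst, src in ir:
        inst = instances.setdefault(fn, f"{fn}_shared")
        rewritten.append((op, inst, dst, src))
    return rewritten, len(ir) - len(instances)

new_ir, saved = share_functions(ir)
print(new_ir)
print(f"hardware instances saved: {saved}")  # 1: the second conv3x3
```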
DTRL: Decision Tree-based Multi-Objective Reinforcement Learning for Runtime Task Scheduling in Domain-Specific System-on-Chips
Toygun Basaklar, A. Alper Goksoy, Anish Krishnakumar, Suat Gumussoy, Umit Y. Ogras
Domain-specific systems-on-chip (DSSoCs) combine general-purpose processors and specialized hardware accelerators to improve performance and energy efficiency for a specific domain. Optimally allocating tasks to processing elements (PEs) with minimal runtime overhead is crucial to achieving this potential. However, this problem remains challenging, as prior approaches suffer from non-optimal scheduling decisions or significant runtime overheads. Moreover, existing techniques focus on a single optimization objective, such as maximizing performance. This work proposes DTRL, a decision-tree-based multi-objective reinforcement learning technique for runtime task scheduling in DSSoCs. DTRL trains a single global differentiable decision tree (DDT) policy that covers the entire objective space quantified by a preference vector. Our extensive experimental evaluations using our novel reinforcement learning environment demonstrate that DTRL captures the trade-off between execution time and power consumption, thereby generating a Pareto set of solutions using a single policy. Furthermore, comparison with state-of-the-art heuristic-, optimization-, and machine-learning-based schedulers shows that DTRL achieves up to 9× higher performance and up to a 3.08× reduction in energy consumption. The trained DDT policy achieves 120 ns inference latency on a Xilinx Zynq ZCU102 FPGA at 1.2 GHz, resulting in negligible runtime overhead. Evaluation on the same hardware shows that DTRL achieves up to 16% higher performance than a state-of-the-art heuristic scheduler.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609108
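A differentiable decision tree conditioned on a preference vector can be sketched very compactly: the preference is appended to the state, and soft (sigmoid) routing blends leaf action distributions. The depth-1 tree, feature sizes, and four PEs below are illustrative assumptions, not the trained DTRL policy.

```python
import numpy as np

def ddt_policy(state, preference, w, b, leaf_logits):
    # Append the objective preference to the state, route softly
    # between two leaves, and pick the PE with the highest blended logit.
    x = np.concatenate([state, preference])
    p_left = 1.0 / (1.0 + np.exp(-(w @ x + b)))       # soft routing
    logits = p_left * leaf_logits[0] + (1 - p_left) * leaf_logits[1]
    return int(np.argmax(logits))                      # chosen PE index

rng = np.random.default_rng(1)
state = rng.random(5)                  # e.g., task/PE-availability features
preference = np.array([0.7, 0.3])      # weight on performance vs. power
w, b = rng.normal(size=7), 0.0
leaf_logits = rng.normal(size=(2, 4))  # 4 candidate processing elements
print("schedule task on PE", ddt_policy(state, preference, w, b, leaf_logits))
```

Because the routing is a sigmoid rather than a hard split, the whole policy is differentiable end to end, which is what allows it to be trained with gradient-based RL and later executed with hard splits at negligible cost.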
Energy-efficient Personalized Federated Search with Graph for Edge Computing
Zhao Yang, Qingshuang Sun
Federated Learning (FL) is a popular method for privacy-preserving machine learning on edge devices. However, the heterogeneity of edge devices, including differences in system architecture, data, and co-running applications, can significantly impact the energy efficiency of FL. To address these issues, we propose an energy-efficient personalized federated search framework with three key components. First, we search for partial models with high inference efficiency to reduce training energy consumption and the occurrence of stragglers in each round. Second, we build lightweight search controllers that control model sampling and respond to runtime variances, mitigating new straggler issues caused by co-running applications. Finally, we design an adaptive search update strategy based on graph aggregation to improve the convergence of personalized training. Our framework reduces the energy consumption of the training process by lowering the training overhead of each round and speeding up the training convergence rate. Experimental results show that our approach achieves up to a 5.02% accuracy improvement and a 3.45× improvement in energy efficiency.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3609435
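One round of federated training with per-client partial models can be sketched as follows: each client trains only a sampled subset of the global weights (sized to its capacity), and the server aggregates updates per coordinate, weighted by participation. The masking scheme, capacities, and aggregation rule are assumptions for illustration, not the paper's controllers or graph-based update strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
global_w = np.zeros(8)  # toy global model

def sample_mask(capacity):
    # Weaker clients train smaller submodels to stay energy-efficient.
    return rng.random(global_w.shape) < capacity

def local_update(w, mask):
    grad = rng.normal(size=w.shape)   # stand-in for a real local gradient
    return (w - 0.1 * grad) * mask    # only the sampled slice is trained

def aggregate(updates):
    # Per-coordinate average over the clients that actually trained it.
    num = sum(u for u, m in updates)
    den = sum(m for u, m in updates) + 1e-9
    return num / den

updates = []
for capacity in (0.3, 0.6, 1.0):      # three heterogeneous clients
    mask = sample_mask(capacity)
    updates.append((local_update(global_w, mask), mask.astype(float)))
global_w = aggregate(updates)
print(global_w)
```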
WARM-tree: Making Quadtrees Write-efficient and Space-economic on Persistent Memories
Shin-Ting Wu, Liang-Chi Chen, Po-Chun Huang, Yuan-Hao Chang, Chien-Chung Ho, Wei-Kuan Shih
Recently, the value of data has been widely recognized, which highlights the significance of data-centric computing in diversified application scenarios. In many cases, the data are multidimensional, and the management of multidimensional data often confronts greater challenges in supporting efficient data access operations and guaranteeing space utilization. On the other hand, although many index data structures have been proposed for multidimensional data management, their designs are not fully optimized for modern nonvolatile memories, in particular byte-addressable persistent memories. As a result, they might undergo serious access performance degradation or fail to guarantee space utilization. This observation motivates the redesigning of index data structures for multidimensional point data on modern persistent memories, such as phase-change memory. In this work, we present the WARM-tree, a Write-Amplification-Reducing Multidimensional tree for multidimensional point data. In our evaluation studies, compared to the bucket PR quadtree and the R*-tree, the WARM-tree can provide any worst-case space utilization guarantee of the form \(\frac{m-1}{m}\) (\(m \in \mathbb{Z}^{+}\)) and effectively reduces the write traffic of key insertions by up to 48.10% and 85.86%, respectively, at the price of degraded average space utilization and prolonged latency of query operations. This suggests that the WARM-tree is a promising multidimensional index structure for insert-intensive workloads.
ACM Transactions on Embedded Computing Systems, 2023. DOI: 10.1145/3608033
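A short worked example makes the \(\frac{m-1}{m}\) bound tangible: the parameter \(m\) tunes the worst-case space utilization guarantee, with larger \(m\) giving a tighter floor (the interpretation of \(m\) as a tunable structure parameter is our reading of the abstract).

```python
# Worst-case space-utilization floor (m - 1) / m for a few choices of m.
for m in (2, 4, 8, 16):
    print(f"m = {m:2d}: guaranteed utilization >= {(m - 1) / m:.1%}")
# m =  2: guaranteed utilization >= 50.0%
# m =  4: guaranteed utilization >= 75.0%
# m =  8: guaranteed utilization >= 87.5%
# m = 16: guaranteed utilization >= 93.8%
```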