首页 > 最新文献

ACM Transactions on Embedded Computing Systems最新文献

英文 中文
A Self-Sustained CPS Design for Reliable Wildfire Monitoring 可靠野火监测的自维持CPS设计
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3608100
Yigit Tuncel, Toygun Basaklar, Dina Carpenter-Graffy, Umit Ogras
Continuous monitoring of areas nearby the electric grid is critical for preventing and early detection of devastating wildfires. Existing wildfire monitoring systems are intermittent and oblivious to local ambient risk factors, resulting in poor wildfire awareness. Ambient sensor suites deployed near the gridlines can increase the monitoring granularity and detection accuracy. However, these sensors must address two challenging and competing objectives at the same time. First, they must remain powered for years without manual maintenance due to their remote locations. Second, they must provide and transmit reliable information if and when a wildfire starts. The first objective requires aggressive energy savings and ambient energy harvesting, while the second requires continuous operation of a range of sensors. To the best of our knowledge, this paper presents the first self-sustained cyber-physical system that dynamically co-optimizes the wildfire detection accuracy and active time of sensors. The proposed approach employs reinforcement learning to train a policy that controls the sensor operations as a function of the environment (i.e., current sensor readings), harvested energy, and battery level. The proposed cyber-physical system is evaluated extensively using real-life temperature, wind, and solar energy harvesting datasets and an open-source wildfire simulator. In long-term (5 years) evaluations, the proposed framework achieves 89% uptime, which is 46% higher than a carefully tuned heuristic approach. At the same time, it averages a 2-minute initial response time, which is at least 2.5× faster than the same heuristic approach. Furthermore, the policy network consumes 0.6 mJ per day on the TI CC2652R microcontroller using TensorFlow Lite for Micro, which is negligible compared to the daily sensor suite energy consumption.
对电网附近地区的持续监测对于预防和早期发现毁灭性的野火至关重要。现有的野火监测系统是间歇性的,对当地环境的风险因素一无所知,导致对野火的认识不足。部署在网格线附近的环境传感器套件可以提高监测粒度和检测精度。然而,这些传感器必须同时解决两个具有挑战性和竞争性的目标。首先,由于位置偏远,它们必须在没有人工维护的情况下保持供电多年。其次,当野火开始时,他们必须提供和传递可靠的信息。第一个目标需要积极的能源节约和环境能量收集,而第二个目标需要一系列传感器的连续运行。据我们所知,本文提出了第一个自我维持的网络物理系统,该系统可以动态地共同优化野火探测精度和传感器的活动时间。所提出的方法采用强化学习来训练一个策略,该策略将传感器操作作为环境(即当前传感器读数)、收集的能量和电池水平的函数来控制。所提出的网络物理系统使用现实生活中的温度、风和太阳能收集数据集和开源野火模拟器进行了广泛的评估。在长期(5年)评估中,建议的框架达到89%的正常运行时间,比精心调整的启发式方法高46%。同时,它的平均初始响应时间为2分钟,比相同的启发式方法至少快2.5倍。此外,使用TensorFlow Lite for Micro的TI CC2652R微控制器上的策略网络每天消耗0.6 mJ,与日常传感器套件能耗相比,这可以忽略不计。
{"title":"A Self-Sustained CPS Design for Reliable Wildfire Monitoring","authors":"Yigit Tuncel, Toygun Basaklar, Dina Carpenter-Graffy, Umit Ogras","doi":"10.1145/3608100","DOIUrl":"https://doi.org/10.1145/3608100","url":null,"abstract":"Continuous monitoring of areas nearby the electric grid is critical for preventing and early detection of devastating wildfires. Existing wildfire monitoring systems are intermittent and oblivious to local ambient risk factors, resulting in poor wildfire awareness. Ambient sensor suites deployed near the gridlines can increase the monitoring granularity and detection accuracy. However, these sensors must address two challenging and competing objectives at the same time. First, they must remain powered for years without manual maintenance due to their remote locations. Second, they must provide and transmit reliable information if and when a wildfire starts. The first objective requires aggressive energy savings and ambient energy harvesting, while the second requires continuous operation of a range of sensors. To the best of our knowledge, this paper presents the first self-sustained cyber-physical system that dynamically co-optimizes the wildfire detection accuracy and active time of sensors. The proposed approach employs reinforcement learning to train a policy that controls the sensor operations as a function of the environment (i.e., current sensor readings), harvested energy, and battery level. The proposed cyber-physical system is evaluated extensively using real-life temperature, wind, and solar energy harvesting datasets and an open-source wildfire simulator. In long-term (5 years) evaluations, the proposed framework achieves 89% uptime, which is 46% higher than a carefully tuned heuristic approach. At the same time, it averages a 2-minute initial response time, which is at least 2.5× faster than the same heuristic approach. Furthermore, the policy network consumes 0.6 mJ per day on the TI CC2652R microcontroller using TensorFlow Lite for Micro, which is negligible compared to the daily sensor suite energy consumption.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
DASS: Differentiable Architecture Search for Sparse Neural Networks 稀疏神经网络的可微结构搜索
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609385
Hamid Mousavi, Mohammad Loni, Mina Alibeigi, Masoud Daneshtalab
The deployment of Deep Neural Networks (DNNs) on edge devices is hindered by the substantial gap between performance requirements and available computational power. While recent research has made significant strides in developing pruning methods to build a sparse network for reducing the computing overhead of DNNs, there remains considerable accuracy loss, especially at high pruning ratios. We find that the architectures designed for dense networks by differentiable architecture search methods are ineffective when pruning mechanisms are applied to them. The main reason is that the current methods do not support sparse architectures in their search space and use a search objective that is made for dense networks and does not focus on sparsity. This paper proposes a new method to search for sparsity-friendly neural architectures. It is done by adding two new sparse operations to the search space and modifying the search objective. We propose two novel parametric SparseConv and SparseLinear operations in order to expand the search space to include sparse operations. In particular, these operations make a flexible search space due to using sparse parametric versions of linear and convolution operations. The proposed search objective lets us train the architecture based on the sparsity of the search space operations. Quantitative analyses demonstrate that architectures found through DASS outperform those used in the state-of-the-art sparse networks on the CIFAR-10 and ImageNet datasets. In terms of performance and hardware effectiveness, DASS increases the accuracy of the sparse version of MobileNet-v2 from 73.44% to 81.35% (+7.91% improvement) with a 3.87× faster inference time.
深度神经网络(dnn)在边缘设备上的部署受到性能要求和可用计算能力之间巨大差距的阻碍。虽然最近的研究在开发修剪方法以构建稀疏网络以减少dnn的计算开销方面取得了重大进展,但仍然存在相当大的准确性损失,特别是在高修剪比率时。我们发现用可微体系结构搜索方法为密集网络设计的体系结构在应用剪枝机制时是无效的。主要原因是当前的方法在其搜索空间中不支持稀疏架构,并且使用为密集网络设计的搜索目标,而不关注稀疏性。本文提出了一种搜索稀疏友好神经结构的新方法。它通过在搜索空间中增加两个新的稀疏操作和修改搜索目标来实现。我们提出了两种新的参数化的SparseConv和SparseLinear操作,以扩展搜索空间以包含稀疏操作。特别是,由于使用线性和卷积操作的稀疏参数版本,这些操作形成了一个灵活的搜索空间。提出的搜索目标使我们能够基于搜索空间操作的稀疏性来训练体系结构。定量分析表明,通过DASS找到的架构优于CIFAR-10和ImageNet数据集上最先进的稀疏网络中使用的架构。在性能和硬件效率方面,DASS将MobileNet-v2的稀疏版本的准确率从73.44%提高到81.35%(提高7.91%),推理时间提高了3.87倍。
{"title":"DASS: Differentiable Architecture Search for Sparse Neural Networks","authors":"Hamid Mousavi, Mohammad Loni, Mina Alibeigi, Masoud Daneshtalab","doi":"10.1145/3609385","DOIUrl":"https://doi.org/10.1145/3609385","url":null,"abstract":"The deployment of Deep Neural Networks (DNNs) on edge devices is hindered by the substantial gap between performance requirements and available computational power. While recent research has made significant strides in developing pruning methods to build a sparse network for reducing the computing overhead of DNNs, there remains considerable accuracy loss, especially at high pruning ratios. We find that the architectures designed for dense networks by differentiable architecture search methods are ineffective when pruning mechanisms are applied to them. The main reason is that the current methods do not support sparse architectures in their search space and use a search objective that is made for dense networks and does not focus on sparsity. This paper proposes a new method to search for sparsity-friendly neural architectures. It is done by adding two new sparse operations to the search space and modifying the search objective. We propose two novel parametric SparseConv and SparseLinear operations in order to expand the search space to include sparse operations. In particular, these operations make a flexible search space due to using sparse parametric versions of linear and convolution operations. The proposed search objective lets us train the architecture based on the sparsity of the search space operations. Quantitative analyses demonstrate that architectures found through DASS outperform those used in the state-of-the-art sparse networks on the CIFAR-10 and ImageNet datasets. In terms of performance and hardware effectiveness, DASS increases the accuracy of the sparse version of MobileNet-v2 from 73.44% to 81.35% (+7.91% improvement) with a 3.87× faster inference time.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Optimal Synthesis of Robust IDK Classifier Cascades 鲁棒IDK分类器级联的最优综合
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609129
Sanjoy Baruah, Alan Burns, Robert Ian Davis
An IDK classifier is a computing component that categorizes inputs into one of a number of classes, if it is able to do so with the required level of confidence, otherwise it returns “I Don’t Know” (IDK). IDK classifier cascades have been proposed as a way of balancing the needs for fast response and high accuracy in classification-based machine perception. Efficient algorithms for the synthesis of IDK classifier cascades have been derived; however, the responsiveness of these cascades is highly dependent on the accuracy of predictions regarding the run-time behavior of the classifiers from which they are built. Accurate predictions of such run-time behavior is difficult to obtain for many of the classifiers used for perception. By applying the algorithms using predictions framework, we propose efficient algorithms for the synthesis of IDK classifier cascades that are robust to inaccurate predictions in the following sense: the IDK classifier cascades synthesized by our algorithms have short expected execution durations when the predictions are accurate, and these expected durations increase only within specified bounds when the predictions are inaccurate.
IDK分类器是一个计算组件,如果它能够以所需的置信度将输入分类为多个类中的一个,否则它返回“我不知道”(IDK)。IDK分类器级联被提出作为一种平衡基于分类的机器感知对快速响应和高精度需求的方法。推导了IDK分类器级联综合的有效算法;然而,这些级联的响应性高度依赖于构建它们的分类器的运行时行为预测的准确性。对于用于感知的许多分类器来说,很难获得这种运行时行为的准确预测。通过应用使用预测框架的算法,我们提出了用于合成IDK分类器级联的有效算法,这些算法在以下意义上对不准确的预测具有鲁棒性:当预测准确时,由我们的算法合成的IDK分类器级联具有较短的预期执行持续时间,并且当预测不准确时,这些预期持续时间仅在指定的范围内增加。
{"title":"Optimal Synthesis of Robust IDK Classifier Cascades","authors":"Sanjoy Baruah, Alan Burns, Robert Ian Davis","doi":"10.1145/3609129","DOIUrl":"https://doi.org/10.1145/3609129","url":null,"abstract":"An IDK classifier is a computing component that categorizes inputs into one of a number of classes, if it is able to do so with the required level of confidence, otherwise it returns “I Don’t Know” (IDK). IDK classifier cascades have been proposed as a way of balancing the needs for fast response and high accuracy in classification-based machine perception. Efficient algorithms for the synthesis of IDK classifier cascades have been derived; however, the responsiveness of these cascades is highly dependent on the accuracy of predictions regarding the run-time behavior of the classifiers from which they are built. Accurate predictions of such run-time behavior is difficult to obtain for many of the classifiers used for perception. By applying the algorithms using predictions framework, we propose efficient algorithms for the synthesis of IDK classifier cascades that are robust to inaccurate predictions in the following sense: the IDK classifier cascades synthesized by our algorithms have short expected execution durations when the predictions are accurate, and these expected durations increase only within specified bounds when the predictions are inaccurate.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Keep in Balance: Runtime-reconfigurable Intermittent Deep Inference 保持平衡:运行时可重构的间歇深度推理
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3607918
Chih-Hsuan Yen, Hashan Roshantha Mendis, Tei-Wei Kuo, Pi-Cheng Hsiu
Intermittent deep neural network (DNN) inference is a promising technique to enable intelligent applications on tiny devices powered by ambient energy sources. Nonetheless, intermittent execution presents inherent challenges, primarily involving accumulating progress across power cycles and having to refetch volatile data lost due to power loss in each power cycle. Existing approaches typically optimize the inference configuration to maximize data reuse. However, we observe that such a fixed configuration may be significantly inefficient due to the fluctuating balance point between data reuse and data refetch caused by the dynamic nature of ambient energy. This work proposes DynBal , an approach to dynamically reconfigure the inference engine at runtime. DynBal is realized as a middleware plugin that improves inference performance by exploring the interplay between data reuse and data refetch to maintain their balance with respect to the changing level of intermittency. An indirect metric is developed to easily evaluate an inference configuration considering the variability in intermittency, and a lightweight reconfiguration algorithm is employed to efficiently optimize the configuration at runtime. We evaluate the improvement brought by integrating DynBal into a recent intermittent inference approach that uses a fixed configuration. Evaluations were conducted on a Texas Instruments device with various network models and under varied intermittent power strengths. Our experimental results demonstrate that DynBal can speed up intermittent inference by 3.26 times, achieving a greater improvement for a large network under high intermittency and a large gap between memory and computation performance.
间歇性深度神经网络(DNN)推理是一种很有前途的技术,可以在由环境能源驱动的微型设备上实现智能应用。尽管如此,间歇性执行带来了固有的挑战,主要涉及跨电源周期累积进度,并且必须重新获取由于每个电源周期中的电源丢失而丢失的易失性数据。现有的方法通常会优化推理配置以最大化数据重用。然而,我们观察到,由于环境能量的动态性导致数据重用和数据重取之间的平衡点波动,这种固定配置可能会显着降低效率。这项工作提出了DynBal,一种在运行时动态重新配置推理引擎的方法。DynBal是作为一个中间件插件实现的,它通过探索数据重用和数据重取之间的相互作用来提高推理性能,以保持它们在间歇性变化水平方面的平衡。考虑间歇性的可变性,提出了一种间接度量来方便地评估推理配置,并采用了一种轻量级的重构算法来有效地优化运行时的配置。我们评估了将DynBal集成到最近使用固定配置的间歇性推理方法中所带来的改进。在不同网络模型和不同间歇功率强度下,对德州仪器设备进行了评估。我们的实验结果表明,DynBal可以将间歇性推理的速度提高3.26倍,对于高间歇性和内存与计算性能之间较大差距的大型网络实现了更大的改进。
{"title":"Keep in Balance: Runtime-reconfigurable Intermittent Deep Inference","authors":"Chih-Hsuan Yen, Hashan Roshantha Mendis, Tei-Wei Kuo, Pi-Cheng Hsiu","doi":"10.1145/3607918","DOIUrl":"https://doi.org/10.1145/3607918","url":null,"abstract":"Intermittent deep neural network (DNN) inference is a promising technique to enable intelligent applications on tiny devices powered by ambient energy sources. Nonetheless, intermittent execution presents inherent challenges, primarily involving accumulating progress across power cycles and having to refetch volatile data lost due to power loss in each power cycle. Existing approaches typically optimize the inference configuration to maximize data reuse. However, we observe that such a fixed configuration may be significantly inefficient due to the fluctuating balance point between data reuse and data refetch caused by the dynamic nature of ambient energy. This work proposes DynBal , an approach to dynamically reconfigure the inference engine at runtime. DynBal is realized as a middleware plugin that improves inference performance by exploring the interplay between data reuse and data refetch to maintain their balance with respect to the changing level of intermittency. An indirect metric is developed to easily evaluate an inference configuration considering the variability in intermittency, and a lightweight reconfiguration algorithm is employed to efficiently optimize the configuration at runtime. We evaluate the improvement brought by integrating DynBal into a recent intermittent inference approach that uses a fixed configuration. Evaluations were conducted on a Texas Instruments device with various network models and under varied intermittent power strengths. Our experimental results demonstrate that DynBal can speed up intermittent inference by 3.26 times, achieving a greater improvement for a large network under high intermittency and a large gap between memory and computation performance.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Stochastic Analysis of Control Systems Subject to Communication and Computation Faults 通信和计算故障下控制系统的随机分析
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609123
Nils Vreman, Martina Maggio
Control theory allows one to design controllers that are robust to external disturbances, model simplification, and modelling inaccuracy. Researchers have investigated whether the robustness carries on to the controller’s digital implementation, mostly looking at how the controller reacts to either communication or computational problems. Communication problems are typically modelled using random variables (i.e., estimating the probability that a fault will occur during a transmission), while computational problems are modelled using deterministic guarantees on the number of deadlines that the control task has to meet. These fault models allow the engineer to both design robust controllers and assess the controllers’ behaviour in the presence of isolated faults. Despite being very relevant for the real-world implementations of control system, the question of what happens when these faults occur simultaneously does not yet have a proper answer. In this paper, we answer this question in the stochastic setting, using the theory of Markov Jump Linear Systems to provide stability contracts with almost sure guarantees of convergence. For linear time-invariant Markov jump linear systems, mean square stability implies almost sure convergence – a property that is central to our investigation. Our research primarily emphasises the validation of this property for closed-loop systems that are subject to packet losses and computational overruns, potentially occurring simultaneously. We apply our method to two case studies from the recent literature and show their robustness to a comprehensive set of faults. We employ closed-loop system simulations to empirically derive performance metrics that elucidate the quality of the controller implementation, such as the system settling time and the integral absolute error.
控制理论允许人们设计对外部干扰、模型简化和建模不准确具有鲁棒性的控制器。研究人员已经研究了鲁棒性是否会延续到控制器的数字实现中,主要是观察控制器对通信或计算问题的反应。通信问题通常使用随机变量进行建模(例如,估计在传输过程中发生故障的概率),而计算问题则使用控制任务必须满足的截止日期数量的确定性保证进行建模。这些故障模型允许工程师设计鲁棒控制器,并在孤立故障存在时评估控制器的行为。尽管与控制系统的实际实现非常相关,但当这些故障同时发生时会发生什么问题还没有一个适当的答案。本文利用马尔可夫跳变线性系统理论给出了收敛性几乎确定保证的稳定性契约,在随机环境下回答了这个问题。对于线性定常马尔可夫跳变线性系统,均方稳定性意味着几乎肯定的收敛性——这是我们研究的中心性质。我们的研究主要强调了闭环系统的这一特性的验证,这些系统可能同时发生数据包丢失和计算超支。我们将我们的方法应用于最近文献中的两个案例研究,并展示了它们对一组全面错误的鲁棒性。我们采用闭环系统模拟来经验推导性能指标,阐明控制器实现的质量,如系统稳定时间和积分绝对误差。
{"title":"Stochastic Analysis of Control Systems Subject to Communication and Computation Faults","authors":"Nils Vreman, Martina Maggio","doi":"10.1145/3609123","DOIUrl":"https://doi.org/10.1145/3609123","url":null,"abstract":"Control theory allows one to design controllers that are robust to external disturbances, model simplification, and modelling inaccuracy. Researchers have investigated whether the robustness carries on to the controller’s digital implementation, mostly looking at how the controller reacts to either communication or computational problems. Communication problems are typically modelled using random variables (i.e., estimating the probability that a fault will occur during a transmission), while computational problems are modelled using deterministic guarantees on the number of deadlines that the control task has to meet. These fault models allow the engineer to both design robust controllers and assess the controllers’ behaviour in the presence of isolated faults. Despite being very relevant for the real-world implementations of control system, the question of what happens when these faults occur simultaneously does not yet have a proper answer. In this paper, we answer this question in the stochastic setting, using the theory of Markov Jump Linear Systems to provide stability contracts with almost sure guarantees of convergence. For linear time-invariant Markov jump linear systems, mean square stability implies almost sure convergence – a property that is central to our investigation. Our research primarily emphasises the validation of this property for closed-loop systems that are subject to packet losses and computational overruns, potentially occurring simultaneously. We apply our method to two case studies from the recent literature and show their robustness to a comprehensive set of faults. We employ closed-loop system simulations to empirically derive performance metrics that elucidate the quality of the controller implementation, such as the system settling time and the integral absolute error.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
DTRL: Decision Tree-based Multi-Objective Reinforcement Learning for Runtime Task Scheduling in Domain-Specific System-on-Chips 基于决策树的多目标强化学习在特定领域片上系统中的运行时任务调度
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609108
Toygun Basaklar, A. Alper Goksoy, Anish Krishnakumar, Suat Gumussoy, Umit Y. Ogras
Domain-specific systems-on-chip (DSSoCs) combine general-purpose processors and specialized hardware accelerators to improve performance and energy efficiency for a specific domain. The optimal allocation of tasks to processing elements (PEs) with minimal runtime overheads is crucial to achieving this potential. However, this problem remains challenging as prior approaches suffer from non-optimal scheduling decisions or significant runtime overheads. Moreover, existing techniques focus on a single optimization objective, such as maximizing performance. This work proposes DTRL, a decision-tree-based multi-objective reinforcement learning technique for runtime task scheduling in DSSoCs. DTRL trains a single global differentiable decision tree (DDT) policy that covers the entire objective space quantified by a preference vector. Our extensive experimental evaluations using our novel reinforcement learning environment demonstrate that DTRL captures the trade-off between execution time and power consumption, thereby generating a Pareto set of solutions using a single policy. Furthermore, comparison with state-of-the-art heuristic–, optimization–, and machine learning-based schedulers shows that DTRL achieves up to 9× higher performance and up to 3.08× reduction in energy consumption. The trained DDT policy achieves 120 ns inference latency on Xilinx Zynq ZCU102 FPGA at 1.2 GHz, resulting in negligible runtime overheads. Evaluation on the same hardware shows that DTRL achieves up to 16% higher performance than a state-of-the-art heuristic scheduler.
特定领域的片上系统(dssoc)结合了通用处理器和专用硬件加速器,以提高特定领域的性能和能源效率。以最小的运行时开销将任务优化分配给处理元素(pe)对于实现这一潜力至关重要。然而,这个问题仍然具有挑战性,因为先前的方法受到非最佳调度决策或显著运行时开销的影响。此外,现有的技术只关注于单一的优化目标,例如最大化性能。这项工作提出了DTRL,一种基于决策树的多目标强化学习技术,用于dssoc的运行时任务调度。DTRL训练一个全局可微决策树(DDT)策略,该策略覆盖由偏好向量量化的整个目标空间。我们使用我们的新型强化学习环境进行了广泛的实验评估,证明DTRL捕获了执行时间和功耗之间的权衡,从而使用单个策略生成了Pareto解决方案集。此外,与最先进的启发式、优化和基于机器学习的调度器进行比较表明,DTRL实现了高达9倍的性能提升和高达3.08倍的能耗降低。训练后的DDT策略在Xilinx Zynq ZCU102 FPGA上实现了1.2 GHz的120 ns推理延迟,导致运行时开销可以忽略不计。对相同硬件的评估表明,DTRL的性能比最先进的启发式调度器高出16%。
{"title":"DTRL: Decision Tree-based Multi-Objective Reinforcement Learning for Runtime Task Scheduling in Domain-Specific System-on-Chips","authors":"Toygun Basaklar, A. Alper Goksoy, Anish Krishnakumar, Suat Gumussoy, Umit Y. Ogras","doi":"10.1145/3609108","DOIUrl":"https://doi.org/10.1145/3609108","url":null,"abstract":"Domain-specific systems-on-chip (DSSoCs) combine general-purpose processors and specialized hardware accelerators to improve performance and energy efficiency for a specific domain. The optimal allocation of tasks to processing elements (PEs) with minimal runtime overheads is crucial to achieving this potential. However, this problem remains challenging as prior approaches suffer from non-optimal scheduling decisions or significant runtime overheads. Moreover, existing techniques focus on a single optimization objective, such as maximizing performance. This work proposes DTRL, a decision-tree-based multi-objective reinforcement learning technique for runtime task scheduling in DSSoCs. DTRL trains a single global differentiable decision tree (DDT) policy that covers the entire objective space quantified by a preference vector. Our extensive experimental evaluations using our novel reinforcement learning environment demonstrate that DTRL captures the trade-off between execution time and power consumption, thereby generating a Pareto set of solutions using a single policy. Furthermore, comparison with state-of-the-art heuristic–, optimization–, and machine learning-based schedulers shows that DTRL achieves up to 9× higher performance and up to 3.08× reduction in energy consumption. The trained DDT policy achieves 120 ns inference latency on Xilinx Zynq ZCU102 FPGA at 1.2 GHz, resulting in negligible runtime overheads. Evaluation on the same hardware shows that DTRL achieves up to 16% higher performance than a state-of-the-art heuristic scheduler.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
GHOST: A Graph Neural Network Accelerator using Silicon Photonics GHOST:使用硅光子学的图形神经网络加速器
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609097
Salma Afifi, Febin Sunny, Amin Shafiee, Mahdi Nikdast, Sudeep Pasricha
Graph neural networks (GNNs) have emerged as a powerful approach for modelling and learning from graph-structured data. Multiple fields have since benefitted enormously from the capabilities of GNNs, such as recommendation systems, social network analysis, drug discovery, and robotics. However, accelerating and efficiently processing GNNs require a unique approach that goes beyond conventional artificial neural network accelerators, due to the substantial computational and memory requirements of GNNs. The slowdown of scaling in CMOS platforms also motivates a search for alternative implementation substrates. In this paper, we present GHOST , the first silicon-photonic hardware accelerator for GNNs. GHOST efficiently alleviates the costs associated with both vertex-centric and edge-centric operations. It implements separately the three main stages involved in running GNNs in the optical domain, allowing it to be used for the inference of various widely used GNN models and architectures, such as graph convolution networks and graph attention networks. Our simulation studies indicate that GHOST exhibits at least 10.2 × better throughput and 3.8 × better energy efficiency when compared to GPU, TPU, CPU and multiple state-of-the-art GNN hardware accelerators.
图神经网络(gnn)已经成为一种强大的建模和从图结构数据中学习的方法。从那以后,许多领域都从gnn的能力中受益匪浅,比如推荐系统、社交网络分析、药物发现和机器人技术。然而,加速和有效处理gnn需要一种超越传统人工神经网络加速器的独特方法,因为gnn需要大量的计算和内存需求。CMOS平台的扩展速度放缓也促使人们寻找替代的实现基板。在本文中,我们提出了GHOST,第一个用于gnn的硅光子硬件加速器。GHOST有效地降低了以顶点为中心和以边缘为中心操作的相关成本。它分别实现了在光学域中运行GNN所涉及的三个主要阶段,使其能够用于各种广泛使用的GNN模型和架构的推理,例如图卷积网络和图注意网络。我们的仿真研究表明,与GPU、TPU、CPU和多个最先进的GNN硬件加速器相比,GHOST具有至少10.2倍的吞吐量和3.8倍的能效。
{"title":"GHOST: A Graph Neural Network Accelerator using Silicon Photonics","authors":"Salma Afifi, Febin Sunny, Amin Shafiee, Mahdi Nikdast, Sudeep Pasricha","doi":"10.1145/3609097","DOIUrl":"https://doi.org/10.1145/3609097","url":null,"abstract":"Graph neural networks (GNNs) have emerged as a powerful approach for modelling and learning from graph-structured data. Multiple fields have since benefitted enormously from the capabilities of GNNs, such as recommendation systems, social network analysis, drug discovery, and robotics. However, accelerating and efficiently processing GNNs require a unique approach that goes beyond conventional artificial neural network accelerators, due to the substantial computational and memory requirements of GNNs. The slowdown of scaling in CMOS platforms also motivates a search for alternative implementation substrates. In this paper, we present GHOST , the first silicon-photonic hardware accelerator for GNNs. GHOST efficiently alleviates the costs associated with both vertex-centric and edge-centric operations. It implements separately the three main stages involved in running GNNs in the optical domain, allowing it to be used for the inference of various widely used GNN models and architectures, such as graph convolution networks and graph attention networks. Our simulation studies indicate that GHOST exhibits at least 10.2 × better throughput and 3.8 × better energy efficiency when compared to GPU, TPU, CPU and multiple state-of-the-art GNN hardware accelerators.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
ANV-PUF: Machine-Learning-Resilient NVM-Based Arbiter PUF ANV-PUF:基于机器学习弹性nvm的仲裁PUF
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609388
Hassan Nassar, Lars Bauer, Jörg Henkel
Physical Unclonable Functions (PUFs) have been widely considered an attractive security primitive. They use the deviations in the fabrication process to have unique responses from each device. Due to their nature, they serve as a DNA-like identity of the device. But PUFs have also been targeted for attacks. It has been proven that machine learning (ML) can be used to effectively model a PUF design and predict its behavior, leading to leakage of the internal secrets. To combat such attacks, several designs have been proposed to make it harder to model PUFs. One design direction is to use Non-Volatile Memory (NVM) as the building block of the PUF. NVM typically are multi-level cells, i.e, they have several internal states, which makes it harder to model them. However, the current state of the art of NVM-based PUFs is limited to ‘weak PUFs’, i.e., the number of outputs grows only linearly with the number of inputs, which limits the number of possible secret values that can be stored using the PUF. To overcome this limitation, in this work we design the Arbiter Non-Volatile PUF (ANV-PUF) that is exponential in the number of inputs and that is resilient against ML-based modeling. The concept is based on the famous delay-based Arbiter PUF (which is not resilient against modeling attacks) while using NVM as a building block instead of switches. Hence, we replace the switch delays (which are easy to model via ML) with the multi-level property of NVM (which is hard to model via ML). Consequently, our design has the exponential output characteristics of the Arbiter PUF and the resilience against attacks from the NVM-based PUFs. Our results show that the resilience to ML modeling, uniqueness, and uniformity are all in the ideal range of 50%. Thus, in contrast to the state-of-the-art, ANV-PUF is able to be resilient to attacks, while having an exponential number of outputs.
物理不可克隆函数(puf)被广泛认为是一种有吸引力的安全原语。他们利用制造过程中的偏差来对每个设备产生独特的响应。由于它们的性质,它们作为设备的dna一样的身份。但是puf也成为了攻击的目标。已经证明,机器学习(ML)可以用来有效地建模PUF设计并预测其行为,从而导致内部秘密的泄露。为了对抗这种攻击,已经提出了几种设计,以使puf更难建模。一个设计方向是使用非易失性存储器(NVM)作为PUF的构建块。NVM通常是多级单元,也就是说,它们有几个内部状态,这使得建模变得更加困难。然而,目前基于nvm的PUF技术仅限于“弱PUF”,即输出的数量仅随输入的数量线性增长,这限制了可以使用PUF存储的可能的秘密值的数量。为了克服这一限制,在这项工作中,我们设计了仲裁者非易失性PUF (ANV-PUF),它的输入数量呈指数增长,并且对基于ml的建模具有弹性。这个概念是基于著名的基于延迟的仲裁器PUF(它对建模攻击没有弹性),同时使用NVM作为构建块而不是交换机。因此,我们用NVM的多级属性(很难通过ML建模)取代了开关延迟(很容易通过ML建模)。因此,我们的设计具有Arbiter PUF的指数输出特性和抵御基于nvm的PUF攻击的弹性。我们的结果表明,对ML建模的弹性、唯一性和均匀性都在50%的理想范围内。因此,与最先进的技术相比,ANV-PUF能够抵御攻击,同时具有指数级的输出数量。
{"title":"ANV-PUF: Machine-Learning-Resilient NVM-Based Arbiter PUF","authors":"Hassan Nassar, Lars Bauer, Jörg Henkel","doi":"10.1145/3609388","DOIUrl":"https://doi.org/10.1145/3609388","url":null,"abstract":"Physical Unclonable Functions (PUFs) have been widely considered an attractive security primitive. They use the deviations in the fabrication process to have unique responses from each device. Due to their nature, they serve as a DNA-like identity of the device. But PUFs have also been targeted for attacks. It has been proven that machine learning (ML) can be used to effectively model a PUF design and predict its behavior, leading to leakage of the internal secrets. To combat such attacks, several designs have been proposed to make it harder to model PUFs. One design direction is to use Non-Volatile Memory (NVM) as the building block of the PUF. NVM typically are multi-level cells, i.e, they have several internal states, which makes it harder to model them. However, the current state of the art of NVM-based PUFs is limited to ‘weak PUFs’, i.e., the number of outputs grows only linearly with the number of inputs, which limits the number of possible secret values that can be stored using the PUF. To overcome this limitation, in this work we design the Arbiter Non-Volatile PUF (ANV-PUF) that is exponential in the number of inputs and that is resilient against ML-based modeling. The concept is based on the famous delay-based Arbiter PUF (which is not resilient against modeling attacks) while using NVM as a building block instead of switches. Hence, we replace the switch delays (which are easy to model via ML) with the multi-level property of NVM (which is hard to model via ML). Consequently, our design has the exponential output characteristics of the Arbiter PUF and the resilience against attacks from the NVM-based PUFs. Our results show that the resilience to ML modeling, uniqueness, and uniformity are all in the ideal range of 50%. Thus, in contrast to the state-of-the-art, ANV-PUF is able to be resilient to attacks, while having an exponential number of outputs.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
CIM: A Novel Clustering-based Energy-Efficient Data Imputation Method for Human Activity Recognition 基于聚类的人类活动识别节能数据输入方法
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609111
Dina Hussein, Ganapati Bhat
Human activity recognition (HAR) is an important component in a number of health applications, including rehabilitation, Parkinson’s disease, daily activity monitoring, and fitness monitoring. State-of-the-art HAR approaches use multiple sensors on the body to accurately identify activities at runtime. These approaches typically assume that data from all sensors are available for runtime activity recognition. However, data from one or more sensors may be unavailable due to malfunction, energy constraints, or communication challenges between the sensors. Missing data can lead to significant degradation in the accuracy, thus affecting quality of service to users. A common approach for handling missing data is to train classifiers or sensor data recovery algorithms for each combination of missing sensors. However, this results in significant memory and energy overhead on resource-constrained wearable devices. In strong contrast to prior approaches, this paper presents a clustering-based approach (CIM) to impute missing data at runtime. We first define a set of possible clusters and representative data patterns for each sensor in HAR. Then, we create and store a mapping between clusters across sensors. At runtime, when data from a sensor are missing, we utilize the stored mapping table to obtain most likely cluster for the missing sensor. The representative window for the identified cluster is then used as imputation to perform activity classification. We also provide a method to obtain imputation-aware activity prediction sets to handle uncertainty in data when using imputation. Experiments on three HAR datasets show that CIM achieves accuracy within 10% of a baseline without missing data for one missing sensor when providing single activity labels. The accuracy gap drops to less than 1% with imputation-aware classification. Measurements on a low-power processor show that CIM achieves close to 100% energy savings compared to state-of-the-art generative approaches.
人体活动识别(HAR)是许多健康应用的重要组成部分,包括康复、帕金森病、日常活动监测和健身监测。最先进的HAR方法使用身体上的多个传感器来准确识别运行时的活动。这些方法通常假设来自所有传感器的数据可用于运行时活动识别。然而,由于故障、能量限制或传感器之间的通信问题,来自一个或多个传感器的数据可能不可用。数据缺失会导致准确性的显著降低,从而影响对用户的服务质量。处理丢失数据的常用方法是为丢失的每个传感器组合训练分类器或传感器数据恢复算法。然而,这在资源有限的可穿戴设备上导致了显著的内存和能量开销。与之前的方法形成鲜明对比的是,本文提出了一种基于聚类的方法(CIM)来在运行时输入缺失数据。我们首先为HAR中的每个传感器定义了一组可能的集群和代表性数据模式。然后,我们创建并存储跨传感器集群之间的映射。在运行时,当来自传感器的数据丢失时,我们利用存储的映射表来获得丢失传感器的最可能聚类。然后将识别的集群的代表性窗口用作输入来执行活动分类。我们还提供了一种方法,以获得估算感知的活动预测集,以处理数据在使用估算时的不确定性。在三个HAR数据集上的实验表明,当提供单个活动标签时,CIM在不丢失数据的情况下实现了10%的基线精度。使用假设感知分类,准确率差距降至1%以下。对低功耗处理器的测量表明,与最先进的生成方法相比,CIM实现了接近100%的节能。
{"title":"CIM: A Novel Clustering-based Energy-Efficient Data Imputation Method for Human Activity Recognition","authors":"Dina Hussein, Ganapati Bhat","doi":"10.1145/3609111","DOIUrl":"https://doi.org/10.1145/3609111","url":null,"abstract":"Human activity recognition (HAR) is an important component in a number of health applications, including rehabilitation, Parkinson’s disease, daily activity monitoring, and fitness monitoring. State-of-the-art HAR approaches use multiple sensors on the body to accurately identify activities at runtime. These approaches typically assume that data from all sensors are available for runtime activity recognition. However, data from one or more sensors may be unavailable due to malfunction, energy constraints, or communication challenges between the sensors. Missing data can lead to significant degradation in the accuracy, thus affecting quality of service to users. A common approach for handling missing data is to train classifiers or sensor data recovery algorithms for each combination of missing sensors. However, this results in significant memory and energy overhead on resource-constrained wearable devices. In strong contrast to prior approaches, this paper presents a clustering-based approach (CIM) to impute missing data at runtime. We first define a set of possible clusters and representative data patterns for each sensor in HAR. Then, we create and store a mapping between clusters across sensors. At runtime, when data from a sensor are missing, we utilize the stored mapping table to obtain most likely cluster for the missing sensor. The representative window for the identified cluster is then used as imputation to perform activity classification. We also provide a method to obtain imputation-aware activity prediction sets to handle uncertainty in data when using imputation. Experiments on three HAR datasets show that CIM achieves accuracy within 10% of a baseline without missing data for one missing sensor when providing single activity labels. The accuracy gap drops to less than 1% with imputation-aware classification. Measurements on a low-power processor show that CIM achieves close to 100% energy savings compared to state-of-the-art generative approaches.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
MaGNAS: A Mapping-Aware Graph Neural Architecture Search Framework for Heterogeneous MPSoC Deployment 面向异构MPSoC部署的映射感知图神经结构搜索框架
3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609386
Mohanad Odema, Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Mohammad Abdullah Al Faruque
Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity in modeling structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by the recent advancements in heterogeneous multi-processor Systems on Chips (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. Particularly, we develop MaGNAS, a mapping-aware Graph Neural Architecture Search framework. MaGNAS proposes a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify optimal GNNs and mapping pairings that yield the best performance trade-offs. Through designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four (04) state-of-the-art vision datasets using both ( i ) a real hardware SoC platform (NVIDIA Xavier AGX) and ( ii ) a performance/cost model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS is able to provide 1.57 × latency speedup and is 3.38 × more energy-efficient for several vision datasets executed on the Xavier MPSoC vs. the GPU-only deployment while sustaining an average 0.11% accuracy reduction from the baseline.
图神经网络(gnn)在基于视觉的应用中越来越受欢迎,因为它们具有对图像框架各部分之间的结构和上下文关系进行建模的内在能力。另一方面,基于深度视觉的边缘应用的日益普及,得益于异构多处理器片上系统(mpsoc)的最新进展,这些应用能够在实时、严格的执行要求下进行推理。通过扩展,用于基于视觉的应用程序的gnn必须遵守相同的执行要求。然而,与典型的深度神经网络相反,图学习操作的不规则流给在这种异构MPSoC平台上运行gnn带来了挑战。在本文中,我们提出了一种新的统一设计映射方法,用于在异构MPSoC平台上高效处理视觉GNN工作负载。特别地,我们开发了MaGNAS,一个映射感知的图神经结构搜索框架。MaGNAS提出了一个GNN架构设计空间,结合异构SoC上的前瞻性映射选项,以确定最大限度提高设备上资源效率的模型架构。为了实现这一目标,MaGNAS采用两层进化搜索来识别最佳gnn和映射配对,从而产生最佳的性能权衡。通过设计一个源自最新视觉GNN (ViG)架构的超级网络,我们使用(i)真正的硬件SoC平台(NVIDIA Xavier AGX)和(ii) DNN加速器的性能/成本模型模拟器在四(04)个最先进的视觉数据集上进行了实验。我们的实验结果表明,对于在Xavier MPSoC上执行的多个视觉数据集,与仅使用gpu的部署相比,MaGNAS能够提供1.57倍的延迟加速和3.38倍的能效,同时保持比基线平均0.11%的精度降低。
{"title":"MaGNAS: A Mapping-Aware Graph Neural Architecture Search Framework for Heterogeneous MPSoC Deployment","authors":"Mohanad Odema, Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Mohammad Abdullah Al Faruque","doi":"10.1145/3609386","DOIUrl":"https://doi.org/10.1145/3609386","url":null,"abstract":"Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity in modeling structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by the recent advancements in heterogeneous multi-processor Systems on Chips (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. Particularly, we develop MaGNAS, a mapping-aware Graph Neural Architecture Search framework. MaGNAS proposes a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify optimal GNNs and mapping pairings that yield the best performance trade-offs. Through designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four (04) state-of-the-art vision datasets using both ( i ) a real hardware SoC platform (NVIDIA Xavier AGX) and ( ii ) a performance/cost model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS is able to provide 1.57 × latency speedup and is 3.38 × more energy-efficient for several vision datasets executed on the Xavier MPSoC vs. the GPU-only deployment while sustaining an average 0.11% accuracy reduction from the baseline.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
期刊
ACM Transactions on Embedded Computing Systems
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1