Pub Date : 2025-06-12 | DOI: 10.1109/TCAD.2025.3579326
Sriparna Mandal;Surajeet Ghosh
Next-generation sequencing must cope with exponential growth in sequence databases; the primary challenge is aligning short-read sequences in a time-efficient manner. Despite numerous efforts in contemporary research, existing approaches face tradeoffs among time, power consumption, and resource constraints. A hardware accelerator is presented that uses the Burrows-Wheeler Transformation without any sequence terminator to perform short-read alignment at hardware speed, avoiding the additional storage, operations, and power consumption the terminator incurs. Further, a hardware-based binary search scheme is introduced to reduce the power consumption of the accelerator. As an alternative, a parallel searching mechanism is introduced that completes the search in a single clock cycle. The accelerator is evaluated for 64-to-256 nucleotide reference sequences and 32-to-56 nucleotide query sequences. The parallel search scheme takes ≈11% less time than the binary search-based scheme while consuming ≈1.6%–3.7% more resources and ≈4.5%–23% more power. Compared with the with-terminator method, the accelerator achieves a ≈31.01%–33.13% gain in processing time, ≈31.28%–34.47% saving in hardware resources, ≈33.08%–33.29% saving in storage, and ≈14.03%–50.79% gain in power consumption. Finally, the accelerator exhibits a ≈52× gain in throughput over state-of-the-art architectures without involving any terminator or external memory.
{"title":"Hardware Accelerator for Short-Read DNA Sequence Alignment Using Burrows-Wheeler Transformation","authors":"Sriparna Mandal;Surajeet Ghosh","doi":"10.1109/TCAD.2025.3579326","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3579326","url":null,"abstract":"Next-generation sequencing deals with exponential growth in sequence databases; the primary challenge is aligning short-read sequences in a time-efficient manner. Despite numerous efforts in contemporary research, they unfortunately face tradeoff issues related to time, power consumption, and resource constraints. A hardware accelerator is presented utilizing the Burrows-Wheeler Transformation without involving any sequence terminator to perform short-read sequencing at hardware speed, which eases additional storage, operations, and power consumption. Further, a hardware-based binary search scheme is introduced to reduce power consumption of the accelerator. As an alternative, a parallel searching mechanism is introduced to accomplish the searching operation in a single clock-cycle. The accelerator is evaluated for 64-to-256 nucleotide reference sequences and 32-to-56 nucleotide query sequences. The parallel search scheme consumes ≈11% less time than the binary search-based scheme, consuming ≈1.6–3.7% and ≈4.5%–23% more resources and power. While comparing the accelerator with the with-terminator method, it achieves ≈31.01%–33.13% gain in processing time, ≈31.28%–34.47% saving in hardware resource, ≈33.08%–33.29% saving in storage, and ≈14.03%–50.79% gain in power consumption. Finally, this accelerator exhibits a gain of <inline-formula> <tex-math>$approx 52times $ </tex-math></inline-formula> in throughput without involving any terminator and external memory compared to state-of-the-art architectures.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"547-551"},"PeriodicalIF":2.9,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-10 | DOI: 10.1109/TCAD.2025.3578328
Chen-Yu Hsieh;Yu-En Lin;Yi-Yu Liu
With the increasing number of I/O pins in highly integrated semiconductor products, semiconductor packaging has become an essential yet complex part of integrated circuit (IC) design. The substrate plays an important role in advanced semiconductor packaging, providing the chip with electrical connections and heat dissipation. While numerous studies have addressed the substrate routing problem, only one state-of-the-art work provides a customized routing flow specifically designed for wire-bonding packages with fine-pitch ball grid arrays (FBGAs), which are more widely used than advanced packaging due to their maturity and lower cost. However, the existing router suffers from unsatisfactory routability due to its simplistic implementation and insufficient consideration of finger connections. Therefore, this article proposes several optimization heuristics, such as finger accessibility enhancement, progressive rerouting, and half-grid rerouting, to further improve the overall routing completion rate. Experimental results show that the proposed heuristics avoid routing resource wastage, achieve better routing quality, and eliminate design-rule violations.
{"title":"Optimization Heuristics for Grid-Based Integer Linear Programming Package Substrate Router","authors":"Chen-Yu Hsieh;Yu-En Lin;Yi-Yu Liu","doi":"10.1109/TCAD.2025.3578328","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3578328","url":null,"abstract":"With the increasing number of I/O pins in highly integrated semiconductor products, semiconductor packaging has become an essential yet complex part of integrated circuit (IC) design. The substrate plays an important role in advanced semiconductor packaging and provides the chip with electrical connections and heat dissipation. While numerous studies have addressed the substrate routing problem, only one state-of-the-art work provides a customized routing flow specifically designed for packages with wire-bonding style and fine-pitch ball grid arrays (FBGA), which are more widely used than advanced packaging due to their maturity and lower cost. However, the existing router suffers from unsatisfactory routability due to its simplistic implementation and lack of necessary consideration for finger connections. Therefore, this article proposes several optimization heuristics, such as finger accessibility enhancement, progressive rerouting, and half-grid rerouting techniques, to further improve the overall routing completion rate. Experimental results show that the proposed heuristics are capable of avoiding routing resource wastage, achieving better routing quality, and eliminating design-rule violations.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"552-556"},"PeriodicalIF":2.9,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-09 | DOI: 10.1109/TCAD.2025.3577971
Siyuan Liang;Zhen Zhuang;Kai-Yuan Chao;Bei Yu;Tsung-Yi Ho
Recently, the challenge of integrating an increasing number of transistors on a single die to adhere to Moore's Law has spurred the need for innovative packaging solutions. Power/ground planes are integral to packages, and designers typically strive to maximize their size: large planes provide shielding and maintain constant impedance for adjacent high-speed signal wires, benefiting signal integrity, and they also help reduce DC IR drops, enhancing power integrity. However, the necessity for multiple power/ground nets, each requiring independent power/ground planes within a package, makes the optimal allocation of limited free space a complex task. This article introduces a game-theoretic optimization method aimed at evenly mitigating DC IR drops across multilayer package power/ground planes. By formulating the ideal power/ground plane design as a game, we enhance the use of package space and realize a design with evenly distributed DC IR drops across all power/ground planes; this is accomplished by adjusting strategies until the allocation of free space reaches a Nash equilibrium. Additionally, we propose a rapid multilayer power/ground plane DC IR drop evaluation and a power/ground plane legalization method to support our optimization approach.
{"title":"Multilayer Package Power/Ground Planes Synthesis With Balanced DC IR Drops: A Game-Theoretic Optimization Approach","authors":"Siyuan Liang;Zhen Zhuang;Kai-Yuan Chao;Bei Yu;Tsung-Yi Ho","doi":"10.1109/TCAD.2025.3577971","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3577971","url":null,"abstract":"Recently, the challenge of integrating an increasing number of transistors on a single die to adhere to Moore’s Law has spurred the need for innovative packaging solutions. Power/ground planes are integral to packages, and designers typically strive to maximize their size. This provides shielding and maintains constant impedance for adjacent high-speed signal wires, benefiting signal integrity. Additionally, large power/ground planes help reduce DC IR drops, enhancing power integrity. However, the necessity for multiple power/ground nets, each requiring independent power/ground planes within a package, makes the optimal allocation of limited free space a complex task. This article introduces a game-theoretic optimization method aimed at evenly mitigating DC IR drops across the multilayer package power/ground planes. In the formulated game of achieving the ideal power/ground plane design, we can enhance the use of package space and realize a design with evenly distributed DC IR drops across all power/ground planes. This is accomplished by adjusting strategies and reaching a state of Nash equilibrium in the allocation of free space. Additionally, we propose a rapid multilayer power/ground plane DC IR drop evaluation and a power/ground plane legalization method to bolster our optimization method.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"453-465"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11028916","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-09 | DOI: 10.1109/TCAD.2025.3578297
Cristiana Bolchini;Alberto Bosio;Luca Cassano;Antonio Miele;Salvatore Pappalardo;Dario Passarello;Annachiara Ruospo;Ernesto Sanchez;Matteo Sonza Reorda;Vittorio Turco
The reliability assessment of systems powered by artificial intelligence (AI) is becoming a crucial step prior to their deployment in safety- and mission-critical systems. Recently, many efforts have been made to develop sophisticated techniques to evaluate and improve the resilience of AI models against random hardware faults. However, due to the intrinsic nature of such models, comparing the results obtained in state-of-the-art works is difficult, as common reference models are missing. Moreover, resilience is strongly influenced by the training process, the adopted framework, the data representation, and so on. To provide a common ground for future research on convolutional neural network (CNN) resilience analysis/hardening, this work proposes a first benchmark suite of deep learning (DL) models commonly adopted in this context, providing the models, the training/test data, and the resilience-related information (fault list, coverage, etc.) that can be used as a baseline for fair comparison. To this end, this research identifies a set of axes that have an impact on resilience and classifies several popular CNN models, in both PyTorch and TensorFlow, along these axes. Some final considerations are drawn, showing the relevance of a benchmark suite tailored to the resilience context.
{"title":"Benchmark Suite for Resilience Assessment of Deep Learning Models","authors":"Cristiana Bolchini;Alberto Bosio;Luca Cassano;Antonio Miele;Salvatore Pappalardo;Dario Passarello;Annachiara Ruospo;Ernesto Sanchez;Matteo Sonza Reorda;Vittorio Turco","doi":"10.1109/TCAD.2025.3578297","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3578297","url":null,"abstract":"The reliability assessment of systems powered by artificial intelligence (AI) is becoming a crucial step prior to their deployment in safety and mission-critical systems. Recently, many efforts have been made to develop sophisticated techniques to evaluate and improve the resilience of AI models against the occurrence of random hardware faults. However, due to the intrinsic nature of such models, the comparison of the results obtained in state-of-the-art works is crucial, as reference models are missing. Moreover, their resilience is strongly influenced by the training process, the adopted framework and data representation, and so on. To enable a common ground for future research targeting convolutional neural networks (CNNs) resilience analysis/hardening, this work proposes a first benchmark suite of deep learning (DL) models commonly adopted in this context, providing the models, the training/test data, and the resilience-related information (fault list, coverage, etc.) that can be used as a baseline for fair comparison. To this end, this research identifies a set of axes that have an impact on the resilience and classifies some popular CNN models, in both PyTorch and TensorFlow. Some final considerations are drawn, showing the relevance of a benchmark suite tailored for the resilience context.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"418-427"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029030","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-06 | DOI: 10.1109/TCAD.2025.3577539
Lian Yao;Jigang Wu;Peng Liu;Siew-Kei Lam
This article presents a comprehensive synthesis framework, named DAGSIS, for memristor-aided logic (MAGIC)-based in-memory computing systems. DAGSIS addresses the limitations of prior works, such as overlooking the benefits of MAGIC's high fan-in capability and the impact of global netlist properties on the scheduling of the computation sequence (CS). DAGSIS performs optimization in two synthesis stages. In the technology-independent optimization stage, DAGSIS encourages the merging of nodes in the network to reduce circuit size by exploiting equivalent transformations of multiplexers (MUXes). In the CS scheduling stage, DAGSIS introduces two schemes for optimizing area overhead and latency, respectively. For area optimization, DAGSIS maximizes the utilization of memristive cells by erasing expired data as early as possible. For latency optimization, DAGSIS aims to minimize erasing operations by maximizing the number of erased cells in each epoch of filling the memory. To achieve better CS scheduling, DAGSIS introduces two design rules that fully consider global attributes of the circuit, such as critical paths and high fan-out nodes. Experimental results show that DAGSIS reduces circuit size by 6.69% on the ISCAS'85 benchmarks compared to ABC, an open-source logic synthesis framework. Compared to state-of-the-art works, DAGSIS achieves reductions of 40.68% and 12.67% in area overhead and erasing operations, respectively, on the ISCAS'85 and EPFL benchmarks. These improvements further translate into a reduction in energy consumption of up to 13.7%.
{"title":"DAGSIS: A DAG-Aware MAGIC-Based Synthesis Framework for In-Memory Computing","authors":"Lian Yao;Jigang Wu;Peng Liu;Siew-Kei Lam","doi":"10.1109/TCAD.2025.3577539","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3577539","url":null,"abstract":"This article presents a comprehensive synthesis framework, named DAGSIS, for memristor-aided logic (MAGIC)-based in-memory computing system. DAGSIS addresses the limitations of prior works, such as overlooking the benefits of MAGIC’s high fan-in capability and the impact of global properties of netlists on the scheduling of computation sequence (CS). DAGSIS achieves the optimization in two synthesis stages. In the technology-independent optimization stage, DAGSIS encourages the merging of nodes in the network to reduce circuit size, by utilizing equivalent transformation of multiplexer (MUX). In the CS scheduling stage, DAGSIS introduces two schemes for optimizing area overhead and latency, respectively. For area optimization, DAGSIS maximizes the utilization of memristive cells by erasing the expired data as early as possible. For latency optimization, DAGSIS aims to minimize erasing operations, by maximizing the number of erased cells in each epoch of filling the memory. To achieve better CS scheduling, DAGSIS introduces two design rules to guide CS scheduling, which fully considers the global attributes of circuit design, such as critical path and high fan-out nodes. Experiment results show that DAGSIS reduces the circuit size by 6.69% on ISCAS’85 benchmarks compared to ABC tool, an open-source logic synthesis framework. Compared to the state-of-the-art works, DAGSIS achieves a reduction of 40.68% and 12.67% in area overhead and erasing operations, respectively, on ISCAS’85 and EPFL benchmarks. The improvements are further translated into the reduction in energy consumption by up to 13.7%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"373-386"},"PeriodicalIF":2.9,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-05 | DOI: 10.1109/TCAD.2025.3577018
Alex Goulet;Roni Khazaka
A parallel non-Monte Carlo transient noise analysis method for efficient general nonlinear analysis is presented. The proposed method extends a previous method to include flicker noise. The implementation of the proposed method in a SPICE-like circuit simulator is described. Additional practical considerations are discussed. Higher parallel efficiency is achieved by balancing the parallel loads. The optimal number of processors is automatically selected as part of load balancing. A new time domain flicker noise circuit representation that increases the computational efficiency of the proposed method and the underlying serial method is presented. Three examples of transient noise analysis are provided: a low-noise amplifier circuit, a mixer circuit, and a distributed amplifier circuit.
{"title":"Parallel Non-Monte Carlo Transient Noise Simulation With Flicker Noise","authors":"Alex Goulet;Roni Khazaka","doi":"10.1109/TCAD.2025.3577018","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3577018","url":null,"abstract":"A parallel non-Monte Carlo transient noise analysis method for efficient general nonlinear analysis is presented. The proposed method extends a previous method to include flicker noise. The implementation of the proposed method in a SPICE-like circuit simulator is described. Additional practical considerations are discussed. Higher parallel efficiency is achieved by balancing the parallel loads. The optimal number of processors is automatically selected as part of load balancing. A new time domain flicker noise circuit representation that increases the computational efficiency of the proposed method and the underlying serial method is presented. Three examples of transient noise analysis are provided: a low-noise amplifier circuit, a mixer circuit, and a distributed amplifier circuit.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"323-334"},"PeriodicalIF":2.9,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-04 | DOI: 10.1109/TCAD.2025.3576314
Jaeyoung Joung;Sangjun Lee;Jongho Park;Jaehyun Kim;Laesang Jung;Sungho Kang
Scan is one of the representative design-for-testability (DFT) techniques for testing sequential circuits. However, the additional hardware overhead and performance degradation caused by scan insertion can be unacceptable in specific designs. Partial scan has been applied as an alternative to full scan to balance these issues. However, previous cell selection algorithms incur high computational complexity that depends on the number of circuit components, including flip-flops, and do not sufficiently consider the analysis of large-scale circuits. In this article, a graph theory-based partial scan approach is proposed to effectively address the issues caused by scan insertion and reduce the load of structural analysis. The proposed algorithm partitions the circuit into multiple portions using graph clustering. Scan cells are selected from each subgraph to reduce sequential test generation complexity and improve testability. By analyzing the circuit partially, the proposed approach not only addresses the complexity problem of structural analysis in large-scale circuits but can also be applied generally, regardless of circuit size or the number of components. The experimental results show that the proposed algorithm achieves significantly reduced processing time (in seconds) and reduces scan cells by approximately 11.47% with only 0.21% test coverage loss on average compared to a full scan design.
{"title":"CLAPS: A Graph Clustering-Based Approach for Partial Scan Design","authors":"Jaeyoung Joung;Sangjun Lee;Jongho Park;Jaehyun Kim;Laesang Jung;Sungho Kang","doi":"10.1109/TCAD.2025.3576314","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3576314","url":null,"abstract":"Scan is one of the representative design for testability (DFT) techniques designed to test sequential circuits. However, the additional hardware overhead and performance degradation caused by scan insertion can be unacceptable in specific designs. Partial scan has been applied as an alternative to the scan to balance these issues. However, previous cell selection algorithms accompany high computational complexity depending on the number of circuit components, including flip-flops, and do not sufficiently consider the analysis of large-scale circuits. In this article, a graph theory-based partial scan approach is proposed to effectively address the issues caused by scan insertion and reduce the load of structural analysis. The proposed algorithm partitions the circuit into multiple portions using graph clustering. Scan cells are selected from each subgraph to reduce sequential test generation complexity and improve testability. By partially analyzing the circuit, the proposed approach not only addresses the complexity problem of structural analysis in large-scale circuits but also can be generally applied regardless of circuit size or the number of components. The experimental results show that the proposed algorithm achieves significantly reduced processing time in seconds and reduces scan cells by approximately 11.47% with only 0.21% of test coverage loss on average compared to full scan design.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"396-406"},"PeriodicalIF":2.9,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-04 | DOI: 10.1109/TCAD.2025.3576332
Weiran Chen;Zaitian Chen;Bei Yu;Song Chen;Yi Kang;Qi Xu
Recently, resistive switching random access memory (ReRAM)-based hardware accelerators have demonstrated unprecedented performance compared to digital accelerators. However, due to limitations in the manufacturing process and large-scale integration, real ReRAM-based crossbar arrays typically suffer from several significant nonideal effects, including IR-drop, stuck-at faults, and device noise. These nonideal effects degrade signal integrity and performance, particularly in the crossbar structures used to build high-density ReRAMs. Therefore, a fast and efficient software solution that can predict the effects of IR-drop without involving expensive hardware is highly desirable. In this work, addressing the main limitations of existing simulation methods, such as slow speed and high resource costs, we propose an efficient analysis of large-scale ReRAM crossbar arrays and the corresponding nonideal factors based on sparse matrix modeling. We classify nonideal factors into linear (e.g., IR-drop) and nonlinear (e.g., shot noise) categories. Linear factors are solved with supernodal sparse LU factorizations. The array-level results show that, compared to SPICE simulation, our method achieves a numerical solution accuracy of $10^{-15}$ while running $506.8\sim 1253.3\times$ faster and using $17.46\sim 42934.3\times$ less memory. For nonlinear factors, we propose two solutions based on different requirements. In one method, we obtain an approximate initial solution by solving a linear system while disregarding the nonlinear contributions and subsequently apply an extended Anderson acceleration method to solve the nonlinear equation, which is suitable for high-precision solutions. The other method simplifies the nonlinear equation into an equivalent linear form. Theoretical validation confirms the effectiveness of this method, which significantly enhances simulation speed while maintaining accuracy. Moreover, we build a high-precision ReRAM accelerator architecture with real-time compensation. Experimental results demonstrate that the proposed architecture effectively mitigates accuracy loss caused by nonideal factors.
{"title":"Real-Time Compensation Framework for Large-Scale ReRAM-Based Sparse LU Factorization","authors":"Weiran Chen;Zaitian Chen;Bei Yu;Song Chen;Yi Kang;Qi Xu","doi":"10.1109/TCAD.2025.3576332","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3576332","url":null,"abstract":"Recently, resistive switching random access memory (ReRAM)-based hardware accelerators have demonstrated unprecedented performance compared to digital accelerators. However, due to limitations in the manufacturing process and large-scale integration, several significant nonideal effects, including IR-Drop, stuck-at-fault, and device noises in real ReRAM-based crossbar arrays, are typically incurred. These nonideal effects degrade signal integrity and performance, particularly in crossbar structures used for building high-density ReRAMs. Therefore, finding a fast and efficient software solution that can predict the effects of IR-drop without involving expensive hardware is highly desirable. In this work, addressing the main limitations of existing simulation methods, such as slow speed and high-resource costs, we propose an efficient analysis of large-scale ReRAM crossbar arrays and the corresponding nonideal factors based on sparse matrix modeling. We classify nonideal factors into linear (e.g., IR-drop) and nonlinear categories (e.g., shot noise). For linear factors, super-nodal sparse LU factorizations are used to solve. The array-level results show that compared to SPICE simulation, our method achieves a numerical solution accuracy of <inline-formula> <tex-math>$10^{-15}$ </tex-math></inline-formula> with <inline-formula> <tex-math>$506.8 sim 1253.3times $ </tex-math></inline-formula> faster and <inline-formula> <tex-math>$17.46 sim 42934.3times $ </tex-math></inline-formula> reduced memory usage. For nonlinear factors, we propose two solutions based on different requirements. In one method, we obtain an approximate initial solution by solving a linear system while disregarding the nonlinear contributions and subsequently apply an extended Anderson acceleration method to solve the nonlinear equation, which is suitable for high-precision solutions. Another method simplifies the nonlinear equation into an equivalent linear form. Theoretical validation confirms the effectiveness of this method, significantly enhancing simulation speed while maintaining accuracy. Moreover, we build a high-precision ReRAM accelerator architecture with real-time compensation. Experimental results demonstrate that the proposed architecture effectively mitigates accuracy loss caused by nonideal factors.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"309-322"},"PeriodicalIF":2.9,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-04 | DOI: 10.1109/TCAD.2025.3576320
Yu Li;Biao Huang;Jinyin Hu;Cheng Zhuo
Dynamic deep neural networks, particularly multiexit networks, are increasingly recognized for their efficiency in edge-cloud scenarios. However, they are vulnerable to latency attacks that can degrade performance by increasing computation time. Current attack strategies often require white-box access to the model or lead to significant drops in inference accuracy, making them easily detectable. This article introduces SPLAT, a novel approach for executing stealthy and practical latency attacks on dynamic multiexit models under black-box conditions. SPLAT employs a two-stage mechanism: the first stage generates coarse-grained attack inputs using a functional surrogate model, while the second stage refines these perturbations through an efficient query strategy to enhance stealthiness and effectiveness. Extensive experiments validate that SPLAT significantly outperforms existing methods across various models and datasets.
{"title":"SPLAT: Revisiting Latency Attack on Dynamic Neural Networks","authors":"Yu Li;Biao Huang;Jinyin Hu;Cheng Zhuo","doi":"10.1109/TCAD.2025.3576320","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3576320","url":null,"abstract":"Dynamic deep neural networks, particularly multiexit networks, are increasingly recognized for their efficiency in edge-cloud scenarios. However, they are vulnerable to latency attacks that can degrade performance by increasing computation time. Current attack strategies often require white-box access to the model or lead to significant drops in inference accuracy, making them easily detectable. This article introduces SPLAT, a novel approach for executing stealthy and practical latency attacks on dynamic multiexit models under black-box conditions. SPLAT employs a two-stage mechanism: the first stage generates coarse-grained attack inputs using a functional surrogate model, while the second stage refines these perturbations through an efficient query strategy to enhance stealthiness and effectiveness. Extensive experiments validate that SPLAT significantly outperforms existing methods across various models and datasets.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"506-518"},"PeriodicalIF":2.9,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-04 | DOI: 10.1109/TCAD.2025.3576333
Yang Liu;Shuyang Li;Yu Li;Ruiqi Chen;Shun Li;Jun Yu;Kun Wang
Nonlinear activation plays an essential role in neural networks (NNs) for their generalization ability. However, implementing such intricate mathematical operations on hardware platforms, including field-programmable gate arrays (FPGAs), presents significant challenges. Prior works based on piecewise functions or look-up tables (LUTs) have struggled to balance precision requirements against hardware overhead and often necessitate complex manual intervention. To address these issues, this article proposes DIF-LUT Pro, an automated tool for simple yet scalable approximation of various nonlinear activations on FPGA. Specifically, the proposed algorithm achieves a self-adaptive hardware design oriented toward a target precision, using piecewise linear matching to roughly fit the function derivative and a range-addressable LUT to offset the difference. Moreover, DIF-LUT Pro integrates the algorithm into an automated tool, allowing users to configure the customized interface and generate the corresponding hardware description language (HDL) code with a single click. Experimental results show that 1) DIF-LUT Pro features robust automation and fair generality, capable of generating equitable hardware designs under various user configurations across different FPGA platforms and 2) DIF-LUT Pro produces approximations that are simple yet effective, achieving competitive performance compared to previous expert-crafted designs. Furthermore, two detailed case studies demonstrate the efficient application of DIF-LUT Pro to NeRF and SEResnet, proving its practical value. Our source code is open-source and available at https://github.com/AdrianLiu00/DIF-LUT-Tool.
{"title":"DIF-LUT Pro: An Automated Tool for Simple yet Scalable Approximation of Nonlinear Activation on FPGA","authors":"Yang Liu;Shuyang Li;Yu Li;Ruiqi Chen;Shun Li;Jun Yu;Kun Wang","doi":"10.1109/TCAD.2025.3576333","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3576333","url":null,"abstract":"Nonlinear activation plays an essential role in neural networks (NNs) for their generalization ability. However, implementing intricate mathematical operations on hardware platforms, including field-programmable gate arrays (FPGAs), presents significant challenges. Prior works based on piecewise functions or look-up table (LUT) have encountered difficulties in balancing precision requirements with fair hardware overhead and often necessitating complex manual interventions. To address these issues, this article proposes DIF-LUT Pro, an automated tool for simple yet scalable approximation for various nonlinear activations on FPGA. Specifically, the proposed algorithm achieves self-adaptive hardware design oriented toward target precision, by piecewise linear matching to fit the function derivative roughly and range addressable LUT to offset the difference. Moreover, DIF-LUT Pro integrates the algorithm into an automated tool, allowing users to configure the customized interface and generate the corresponding hardware description language (HDL) code with a single click. Experimental results show that 1) DIF-LUT Pro features robust automation and fair generality, capable of generating equitable hardware designs under various user configurations across different FPGA platforms and 2) DIF-LUT Pro produces approximations that are simple yet effective, achieving competitive performance compared to previous expert-crafted designs. Furthermore, two detailed case studies demonstrate the efficient application of DIF-LUT Pro on NeRF and SEResnet, proving its practical value. Our source code is open-source and available at <uri>https://github.com/AdrianLiu00/DIF-LUT-Tool</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"295-308"},"PeriodicalIF":2.9,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}