Let Coarse-Grained Resources Be Shared: Mapping Entire Neural Networks on FPGAs
Tzung-Han Juang, Christof Schlaak, Christophe Dubach. DOI: 10.1145/3609109

Traditional High-Level Synthesis (HLS) enables rapid prototyping of hardware accelerators without writing Hardware Description Languages (HDLs). However, this approach does not cope well with mapping large applications, such as entire deep neural networks, onto a single Field Programmable Gate Array (FPGA) device: the generated designs are inefficient or simply do not fit within the FPGA's resource constraints. This work proposes to shrink generated designs through coarse-grained resource control based on function sharing in functional Intermediate Representations (IRs). The proposed compiler passes and rewrite system produce valid design points and remove redundant hardware. These optimizations make it feasible to fit entire neural networks on an FPGA while delivering performance competitive with running a specialized kernel for each layer.
WARM-tree: Making Quadtrees Write-efficient and Space-economic on Persistent Memories
Shin-Ting Wu, Liang-Chi Chen, Po-Chun Huang, Yuan-Hao Chang, Chien-Chung Ho, Wei-Kuan Shih. DOI: 10.1145/3608033

The value of data is now widely recognized, which highlights the significance of data-centric computing across diverse application scenarios. In many cases the data are multidimensional, and managing multidimensional data poses greater challenges in supporting efficient access operations while guaranteeing space utilization. At the same time, although many index data structures have been proposed for multidimensional data management, their designs are not fully optimized for modern nonvolatile memories, in particular byte-addressable persistent memories. As a result, they may suffer serious access-performance degradation or fail to guarantee space utilization. This observation motivates redesigning index structures for multidimensional point data on modern persistent memories, such as phase-change memory. In this work, we present the WARM-tree, a multidimensional tree that reduces the write amplification effect for multidimensional point data. In our evaluation, compared to the bucket PR quadtree and the R*-tree, the WARM-tree provides worst-case space-utilization guarantees of the form $\frac{m-1}{m}$ for any $m \in \mathbb{Z}^{+}$ and reduces the write traffic of key insertions by up to 48.10% and 85.86%, respectively, at the price of degraded average space utilization and prolonged query latency. This suggests that the WARM-tree is a promising multidimensional index structure for insert-intensive workloads.
STADIA: Photonic Stochastic Gradient Descent for Neural Network Accelerators
Chengpeng Xia, Yawen Chen, Haibo Zhang, Jigang Wu. DOI: 10.1145/3607920

Deep Neural Networks (DNNs) have demonstrated great success in many fields such as image recognition and text analysis. However, the ever-increasing sizes of both DNN models and training datasets make deep learning extremely computation- and memory-intensive. Recently, photonic computing has emerged as a promising technology for accelerating DNNs. While the design of photonic accelerators for DNN inference and for the forward propagation of DNN training has been widely investigated, architectural acceleration of the equally important backpropagation phase of training has not been well studied. In this paper, we propose a novel silicon-photonic backpropagation accelerator for high-performance DNN training. Specifically, we design a general-purpose photonic gradient-descent unit, named STADIA, that implements the multiplication, accumulation, and subtraction operations required for computing gradients using mature optical devices, namely the Mach-Zehnder Interferometer (MZI) and the Microring Resonator (MRR), which significantly reduces training latency and improves the energy efficiency of backpropagation. To exploit parallel computing, we propose a STADIA-based backpropagation acceleration architecture and design a dataflow based on wavelength-division multiplexing (WDM). We analyze the precision of STADIA by quantifying the limitations imposed by optical losses and noise. Furthermore, we evaluate STADIA at different element sizes by analyzing the power, area, and time delay of photonic accelerators for DNN models such as AlexNet, VGG19, and ResNet. Simulation results show that the proposed STADIA architecture achieves improvements of 9.7× in time efficiency and 147.2× in energy efficiency over the most advanced optical-memristor-based backpropagation accelerator.
Towards Building Verifiable CPS using Lingua Franca
Shaokai Lin, Yatin A. Manerkar, Marten Lohstroh, Elizabeth Polgreen, Sheng-Jung Yu, Chadlia Jerad, Edward A. Lee, Sanjit A. Seshia. DOI: 10.1145/3609134

Formal verification of cyber-physical systems (CPS) is challenging because it must consider real-time and concurrency aspects that are often absent in ordinary software. Moreover, the software in CPS is often complex and low-level, making it hard to ensure that the formal model used for verification is a faithful representation of the actual implementation, which can undermine the value of a verification result. To address this problem, we propose a methodology for building verifiable CPS based on the principle that a formal model of the software can be derived automatically from its implementation. Our approach requires that the system implementation be specified in Lingua Franca (LF), a polyglot coordination language tailored for real-time, concurrent CPS, which we have made amenable to the specification of safety properties via annotations in the code. The program structure and deterministic semantics of LF enable the automatic construction of formal axiomatic models directly from LF programs. The generated models are checked with Bounded Model Checking (BMC) by the Uclid5 verification engine using the Z3 SMT solver. The proposed technique can check a well-defined fragment of Safety Metric Temporal Logic (Safety MTL) formulas. To ensure the completeness of BMC, we present a method for deriving an upper bound on the completeness threshold of an axiomatic model based on the semantics of LF. We implement our approach in the LF Verifier and evaluate it on a benchmark suite of 22 programs sampled from real-life applications and from benchmarks for Erlang, Lustre, actor-oriented languages, and RTOSes. The LF Verifier correctly checks 21 of the 22 programs automatically.
IOSR: Improving I/O Efficiency for Memory Swapping on Mobile Devices Via Scheduling and Reshaping
Wentong Li, Liang Shi, Hang Li, Changlong Li, Edwin Hsing-Mean Sha. DOI: 10.1145/3607923

Mobile systems and applications are becoming increasingly feature-rich and powerful, and they constantly suffer from memory pressure, especially on devices equipped with limited DRAM. Swapping inactive DRAM pages to the storage device is a promising way to extend physical memory. However, existing mobile devices usually adopt flash memory as the storage device, and swapping DRAM pages to flash memory can introduce significant performance overhead. In this paper, we first conduct an in-depth analysis of the I/O characteristics of flash-based memory swapping, including I/O interference and swap I/O randomness in the swap subsystem. We then propose IOSR, an I/O efficiency optimization framework for memory swapping, to improve the performance of flash-based swapping on mobile devices. IOSR consists of two methods: swap I/O scheduling (SIOS) and swap I/O pattern reshaping (SIOR). SIOS schedules swap I/O to reduce interference with the I/Os of other processes. SIOR reshapes the swap I/O pattern with process-oriented swap slot allocation and adaptive-granularity swap read-ahead. IOSR is implemented on a Google Pixel 4. Experimental results show that, compared to the state-of-the-art, IOSR reduces application switching time by 31.7% and improves swap-in bandwidth by 35.5% on average.
Energy-efficient Personalized Federated Search with Graph for Edge Computing
Zhao Yang, Qingshuang Sun. DOI: 10.1145/3609435

Federated Learning (FL) is a popular method for privacy-preserving machine learning on edge devices. However, the heterogeneity of edge devices, including differences in system architecture, data, and co-running applications, can significantly impact the energy efficiency of FL. To address these issues, we propose an energy-efficient personalized federated search framework with three key components. First, we search for partial models with high inference efficiency to reduce training energy consumption and the occurrence of stragglers in each round. Second, we build lightweight search controllers that steer model sampling and respond to runtime variance, mitigating new straggler issues caused by co-running applications. Finally, we design an adaptive search-update strategy based on graph aggregation to improve personalized training convergence. Our framework reduces the energy consumption of training by lowering the per-round training overhead and speeding up convergence. Experimental results show that our approach achieves accuracy improvements of up to 5.02% and energy-efficiency improvements of up to 3.45×.
LaDy: Enabling Locality-aware Deduplication Technology on Shingled Magnetic Recording Drives
Jung-Hsiu Chang, Tzu-Yu Chang, Yi-Chao Shih, Tseng-Yi Chen. DOI: 10.1145/3607921

The continuous increase in data volume has led to the adoption of shingled magnetic recording (SMR) as a primary technology for modern storage drives. SMR offers high storage density and low unit cost but introduces significant performance overheads due to read-update-write operations and the garbage collection (GC) process. Data deduplication is an effective way to reduce these overheads because it decreases the amount of data written to an SMR-based storage device. However, deduplication can harm data locality and thus degrade read performance. To tackle this problem, this study proposes LaDy, a data locality-aware deduplication technology that considers both the overhead of writing duplicate data and the impact on data locality when deciding whether duplicate data should be written. LaDy is integrated into DiskSim, an open-source simulator, which we modified to model an SMR-based drive. Experimental results show that LaDy reduces the response time in the best-case scenario by 87.3% compared with CAFTL on the SMR drive. LaDy achieves this by selectively writing duplicate data, which preserves data locality and thereby improves read performance. The proposed solution provides an effective and efficient method for mitigating the performance overhead of data deduplication on SMR-based storage devices.
iAware: Interaction Aware Task Scheduling for Reducing Resource Contention in Mobile Systems
Yongchun Zheng, Changlong Li, Yi Xiong, Weihong Liu, Cheng Ji, Zongwei Zhu, Lichen Yu. DOI: 10.1145/3609391

To ensure a good user experience on mobile systems, the foreground application can be prioritized to minimize the impact of background applications. However, this article observes that system services in the kernel and framework layers, rather than background applications, are now the major resource competitors. Specifically, these service tasks tend to be quiet when people rarely interact with the foreground application and active when interactions become frequent; this high overlap of busy periods leads to resource contention. This article proposes iAware, an interaction-aware task scheduling framework for mobile systems. The key insight is to exploit previously ignored idle periods and schedule service tasks to run within them. iAware quantifies interaction characteristics based on screen touch events and staggers service-task execution away from periods of frequent user interaction. With iAware, service tasks tend to run when few interactions occur, for example when the device's screen is turned off, instead of when the user is actively interacting with it. iAware is implemented on real smartphones. Experimental results show that the user experience is significantly improved: compared to the state-of-the-art, application launch speed and frame rate are enhanced by 38.89% and 7.97%, respectively, with no more than 1% additional battery consumption.
SpikeHard: Efficiency-Driven Neuromorphic Hardware for Heterogeneous Systems-on-Chip
Judicael Clair, Guy Eichler, Luca P. Carloni. DOI: 10.1145/3609101

Neuromorphic computing is an emerging field with the potential to offer performance and energy-efficiency gains over traditional machine learning approaches. Most neuromorphic hardware, however, has been designed with little concern for the problem of integrating it with other components in a heterogeneous System-on-Chip (SoC). Building on a state-of-the-art reconfigurable neuromorphic architecture, we present the design of a neuromorphic hardware accelerator equipped with a programmable interface that simplifies both integration into an SoC and communication with the processor on the SoC. To optimize the allocation of on-chip resources, we develop an optimizer that restructures existing neuromorphic models for a given hardware architecture, and we perform design-space exploration to find highly efficient implementations. We conduct experiments with various FPGA-based prototypes of many-accelerator SoCs in which Linux-based applications running on a RISC-V processor invoke Pareto-optimal implementations of our accelerator alongside third-party accelerators. These experiments demonstrate that our neuromorphic hardware, which is up to 89× faster and 170× more energy efficient after applying our optimizer, can be used in synergy with other accelerators for different application purposes.
Predictable GPU Wavefront Splitting for Safety-Critical Systems
Artem Klashtorny, Zhuanhao Wu, Anirudh Mohan Kaushik, Hiren Patel. DOI: 10.1145/3609102

We present a predictable wavefront splitting (PWS) technique for graphics processing units (GPUs). PWS improves the performance of GPU applications by reducing the impact of branch divergence while ensuring that worst-case execution time (WCET) estimates can be computed. This makes PWS well suited to safety-critical domains, such as autonomous driving, avionics, and space, that require strict temporal guarantees. In developing PWS on an AMD-based GPU, we propose microarchitectural enhancements to the GPU and a compiler pass that eliminates branch serializations to reduce the WCET of a wavefront. Our analysis shows that PWS achieves an 11% performance improvement over existing architectures with a lower WCET than prior work on wavefront splitting.