
ACM Transactions on Embedded Computing Systems: Latest Publications

Let Coarse-Grained Resources Be Shared: Mapping Entire Neural Networks on FPGAs
CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2023-09-09. DOI: 10.1145/3609109
Tzung-Han Juang, Christof Schlaak, Christophe Dubach
Traditional High-Level Synthesis (HLS) enables rapid prototyping of hardware accelerators without coding in Hardware Description Languages (HDLs). However, this approach does not adequately support mapping large applications, such as entire deep neural networks, onto a single Field Programmable Gate Array (FPGA) device: it leads to designs that are inefficient or that do not fit into the FPGA due to resource constraints. This work proposes to shrink generated designs through coarse-grained resource control based on function sharing in functional Intermediate Representations (IRs). The proposed compiler passes and rewrite system aim to produce valid design points and remove redundant hardware. These optimizations make fitting entire neural networks on FPGAs feasible and yield performance competitive with running specialized kernels for each layer.
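As a rough illustration of the coarse-grained sharing idea described in this abstract (this is a toy sketch, not the authors' compiler; the operator names and area numbers are invented), identical operator instances can be merged into one hardware unit that is time-multiplexed across call sites, shrinking the total area of the generated design:

```python
# Toy model: estimate FPGA "area" of a design lowered from a functional IR,
# with and without coarse-grained function sharing.
from collections import Counter

def estimate_area(calls, area_per_op, shared=False):
    """Sum operator areas; with sharing, each distinct operator is
    instantiated once and reused by all of its call sites."""
    counts = Counter(calls)
    if shared:
        return sum(area_per_op[op] for op in counts)             # one unit per op
    return sum(area_per_op[op] * n for op, n in counts.items())  # one per call

# A three-layer toy network lowered to repeated matmul/relu calls.
calls = ["matmul", "relu", "matmul", "relu", "matmul"]
area = {"matmul": 1000, "relu": 10}

print(estimate_area(calls, area))               # 3020: fully unrolled
print(estimate_area(calls, area, shared=True))  # 1010: matmul and relu shared
```

The trade-off the paper's rewrite system navigates is that the shared design is smaller but serializes the call sites, so sharing must be applied only where the resulting design point remains valid and performant.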
Cited by: 0
WARM-tree: Making Quadtrees Write-efficient and Space-economic on Persistent Memories
CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2023-09-09. DOI: 10.1145/3608033
Shin-Ting Wu, Liang-Chi Chen, Po-Chun Huang, Yuan-Hao Chang, Chien-Chung Ho, Wei-Kuan Shih
Recently, the value of data has been widely recognized, which highlights the significance of data-centric computing in diversified application scenarios. In many cases the data are multidimensional, and managing multidimensional data often poses greater challenges in supporting efficient data access operations and guaranteeing space utilization. On the other hand, while many existing index data structures have been proposed for multidimensional data management, their designs are not fully optimized for modern nonvolatile memories, in particular byte-addressable persistent memories. As a result, they may suffer serious access performance degradation or fail to guarantee space utilization. This observation motivates redesigning index data structures for multidimensional point data on modern persistent memories, such as phase-change memory. In this work, we present the WARM-tree, a multidimensional tree for reducing the write amplification effect for multidimensional point data. In our evaluation studies, compared to the bucket PR quadtree and the R*-tree, the WARM-tree can provide worst-case space utilization guarantees of the form \(\frac{m-1}{m}\) (m ∈ ℤ⁺) and effectively reduces the write traffic of key insertions by up to 48.10% and 85.86%, respectively, at the price of degraded average space utilization and prolonged query latency. This suggests that the WARM-tree is a promising multidimensional index structure for insert-intensive workloads.
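To make the shape of the \(\frac{m-1}{m}\) guarantee concrete (a hedged illustration only; the paper's node layout and the role of m are not reproduced here), the bound is a tunable fraction that approaches full utilization as the positive integer parameter m grows:

```python
# Worst-case space utilization bound of the form (m-1)/m, m a positive integer.
from fractions import Fraction

def utilization_bound(m: int) -> Fraction:
    """Guaranteed fraction of usable space for a given parameter m."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return Fraction(m - 1, m)

# Larger m trades other costs for a tighter utilization guarantee.
print([float(utilization_bound(m)) for m in (2, 4, 10)])  # [0.5, 0.75, 0.9]
```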
Cited by: 0
STADIA: Photonic Stochastic Gradient Descent for Neural Network Accelerators
CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2023-09-09. DOI: 10.1145/3607920
Chengpeng Xia, Yawen Chen, Haibo Zhang, Jigang Wu
Deep Neural Networks (DNNs) have demonstrated great success in many fields such as image recognition and text analysis. However, the ever-increasing sizes of both DNN models and training datasets make deep learning extremely computation- and memory-intensive. Recently, photonic computing has emerged as a promising technology for accelerating DNNs. While the design of photonic accelerators for DNN inference and for the forward propagation of DNN training has been widely investigated, architectural acceleration of the equally important backpropagation phase of DNN training has not been well studied. In this paper, we propose a novel silicon photonic-based backpropagation accelerator for high-performance DNN training. Specifically, a general-purpose photonic gradient descent unit named STADIA is designed to implement the multiplication, accumulation, and subtraction operations required for computing gradients using mature optical devices, including the Mach-Zehnder Interferometer (MZI) and the Microring Resonator (MRR), which can significantly reduce training latency and improve the energy efficiency of backpropagation. To demonstrate efficient parallel computing, we propose a STADIA-based backpropagation acceleration architecture and design a dataflow using wavelength-division multiplexing (WDM). We analyze the precision of STADIA by quantifying the precision limitations imposed by losses and noise. Furthermore, we evaluate STADIA with different element sizes by analyzing the power, area, and time delay of photonic accelerators based on DNN models such as AlexNet, VGG19, and ResNet. Simulation results show that the proposed STADIA architecture achieves significant improvements of 9.7× in time efficiency and 147.2× in energy efficiency, compared with the most advanced optical-memristor-based backpropagation accelerator.
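As a purely numeric sketch of the WDM parallelism mentioned in this abstract (an assumption for illustration; this is not the paper's device model, and losses/noise are ignored), each wavelength channel can carry one operand pair and a photodetector integrates all channels into a single accumulated value:

```python
# Toy model of a WDM multiply-accumulate: one product per wavelength channel,
# summed at the photodetector.
def wdm_mac(activations, weights):
    """Dot product computed "in parallel" across wavelength channels."""
    assert len(activations) == len(weights)
    channels = [a * w for a, w in zip(activations, weights)]  # per-wavelength products
    return sum(channels)                                      # photodetector summation

print(wdm_mac([1.0, 2.0, 3.0], [0.5, -1.0, 2.0]))  # 4.5
```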
Cited by: 0
Towards Building Verifiable CPS using Lingua Franca
CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2023-09-09. DOI: 10.1145/3609134
Shaokai Lin, Yatin A. Manerkar, Marten Lohstroh, Elizabeth Polgreen, Sheng-Jung Yu, Chadlia Jerad, Edward A. Lee, Sanjit A. Seshia
Formal verification of cyber-physical systems (CPS) is challenging because it must consider real-time and concurrency aspects that are often absent in ordinary software. Moreover, the software in CPS is often complex and low-level, making it hard to assure that a formal model of the system used for verification is a faithful representation of the actual implementation, which can undermine the value of a verification result. To address this problem, we propose a methodology for building verifiable CPS based on the principle that a formal model of the software can be derived automatically from its implementation. Our approach requires that the system implementation is specified in Lingua Franca (LF), a polyglot coordination language tailored for real-time, concurrent CPS, which we made amenable to the specification of safety properties via annotations in the code. The program structure and the deterministic semantics of LF enable automatic construction of formal axiomatic models directly from LF programs. The generated models are automatically checked using Bounded Model Checking (BMC) by the verification engine Uclid5 with the Z3 SMT solver. The proposed technique enables checking a well-defined fragment of Safety Metric Temporal Logic (Safety MTL) formulas. To ensure the completeness of BMC, we present a method to derive an upper bound on the completeness threshold of an axiomatic model based on the semantics of LF. We implement our approach in the LF Verifier and evaluate it using a benchmark suite with 22 programs sampled from real-life applications and benchmarks for Erlang, Lustre, actor-oriented languages, and RTOSes. The LF Verifier correctly checks 21 out of 22 programs automatically.
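For readers unfamiliar with bounded model checking, the core loop can be sketched generically (this is an explicit-state toy, not Uclid5/Z3's symbolic encoding, and the transition system is invented): explore all executions of a system up to depth k and report the first reachable state that violates a safety invariant.

```python
# Minimal explicit-state bounded model checking: breadth-first unrolling of a
# transition system up to a fixed bound k.
def bmc(init, step, invariant, k):
    """Return (depth, state) for the first invariant violation within k steps,
    or None if the system is safe up to the bound."""
    frontier = [init]
    for depth in range(k + 1):
        for s in frontier:
            if not invariant(s):
                return depth, s                     # counterexample found
        frontier = [t for s in frontier for t in step(s)]
    return None                                     # safe up to bound k

# Toy counter that must stay below 3; it violates the invariant at depth 3.
print(bmc(0, lambda s: [s + 1], lambda s: s < 3, k=5))    # (3, 3)
print(bmc(0, lambda s: [s + 1], lambda s: s < 100, k=5))  # None
```

The completeness-threshold result in the abstract addresses exactly the weakness this sketch exposes: a `None` answer only certifies safety up to k, so one needs a bound beyond which no new behaviors can appear.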
Cited by: 0
IOSR: Improving I/O Efficiency for Memory Swapping on Mobile Devices Via Scheduling and Reshaping
CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2023-09-09. DOI: 10.1145/3607923
Wentong Li, Liang Shi, Hang Li, Changlong Li, Edwin Hsing-Mean Sha
Mobile systems and applications are becoming increasingly feature-rich and powerful, and they constantly suffer from memory pressure, especially on devices equipped with limited DRAM. Swapping inactive DRAM pages to the storage device is a promising solution for extending physical memory. However, existing mobile devices usually adopt flash memory as the storage device, where swapping DRAM pages to flash memory may introduce significant performance overhead. In this paper, we first conduct an in-depth analysis of the I/O characteristics of flash-based memory swapping, including I/O interference and swap I/O randomness in the swap subsystem. We then propose IOSR, an I/O efficiency optimization framework for memory swapping, to enhance the performance of flash-based memory swapping on mobile devices. IOSR consists of two methods: swap I/O scheduling (SIOS) and swap I/O pattern reshaping (SIOR). SIOS schedules swap I/O to reduce interference with the I/Os of other processes. SIOR reshapes the swap I/O pattern through process-oriented swap slot allocation and adaptive-granularity swap read-ahead. IOSR is implemented on a Google Pixel 4. Experimental results show that IOSR reduces application switching time by 31.7% and improves swap-in bandwidth by 35.5% on average compared to the state-of-the-art.
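The interference-reduction idea behind SIOS can be sketched as a two-queue scheduler (names and policy details are illustrative assumptions, not IOSR's actual API): swap traffic yields to foreground I/O and drains only in idle gaps.

```python
# Hedged sketch: prioritize user-visible I/O over swap I/O to avoid
# interference on the shared flash device.
from collections import deque

class SwapScheduler:
    def __init__(self):
        self.foreground = deque()
        self.swap = deque()

    def submit(self, req, is_swap):
        (self.swap if is_swap else self.foreground).append(req)

    def next_request(self):
        if self.foreground:          # user-visible I/O always dispatches first
            return self.foreground.popleft()
        if self.swap:                # swap traffic fills the idle gaps
            return self.swap.popleft()
        return None

s = SwapScheduler()
s.submit("swap-out page 7", is_swap=True)
s.submit("read app.db", is_swap=False)
print(s.next_request())  # read app.db
print(s.next_request())  # swap-out page 7
```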
Cited by: 0
Energy-efficient Personalized Federated Search with Graph for Edge Computing
CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2023-09-09. DOI: 10.1145/3609435
Zhao Yang, Qingshuang Sun
Federated Learning (FL) is a popular method for privacy-preserving machine learning on edge devices. However, the heterogeneity of edge devices, including differences in system architecture, data, and co-running applications, can significantly impact the energy efficiency of FL. To address these issues, we propose an energy-efficient personalized federated search framework with three key components. First, we search for partial models with high inference efficiency to reduce training energy consumption and the occurrence of stragglers in each round. Second, we build lightweight search controllers that control model sampling and respond to runtime variance, mitigating new straggler issues caused by co-running applications. Finally, we design an adaptive search update strategy based on graph aggregation to improve personalized training convergence. Our framework reduces the energy consumption of the training process by lowering the per-round training overhead and speeding up the convergence rate. Experimental results show that our approach achieves up to a 5.02% accuracy improvement and a 3.45× energy efficiency improvement.
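One way to picture graph-based personalized aggregation (a hedged sketch under invented data; the paper's update strategy is adaptive and more involved) is that each client averages model weights with its graph neighbors instead of taking one global average:

```python
# Toy personalized aggregation over a client similarity graph.
def personalized_aggregate(weights, adjacency, client):
    """weights: {client: [float]}; adjacency: {client: [neighbor clients]}.
    Returns the element-wise mean over the client and its neighbors."""
    group = [client] + adjacency[client]
    n = len(group)
    dim = len(weights[client])
    return [sum(weights[c][i] for c in group) / n for i in range(dim)]

w = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [4.0, 4.0]}
adj = {"a": ["b"], "b": ["a"], "c": []}  # "c" has no similar clients
print(personalized_aggregate(w, adj, "a"))  # [0.5, 0.5]
print(personalized_aggregate(w, adj, "c"))  # [4.0, 4.0]: keeps its own model
```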
Cited by: 0
LaDy: Enabling Locality-aware Deduplication Technology on Shingled Magnetic Recording Drives
CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2023-09-09. DOI: 10.1145/3607921
Jung-Hsiu Chang, Tzu-Yu Chang, Yi-Chao Shih, Tseng-Yi Chen
The continuous increase in data volume has led to the adoption of shingled magnetic recording (SMR) as a primary technology for modern storage drives. This technology offers high storage density and low unit cost but introduces significant performance overheads due to read-update-write operations and the garbage collection (GC) process. To reduce these overheads, data deduplication has been identified as an effective solution, as it reduces the amount of data written to an SMR-based storage device. However, deduplication can result in poor data locality, leading to decreased read performance. To tackle this problem, this study proposes LaDy, a data locality-aware deduplication technology that considers both the overhead of writing duplicate data and the impact on data locality to determine whether duplicate data should be written. LaDy integrates with DiskSim, an open-source project, modified to simulate an SMR-based drive. The experimental results demonstrate that LaDy can reduce response time in the best-case scenario by 87.3% compared with CAFTL on the SMR drive. LaDy achieves this by selectively writing duplicate data, which preserves data locality and thereby improves read performance. The proposed solution provides an effective and efficient method for mitigating the performance overheads associated with data deduplication in SMR-based storage devices.
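The selective-write decision can be caricatured as a simple cost comparison (an assumption for illustration; the function name, threshold, and weights are invented, and the paper's cost model is more detailed): write a duplicate chunk anyway when the read-locality penalty of pointing at a distant existing copy outweighs the cost of one extra SMR write.

```python
# Hedged sketch of a locality-aware deduplication decision rule.
def should_write_duplicate(seek_distance_blocks, write_cost, locality_weight=0.01):
    """True if preserving locality (writing the duplicate near its neighbors)
    is cheaper overall than deduplicating to a far-away copy."""
    locality_penalty = locality_weight * seek_distance_blocks
    return locality_penalty > write_cost

# Far-away copy: accept the duplicate write to keep reads sequential.
print(should_write_duplicate(seek_distance_blocks=50_000, write_cost=100))  # True
# Nearby copy: deduplicate as usual.
print(should_write_duplicate(seek_distance_blocks=1_000, write_cost=100))   # False
```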
Cited by: 0
iAware: Interaction Aware Task Scheduling for Reducing Resource Contention in Mobile Systems
Computer Science (CAS Tier 3) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609391
Yongchun Zheng, Changlong Li, Yi Xiong, Weihong Liu, Cheng Ji, Zongwei Zhu, Lichen Yu
To ensure the user experience of mobile systems, the foreground application can be differentiated to minimize the impact of background applications. However, this article observes that system services in the kernel and framework layer, rather than background applications, are now the major resource competitors. Specifically, these service tasks tend to be quiet when people rarely interact with the foreground application and active when interactions become frequent, and this high overlap of busy times leads to contention for resources. This article proposes iAware, an interaction-aware task scheduling framework for mobile systems. The key insight is to make use of the previously ignored idle period and schedule service tasks to run during it. iAware quantifies interaction characteristics based on screen touch events and successfully staggers the periods of frequent user interaction. With iAware, service tasks tend to run when few interactions occur, for example, when the device's screen is turned off, instead of when the user is frequently interacting with it. iAware is implemented on real smartphones. Experimental results show that the user experience is significantly improved with iAware. Compared to the state-of-the-art, the application launching speed and frame rate are enhanced by 38.89% and 7.97%, respectively, with no more than 1% additional battery consumption.
{"title":"iAware: Interaction Aware Task Scheduling for Reducing Resource Contention in Mobile Systems","authors":"Yongchun Zheng, Changlong Li, Yi Xiong, Weihong Liu, Cheng Ji, Zongwei Zhu, Lichen Yu","doi":"10.1145/3609391","DOIUrl":"https://doi.org/10.1145/3609391","url":null,"abstract":"To ensure the user experience of mobile systems, the foreground application can be differentiated to minimize the impact of background applications. However, this article observes that system services in the kernel and framework layer, instead of background applications, are now the major resource competitors. Specifically, these service tasks tend to be quiet when people rarely interact with the foreground application and active when interactions become frequent, and this high overlap of busy times leads to contention for resources. This article proposes iAware, an interaction-aware task scheduling framework in mobile systems. The key insight is to make use of the previously ignored idle period and schedule service tasks to run at that period. iAware quantify the interaction characteristic based on the screen touch event, and successfully stagger the periods of frequent user interactions. With iAware, service tasks tend to run when few interactions occur, for example, when the device’s screen is turned off, instead of when the user is frequently interacting with it. iAware is implemented on real smartphones. Experimental results show that the user experience is significantly improved with iAware. 
Compared to the state-of-the-art, the application launching speed and frame rate are enhanced by 38.89% and 7.97% separately, with no more than 1% additional battery consumption.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
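The core idea in the iAware abstract, gating background service tasks on the recent rate of user interaction, can be illustrated with a small scheduler that counts touch events in a sliding window and only admits service work during quiet periods or when the screen is off. This is a hedged sketch of the concept only; the window length, busy threshold, and screen-state rule are made-up parameters, not iAware's implementation.

```python
from collections import deque


class InteractionAwareScheduler:
    """Defer background service tasks while the user is interacting.

    Interaction intensity is estimated from the rate of recent screen
    touch events; service tasks run only during quiet (idle) periods.
    """

    def __init__(self, window_s=10.0, busy_threshold=3):
        self.window_s = window_s              # sliding window length (seconds)
        self.busy_threshold = busy_threshold  # touches/window that mean "busy"
        self.touches = deque()                # timestamps of recent touches

    def record_touch(self, ts):
        self.touches.append(ts)

    def _recent_touches(self, now):
        # Drop touch events that have aged out of the window.
        while self.touches and now - self.touches[0] > self.window_s:
            self.touches.popleft()
        return len(self.touches)

    def may_run_service_task(self, now, screen_on=True):
        # Screen off is the strongest idle signal: always allow service work.
        if not screen_on:
            return True
        return self._recent_touches(now) < self.busy_threshold
```

With three touches in the last few seconds the scheduler holds service tasks back; once the touches age out of the window, or the screen turns off, the same tasks are admitted.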
Citations: 0
SpikeHard: Efficiency-Driven Neuromorphic Hardware for Heterogeneous Systems-on-Chip
Computer Science (CAS Tier 3) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609101
Judicael Clair, Guy Eichler, Luca P. Carloni
Neuromorphic computing is an emerging field with the potential to offer performance and energy-efficiency gains over traditional machine learning approaches. Most neuromorphic hardware, however, has been designed with limited concern for the problem of integrating it with other components in a heterogeneous System-on-Chip (SoC). Building on a state-of-the-art reconfigurable neuromorphic architecture, we present the design of a neuromorphic hardware accelerator equipped with a programmable interface that simplifies both the integration into an SoC and communication with the processor present on the SoC. To optimize the allocation of on-chip resources, we develop an optimizer to restructure existing neuromorphic models for a given hardware architecture, and perform design-space exploration to find highly efficient implementations. We conduct experiments with various FPGA-based prototypes of many-accelerator SoCs, where Linux-based applications running on a RISC-V processor invoke Pareto-optimal implementations of our accelerator alongside third-party accelerators. These experiments demonstrate that our neuromorphic hardware, which is up to 89× faster and 170× more energy efficient after applying our optimizer, can be used in synergy with other accelerators for different application purposes.
{"title":"SpikeHard: Efficiency-Driven Neuromorphic Hardware for Heterogeneous Systems-on-Chip","authors":"Judicael Clair, Guy Eichler, Luca P. Carloni","doi":"10.1145/3609101","DOIUrl":"https://doi.org/10.1145/3609101","url":null,"abstract":"Neuromorphic computing is an emerging field with the potential to offer performance and energy-efficiency gains over traditional machine learning approaches. Most neuromorphic hardware, however, has been designed with limited concerns to the problem of integrating it with other components in a heterogeneous System-on-Chip (SoC). Building on a state-of-the-art reconfigurable neuromorphic architecture, we present the design of a neuromorphic hardware accelerator equipped with a programmable interface that simplifies both the integration into an SoC and communication with the processor present on the SoC. To optimize the allocation of on-chip resources, we develop an optimizer to restructure existing neuromorphic models for a given hardware architecture, and perform design-space exploration to find highly efficient implementations. We conduct experiments with various FPGA-based prototypes of many-accelerator SoCs, where Linux-based applications running on a RISC-V processor invoke Pareto-optimal implementations of our accelerator alongside third-party accelerators. 
These experiments demonstrate that our neuromorphic hardware, which is up to 89× faster and 170× more energy efficient after applying our optimizer, can be used in synergy with other accelerators for different application purposes.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
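The design-space exploration in the SpikeHard abstract ultimately selects Pareto-optimal implementations, i.e., configurations that no other configuration beats on both metrics at once. A minimal sketch of that selection step follows, assuming two minimized metrics (latency, energy) per design point; this is a generic Pareto filter for illustration, not the paper's optimizer.

```python
def pareto_front(designs):
    """Return the Pareto-optimal subset of (latency, energy) design points.

    A design is kept if no other design is at least as good on both
    metrics and strictly better on one (lower is better for both).
    """
    front = []
    for d in designs:
        dominated = any(
            o != d
            and o[0] <= d[0] and o[1] <= d[1]
            and (o[0] < d[0] or o[1] < d[1])
            for o in designs
        )
        if not dominated:
            front.append(d)
    return front
```

For example, among the points (10, 5), (8, 7), (12, 4), and (11, 6), the last is dominated by (10, 5) and drops out, while the other three trade latency against energy and all survive.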
Citations: 0
Predictable GPU Wavefront Splitting for Safety-Critical Systems
Computer Science (CAS Tier 3) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2023-09-09 DOI: 10.1145/3609102
Artem Klashtorny, Zhuanhao Wu, Anirudh Mohan Kaushik, Hiren Patel
We present a predictable wavefront splitting (PWS) technique for graphics processing units (GPUs). PWS improves the performance of GPU applications by reducing the impact of branch divergence while ensuring that worst-case execution time (WCET) estimates can be computed. This makes PWS an appropriate technique to use in safety-critical applications, such as autonomous driving systems, avionics, and space, that require strict temporal guarantees. In developing PWS on an AMD-based GPU, we propose microarchitectural enhancements to the GPU, and a compiler pass that eliminates branch serializations to reduce the WCET of a wavefront. Our analysis of PWS exhibits a performance improvement of 11% over existing architectures with a lower WCET than prior works in wavefront splitting.
{"title":"Predictable GPU Wavefront Splitting for Safety-Critical Systems","authors":"Artem Klashtorny, Zhuanhao Wu, Anirudh Mohan Kaushik, Hiren Patel","doi":"10.1145/3609102","DOIUrl":"https://doi.org/10.1145/3609102","url":null,"abstract":"We present a predictable wavefront splitting (PWS) technique for graphics processing units (GPUs). PWS improves the performance of GPU applications by reducing the impact of branch divergence while ensuring that worst-case execution time (WCET) estimates can be computed. This makes PWS an appropriate technique to use in safety-critical applications, such as autonomous driving systems, avionics, and space, that require strict temporal guarantees. In developing PWS on an AMD-based GPU, we propose microarchitectural enhancements to the GPU, and a compiler pass that eliminates branch serializations to reduce the WCET of a wavefront. Our analysis of PWS exhibits a performance improvement of 11% over existing architectures with a lower WCET than prior works in wavefront splitting.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0