Let Coarse-Grained Resources Be Shared: Mapping Entire Neural Networks on FPGAs
Tzung-Han Juang, Christof Schlaak, Christophe Dubach. DOI: 10.1145/3609109

Traditional High-Level Synthesis (HLS) enables rapid prototyping of hardware accelerators without writing Hardware Description Languages (HDLs). However, this approach does not cope well with mapping large applications, such as entire deep neural networks, onto a single Field Programmable Gate Array (FPGA) device: the generated designs are inefficient or simply do not fit within the FPGA's resource constraints. This work proposes to shrink generated designs through coarse-grained resource control based on function sharing in functional Intermediate Representations (IRs). The proposed compiler passes and rewrite system produce valid design points and remove redundant hardware. These optimizations make it feasible to fit entire neural networks on an FPGA while delivering performance competitive with running a specialized kernel for each layer.
WARM-tree: Making Quadtrees Write-efficient and Space-economic on Persistent Memories
Shin-Ting Wu, Liang-Chi Chen, Po-Chun Huang, Yuan-Hao Chang, Chien-Chung Ho, Wei-Kuan Shih. DOI: 10.1145/3608033

The value of data is now widely recognized, which highlights the significance of data-centric computing across diverse application scenarios. In many cases the data are multidimensional, and managing multidimensional data poses greater challenges in supporting efficient access operations while guaranteeing space utilization. At the same time, although many index data structures have been proposed for multidimensional data management, their designs are not fully optimized for modern nonvolatile memories, in particular byte-addressable persistent memories. As a result, they may suffer serious access-performance degradation or fail to guarantee space utilization. This observation motivates redesigning index structures for multidimensional point data on modern persistent memories, such as phase-change memory. In this work, we present the WARM-tree, a multidimensional tree that reduces the write amplification effect for multidimensional point data. In our evaluation, compared to the bucket PR quadtree and the R*-tree, the WARM-tree provides worst-case space-utilization guarantees of the form $\frac{m-1}{m}$ for any $m \in \mathbb{Z}^{+}$ and reduces the write traffic of key insertions by up to 48.10% and 85.86%, respectively, at the price of degraded average space utilization and prolonged query latency. This suggests that the WARM-tree is a promising multidimensional index structure for insert-intensive workloads.
STADIA: Photonic Stochastic Gradient Descent for Neural Network Accelerators
Chengpeng Xia, Yawen Chen, Haibo Zhang, Jigang Wu. DOI: 10.1145/3607920

Deep Neural Networks (DNNs) have demonstrated great success in many fields such as image recognition and text analysis. However, the ever-increasing sizes of both DNN models and training datasets make deep learning extremely computation- and memory-intensive. Recently, photonic computing has emerged as a promising technology for accelerating DNNs. While the design of photonic accelerators for DNN inference and for the forward propagation of DNN training has been widely investigated, architectural acceleration of the equally important backpropagation phase of training has not been well studied. In this paper, we propose a novel silicon-photonic backpropagation accelerator for high-performance DNN training. Specifically, we design a general-purpose photonic gradient-descent unit, named STADIA, that implements the multiplication, accumulation, and subtraction operations required for computing gradients using mature optical devices, namely the Mach-Zehnder Interferometer (MZI) and the Microring Resonator (MRR), which significantly reduces training latency and improves the energy efficiency of backpropagation. To exploit parallel computing, we propose a STADIA-based backpropagation acceleration architecture and design a dataflow based on wavelength-division multiplexing (WDM). We analyze the precision of STADIA by quantifying the limitations imposed by optical losses and noise. Furthermore, we evaluate STADIA at different element sizes by analyzing the power, area, and time delay of photonic accelerators for DNN models such as AlexNet, VGG19, and ResNet. Simulation results show that the proposed STADIA architecture achieves improvements of 9.7× in time efficiency and 147.2× in energy efficiency over the most advanced optical-memristor-based backpropagation accelerator.
Towards Building Verifiable CPS using Lingua Franca
Shaokai Lin, Yatin A. Manerkar, Marten Lohstroh, Elizabeth Polgreen, Sheng-Jung Yu, Chadlia Jerad, Edward A. Lee, Sanjit A. Seshia. DOI: 10.1145/3609134

Formal verification of cyber-physical systems (CPS) is challenging because it must consider real-time and concurrency aspects that are often absent in ordinary software. Moreover, the software in CPS is often complex and low-level, making it hard to ensure that the formal model used for verification is a faithful representation of the actual implementation, which can undermine the value of a verification result. To address this problem, we propose a methodology for building verifiable CPS based on the principle that a formal model of the software can be derived automatically from its implementation. Our approach requires that the system implementation be specified in Lingua Franca (LF), a polyglot coordination language tailored for real-time, concurrent CPS, which we have made amenable to the specification of safety properties via annotations in the code. The program structure and deterministic semantics of LF enable the automatic construction of formal axiomatic models directly from LF programs. The generated models are checked with Bounded Model Checking (BMC) by the Uclid5 verification engine using the Z3 SMT solver. The proposed technique can check a well-defined fragment of Safety Metric Temporal Logic (Safety MTL) formulas. To ensure the completeness of BMC, we present a method for deriving an upper bound on the completeness threshold of an axiomatic model based on the semantics of LF. We implement our approach in the LF Verifier and evaluate it on a benchmark suite of 22 programs sampled from real-life applications and from benchmarks for Erlang, Lustre, actor-oriented languages, and RTOSes. The LF Verifier correctly checks 21 of the 22 programs automatically.
IOSR: Improving I/O Efficiency for Memory Swapping on Mobile Devices Via Scheduling and Reshaping
Wentong Li, Liang Shi, Hang Li, Changlong Li, Edwin Hsing-Mean Sha. DOI: 10.1145/3607923

Mobile systems and applications are becoming increasingly feature-rich and powerful, and they constantly suffer from memory pressure, especially on devices equipped with limited DRAM. Swapping inactive DRAM pages to the storage device is a promising way to extend physical memory. However, existing mobile devices usually adopt flash memory as the storage device, and swapping DRAM pages to flash memory can introduce significant performance overhead. In this paper, we first conduct an in-depth analysis of the I/O characteristics of flash-based memory swapping, including I/O interference and swap I/O randomness in the swap subsystem. We then propose IOSR, an I/O efficiency optimization framework for memory swapping, to improve the performance of flash-based swapping on mobile devices. IOSR consists of two methods: swap I/O scheduling (SIOS) and swap I/O pattern reshaping (SIOR). SIOS schedules swap I/O to reduce interference with the I/Os of other processes. SIOR reshapes the swap I/O pattern with process-oriented swap slot allocation and adaptive-granularity swap read-ahead. IOSR is implemented on a Google Pixel 4. Experimental results show that, compared to the state-of-the-art, IOSR reduces application switching time by 31.7% and improves swap-in bandwidth by 35.5% on average.
Energy-efficient Personalized Federated Search with Graph for Edge Computing
Zhao Yang, Qingshuang Sun. DOI: 10.1145/3609435

Federated Learning (FL) is a popular method for privacy-preserving machine learning on edge devices. However, the heterogeneity of edge devices, including differences in system architecture, data, and co-running applications, can significantly impact the energy efficiency of FL. To address these issues, we propose an energy-efficient personalized federated search framework with three key components. First, we search for partial models with high inference efficiency to reduce training energy consumption and the occurrence of stragglers in each round. Second, we build lightweight search controllers that steer model sampling and respond to runtime variance, mitigating new straggler issues caused by co-running applications. Finally, we design an adaptive search-update strategy based on graph aggregation to improve personalized training convergence. Our framework reduces the energy consumption of training by lowering the per-round training overhead and speeding up convergence. Experimental results show that our approach achieves accuracy improvements of up to 5.02% and energy-efficiency improvements of up to 3.45×.
LaDy: Enabling Locality-aware Deduplication Technology on Shingled Magnetic Recording Drives
Jung-Hsiu Chang, Tzu-Yu Chang, Yi-Chao Shih, Tseng-Yi Chen. DOI: 10.1145/3607921

The continuous increase in data volume has led to the adoption of shingled magnetic recording (SMR) as a primary technology for modern storage drives. SMR offers high storage density and low unit cost but introduces significant performance overheads due to read-update-write operations and the garbage collection (GC) process. Data deduplication is an effective way to reduce these overheads because it decreases the amount of data written to an SMR-based storage device. However, deduplication can harm data locality and thus degrade read performance. To tackle this problem, this study proposes LaDy, a data locality-aware deduplication technology that considers both the overhead of writing duplicate data and the impact on data locality when deciding whether duplicate data should be written. LaDy is integrated into DiskSim, an open-source simulator, which we modified to model an SMR-based drive. Experimental results show that LaDy reduces the response time in the best-case scenario by 87.3% compared with CAFTL on the SMR drive. LaDy achieves this by selectively writing duplicate data, which preserves data locality and thereby improves read performance. The proposed solution provides an effective and efficient method for mitigating the performance overhead of data deduplication on SMR-based storage devices.
iAware: Interaction Aware Task Scheduling for Reducing Resource Contention in Mobile Systems
Yongchun Zheng, Changlong Li, Yi Xiong, Weihong Liu, Cheng Ji, Zongwei Zhu, Lichen Yu. DOI: 10.1145/3609391

To ensure a good user experience on mobile systems, the foreground application can be prioritized to minimize the impact of background applications. However, this article observes that system services in the kernel and framework layers, rather than background applications, are now the major resource competitors. Specifically, these service tasks tend to be quiet when people rarely interact with the foreground application and active when interactions become frequent; this high overlap of busy periods leads to resource contention. This article proposes iAware, an interaction-aware task scheduling framework for mobile systems. The key insight is to exploit previously ignored idle periods and schedule service tasks to run within them. iAware quantifies interaction characteristics based on screen touch events and staggers service-task execution away from periods of frequent user interaction. With iAware, service tasks tend to run when few interactions occur, for example when the device's screen is turned off, instead of when the user is actively interacting with it. iAware is implemented on real smartphones. Experimental results show that the user experience is significantly improved: compared to the state-of-the-art, application launch speed and frame rate are enhanced by 38.89% and 7.97%, respectively, with no more than 1% additional battery consumption.
SpikeHard: Efficiency-Driven Neuromorphic Hardware for Heterogeneous Systems-on-Chip
Judicael Clair, Guy Eichler, Luca P. Carloni. DOI: 10.1145/3609101

Neuromorphic computing is an emerging field with the potential to offer performance and energy-efficiency gains over traditional machine learning approaches. Most neuromorphic hardware, however, has been designed with little concern for the problem of integrating it with other components in a heterogeneous System-on-Chip (SoC). Building on a state-of-the-art reconfigurable neuromorphic architecture, we present the design of a neuromorphic hardware accelerator equipped with a programmable interface that simplifies both integration into an SoC and communication with the processor on the SoC. To optimize the allocation of on-chip resources, we develop an optimizer that restructures existing neuromorphic models for a given hardware architecture, and we perform design-space exploration to find highly efficient implementations. We conduct experiments with various FPGA-based prototypes of many-accelerator SoCs in which Linux-based applications running on a RISC-V processor invoke Pareto-optimal implementations of our accelerator alongside third-party accelerators. These experiments demonstrate that our neuromorphic hardware, which is up to 89× faster and 170× more energy efficient after applying our optimizer, can be used in synergy with other accelerators for different application purposes.
Predictable GPU Wavefront Splitting for Safety-Critical Systems
Artem Klashtorny, Zhuanhao Wu, Anirudh Mohan Kaushik, Hiren Patel. DOI: 10.1145/3609102

We present a predictable wavefront splitting (PWS) technique for graphics processing units (GPUs). PWS improves the performance of GPU applications by reducing the impact of branch divergence while ensuring that worst-case execution time (WCET) estimates can be computed. This makes PWS well suited to safety-critical domains, such as autonomous driving, avionics, and space, that require strict temporal guarantees. In developing PWS on an AMD-based GPU, we propose microarchitectural enhancements to the GPU and a compiler pass that eliminates branch serializations to reduce the WCET of a wavefront. Our analysis shows that PWS achieves an 11% performance improvement over existing architectures with a lower WCET than prior work on wavefront splitting.