Debasmita Lohar, Clothilde Jeangoudoux, Anastasia Volkova, Eva Darulova
Neural networks are increasingly being used as components in safety-critical applications, for instance, as controllers in embedded systems. Their formal safety verification has made significant progress but typically considers only idealized real-valued networks. For practical applications, such neural networks have to be quantized, i.e., implemented in finite-precision arithmetic, which inevitably introduces roundoff errors. Choosing a suitable precision that is both guaranteed to satisfy a roundoff error bound to ensure safety and that is as small as possible to not waste resources is highly nontrivial to do manually. This task is especially challenging when quantizing a neural network in fixed-point arithmetic, where one can choose among a large number of precisions and has to ensure overflow-freedom explicitly. This paper presents the first sound and fully automated mixed-precision quantization approach that specifically targets deep feed-forward neural networks. Our quantization is based on mixed-integer linear programming (MILP) and leverages the unique structure of neural networks and effective over-approximations to make MILP optimization feasible. Our approach efficiently optimizes the number of bits needed to implement a network while guaranteeing a provided error bound. Our evaluation on existing embedded neural controller benchmarks shows that our optimization translates into precision assignments that mostly use fewer machine cycles when compiled to an FPGA with a commercial HLS compiler than code generated by (sound) state-of-the-art. Furthermore, our approach handles significantly more benchmarks substantially faster, especially for larger networks.
Sound Mixed Fixed-Point Quantization of Neural Networks. Debasmita Lohar, Clothilde Jeangoudoux, Anastasia Volkova, Eva Darulova. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3609118, https://doi.org/10.1145/3609118
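As a rough illustration of the two constraints the abstract above names for fixed-point arithmetic (a roundoff error bound and explicit overflow-freedom), the following sketch quantizes a single value. It is not the paper's MILP-based method. The helper names (`to_fixed`, `from_fixed`, `fits`, `min_frac_bits`), round-to-nearest, and the signed two's-complement format are all assumptions made for this example.

```python
import math


def to_fixed(x: float, frac_bits: int) -> int:
    """Round x to the nearest multiple of 2**-frac_bits; return the scaled integer."""
    return round(x * (1 << frac_bits))


def from_fixed(q: int, frac_bits: int) -> float:
    """Convert a scaled integer back to the real value it represents."""
    return q / (1 << frac_bits)


def fits(q: int, total_bits: int) -> bool:
    """Overflow check: is q representable as a signed integer of total_bits bits?"""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return lo <= q <= hi


def min_frac_bits(error_bound: float) -> int:
    """Smallest f such that one round-to-nearest step errs by at most error_bound.

    Rounding to a grid of spacing 2**-f has worst-case error 2**-(f + 1).
    """
    return max(0, math.ceil(-math.log2(error_bound) - 1))


# Hypothetical value and error bound, chosen only for illustration.
x, bound = 0.7, 0.01
f = min_frac_bits(bound)
q = to_fixed(x, f)
print(f, q, abs(from_fixed(q, f) - x) <= bound, fits(q, 8))
```

A full network compounds such errors across layers, which is why choosing per-variable precisions by hand is hard. The sketch covers only a single rounding step.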
We introduce a novel methodology for testing stochastic black-box systems, frequently encountered in embedded systems. Our approach enhances the established black-box checking (BBC) technique to address stochastic behavior. Traditional BBC primarily involves iteratively identifying an input that breaches the system’s specifications by executing the following three phases: the learning phase to construct an automaton approximating the black box’s behavior, the synthesis phase to identify a candidate counterexample from the learned automaton, and the validation phase to validate the obtained candidate counterexample and the learned automaton against the original black-box system. Our method, ProbBBC, refines the conventional BBC approach by (1) employing an active Markov Decision Process (MDP) learning method during the learning phase, (2) incorporating probabilistic model checking in the synthesis phase, and (3) applying statistical hypothesis testing in the validation phase. ProbBBC uniquely integrates these techniques rather than merely substituting each method in the traditional BBC; for instance, the statistical hypothesis testing and the MDP learning procedure exchange information regarding the black-box system’s observation with one another. The experimental results suggest that ProbBBC outperforms an existing method, especially for systems with limited observation.
Probabilistic Black-Box Checking via Active MDP Learning. Junya Shijubo, Masaki Waga, Kohei Suenaga. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3609127, https://doi.org/10.1145/3609127
A crucial design factor for users of smart mobile devices is the latency of graphical interface interaction. Switching a background app to the foreground is a frequent operation on mobile devices, and the latency of this process is highly perceivable to users. Based on an Android smartphone, through analysis of the memory references generated during the app-switching process, we observe that file (virtual) pages and anonymous pages are both heavily involved. However, to our surprise, the amounts of the two types of pages in the main memory are highly imbalanced, and frequent I/O operations on file pages noticeably slow down the app-switching process. In this study, we advocate improving the app-switching latency by rectifying the skewed kernel page reclaiming. Our approach involves two parts: proactive identification of unused anonymous pages and adaptive balance between file pages and anonymous pages. As mobile apps are found to inflate their anonymous pages, we propose identifying unused anonymous pages in sync with the app-switching events. In addition, Android devices replace the swap device with RAM-based zram, and swapping on zram is much faster than file access on flash storage. Without causing thrashing, we propose swapping out as many anonymous pages to zram as possible in order to cache more file pages. We conduct experiments on a Google Pixel phone with realistic user workloads, and the results confirm that our method adapts to different memory requirements and improves the app-switching latency by up to 43% compared with the original kernel.
Rectifying Skewed Kernel Page Reclamation in Mobile Devices for Improving User-Perceivable Latency. Yi-Quan Chou, Lin-Wei Shen, Li-Pin Chang. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3607937, https://doi.org/10.1145/3607937
Data prefetching efficiently reduces the memory access latency in NUCA architectures as the Last Level Cache (LLC) is shared and distributed across multiple cores. But cache pollution generated by the prefetcher reduces its efficiency by causing contention for shared resources such as the LLC and the underlying network. The paper proposes Zero Pollution Prefetcher (ZPP), which eliminates cache pollution for NUCA architectures. For this purpose, ZPP uses an L1 prefetcher and places the prefetched blocks in the data locations of the LLC where modified blocks are stored. Since modified blocks in the LLC are stale and requests for such blocks are served from the exclusively owned private cache, their space unnecessarily consumes power to maintain such stale data in the cache. The benefits of ZPP are as follows. (a) It eliminates cache pollution in L1 and LLC by storing prefetched blocks in LLC locations where stale blocks are stored. (b) It addresses insufficient cache space by placing prefetched blocks in the LLC, as LLCs are larger than the L1 cache. This helps in prefetching more cache blocks, thereby increasing prefetch aggressiveness. (c) Increasing prefetch aggressiveness increases its coverage. (d) It also maintains an equivalent lookup latency to L1 cache for prefetched blocks. Experimentally, it has been found that ZPP increases weighted speedup by 2.19x compared to a system with no prefetching, while prefetch coverage and prefetch accuracy increase by 50% and 12%, respectively, compared to the baseline.
ZPP: A Dynamic Technique to Eliminate Cache Pollution in NoC based MPSoCs. Dipika Deb, John Jose. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3609113, https://doi.org/10.1145/3609113
Safety-critical embedded software is routinely programmed in block-diagram languages. Recent work in the Vélus project specifies such a language and its compiler in the Coq proof assistant. It builds on the CompCert verified C compiler to give an end-to-end proof linking the dataflow semantics of source programs to traces of the generated assembly code. We extend this work with switched blocks, shared variables, reset blocks, and state machines; define a relational semantics to integrate these block- and mode-based constructions into the existing stream-based model; adapt the standard source-to-source rewriting scheme to compile the new constructions; and reestablish the correctness theorem.
Verified Compilation of Synchronous Dataflow with State Machines. Timothy Bourke, Basile Pesin, Marc Pouzet. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3608102, https://doi.org/10.1145/3608102
Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications including machine learning. Network-on-Interposer (NoI) enables integration of multiple chiplets on a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures have a limited computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data flow pattern, and optimizes the inter-chiplet data exchange to extract high performance for multiple types of convolutional neural network (CNN) inference tasks running concurrently. We demonstrate that the Floret architecture reduces latency and energy by up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple CNN tasks simultaneously. Floret achieves high performance and significant energy savings with much lower fabrication cost by exploiting the data-flow awareness of the CNN inference tasks.
Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks. Harsh Sharma, Lukas Pfromm, Rasit Onur Topaloglu, Janardhan Rao Doppa, Umit Y. Ogras, Ananth Kalyanraman, Partha Pratim Pande. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3608098, https://doi.org/10.1145/3608098
Flavio Ponzina, Marco Rios, Alexandre Levisse, Giovanni Ansaloni, David Atienza
Compute memories are memory arrays augmented with dedicated logic to support arithmetic. They support the efficient execution of data-centric computing patterns, such as those characterizing Artificial Intelligence (AI) algorithms. These architectures can provide computing capabilities as part of the memory array structures (In-Memory Computing, IMC) or at their immediate periphery (Near-Memory Computing, NMC). By bringing the processing elements inside (or very close to) storage, compute memories minimize the cost of data access. Moreover, highly parallel (and, hence, high-performance) computations are enabled by exploiting the regular structure of memory arrays. However, the regular layout of memory elements also constrains the data range of inputs and outputs, since the bitwidths of operands and results stored at each address cannot be freely varied. Addressing this challenge, we herein propose a HW/SW co-design methodology combining careful per-layer quantization and inter-layer scaling with lightweight hardware support for overflow-free computation of dot-vector operations. We demonstrate their use to implement the convolutional and fully connected layers of AI models. We embody our strategy in two implementations, based on IMC and NMC, respectively. Experimental results highlight that an area overhead of only 10.5% (for IMC) and 12.9% (for NMC) is required when interfacing with a 2KB subarray. Furthermore, inferences on benchmark CNNs show negligible accuracy degradation due to quantization for equivalent floating-point implementations.
Overflow-free Compute Memories for Edge AI Acceleration. Flavio Ponzina, Marco Rios, Alexandre Levisse, Giovanni Ansaloni, David Atienza. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3609387, https://doi.org/10.1145/3609387
Edward A. Lee, Ravi Akella, Soroush Bateni, Shaokai Lin, Marten Lohstroh, Christian Menard
In distributed applications, Brewer’s CAP theorem tells us that when networks become partitioned (P), one must give up either consistency (C) or availability (A). Consistency is agreement on the values of shared variables; availability is the ability to respond to reads and writes accessing those shared variables. Availability is a real-time property whereas consistency is a logical property. We extend consistency and availability to refer to cyber-physical properties such as the state of the physical system and delays in actuation. We have further extended the CAP theorem to relate quantitative measures of these two properties to quantitative measures of communication and computation latency (L), obtaining a relation called the CAL theorem that is linear in a max-plus algebra. This paper shows how to use the CAL theorem in various ways to help design cyber-physical systems. We develop a methodology for systematically trading off availability and consistency in application-specific ways and to guide the system designer when putting functionality in end devices, in edge computers, or in the cloud. We build on the Lingua Franca coordination language to provide system designers with concrete analysis and design tools to make the required tradeoffs in deployable embedded software.
Consistency vs. Availability in Distributed Cyber-Physical Systems. Edward A. Lee, Ravi Akella, Soroush Bateni, Shaokai Lin, Marten Lohstroh, Christian Menard. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3609119, https://doi.org/10.1145/3609119
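To make the phrase "linear in a max-plus algebra" from the abstract above concrete, here is a minimal sketch of max-plus matrix-vector multiplication, where "addition" is `max` and "multiplication" is ordinary `+`. It does not reproduce the CAL theorem itself. The function name `maxplus_matvec` and the sample latency values are assumptions chosen for illustration.

```python
NEG_INF = float("-inf")  # the max-plus "zero": max(NEG_INF, v) == v


def maxplus_matvec(A, x):
    """Max-plus product y = A (x) x, i.e. y[i] = max over j of (A[i][j] + x[j])."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]


# Hypothetical delays: A[i][j] is the latency from source j to sink i,
# with NEG_INF meaning "no path". x[j] is the time at which source j fires.
A = [
    [1.0, NEG_INF],
    [3.0, 2.0],
]
x = [0.0, 5.0]
y = maxplus_matvec(A, x)
print(y)

# Linearity in max-plus: shifting every input by c shifts every output by c.
c = 10.0
shifted = maxplus_matvec(A, [xj + c for xj in x])
print(shifted == [yi + c for yi in y])
```

Each output is the latest arrival over all incoming paths, which is why relations between delays are naturally expressed in this algebra.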
Nikhilesh Singh, Karthikeyan Renganathan, Chester Rebeiro, Jithin Jose, Ralph Mader
Due to the low costs and energy needed, cyber-physical systems are adopting multi-core processors for their embedded computing requirements. In order to guarantee safety when the application has real-time constraints, a critical requirement is to estimate the worst-case interference from other executing programs. However, the complexity of multi-core hardware inhibits precisely determining the Worst-Case Program Interference. Existing solutions are either prone to overestimate the interference or are not scalable to different hardware sizes and designs. In this paper, we present Kryptonite, an automated framework to synthesize Worst-Case Program Interference (WCPI) environments for multi-core systems. Fundamental to Kryptonite is a set of tiny hardware-specific code gadgets that are crafted to maximize interference locally. The gadgets are arranged using a greedy approach and then molded using a Reinforcement Learning algorithm to create the WCPI environment. We demonstrate Kryptonite on the automotive-grade Infineon AURIX TC399 processor with a wide range of programs that includes a commercial real-time automotive application. We show that, while being easily scalable and tunable, Kryptonite creates WCPI environments increasing the runtime by up to 58% for benchmark applications and 26% for the automotive application.
Kryptonite: Worst-Case Program Interference Estimation on Multi-Core Embedded Systems. Nikhilesh Singh, Karthikeyan Renganathan, Chester Rebeiro, Jithin Jose, Ralph Mader. ACM Transactions on Embedded Computing Systems, published 2023-09-09. DOI: 10.1145/3609128, https://doi.org/10.1145/3609128
Interlaced Magnetic Recording (IMR) is an emerging recording technology for hard-disk drives (HDDs) that provides larger storage capacity at a lower cost. By partially overlapping (interlacing) each bottom track with two adjacent top tracks, IMR-based HDDs increase the data density at the price of some hardware write constraints. To update a bottom track, the data on the two adjacent top tracks must be read and rewritten to avoid losing valid data, which adds the overhead of read-modify-write (RMW) operations. In recent years, researchers have therefore proposed various data management schemes that mitigate this overhead in order to improve write performance. However, these designs do not take into account the data characteristics of the file system, which is a crucial layer of the operating system for storing data to and retrieving data from HDDs. Consequently, the write performance improvement is limited because these designs are unaware of the spatial locality and hotness of data. This paper proposes a file-system-aware data management scheme called FSIMR to improve system write performance. Noticing that data in the same directory may have higher spatial locality and are mostly updated at the same time, FSIMR logically partitions the IMR-based HDD into fixed-sized zones; data belonging to the same directory are placed in one zone to reduce the time spent seeking to-be-updated data (seek time). Furthermore, cold data within a zone are placed on bottom tracks and updated in an out-of-place manner to eliminate RMW operations. Our experimental results show that the proposed FSIMR can reduce the seek time by up to 14% without introducing additional RMW operations, compared to existing designs.
{"title":"FSIMR: File-system-aware Data Management for Interlaced Magnetic Recording","authors":"Yi-Han Lien, Yen-Ting Chen, Yuan-Hao Chang, Yu-Pei Liang, Wei-Kuan Shih","doi":"10.1145/3607922","DOIUrl":"https://doi.org/10.1145/3607922","url":null,"abstract":"Interlaced Magnetic Recording (IMR) is an emerging recording technology for hard-disk drives (HDDs) that provides larger storage capacity at a lower cost. By partially overlapping (interlacing) each bottom track with two adjacent top tracks, IMR-based HDDs successfully increase the data density while incurring some hardware write constraints. To update each bottom track, the data on two adjacent top tracks must be read and rewritten to avoid losing their valid data, resulting in additional overhead for performing read-modify-write (RMW) operations. Therefore, researchers have proposed various data management schemes to mitigate such overhead in recent years, aiming at improving the write performance. However, these designs have not taken into account the data characteristics of the file system, which is a crucial layer of operating systems for storing/retrieving data into/from HDDs. Consequently, the write performance improvement is limited due to the unawareness of spatial locality and hotness of data. This paper proposes a file-system-aware data management scheme called FSIMR to improve system write performance. Noticing that data of the same directory may have higher spatial locality and are mostly updated at the same time, FSIMR logically partitions the IMR-based HDD into fixed-sized zones; data belonging to the same directory will be arranged to one zone to reduce the time of seeking to-be-updated data (seek time). Furthermore, cold data within a zone are arranged to bottom tracks and updated in an out-of-place manner to eliminate RMW operations. 
Our experimental results show that the proposed FSIMR could reduce the seek time by up to 14% without introducing additional RMW operations, compared to existing designs.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}