Savvas Sioutas, S. Stuijk, H. Corporaal, T. Basten, L. Somers
Memory-bound applications depend heavily on the memory bandwidth of the system to achieve high performance. Improving temporal and/or spatial locality through loop transformations is a common way of mitigating this dependency. However, choosing the right combination of optimizations is not trivial, because most of them alter the memory access pattern of the application and, as a result, interfere with the efficiency of the hardware prefetching mechanisms present in modern architectures. We propose an optimization algorithm that analytically classifies an algorithmic description of a loop nest in order to decide whether it should be optimized for temporal or for spatial locality, while also taking hardware prefetching into account. We implement our technique as a tool to be used with the Halide compiler and test it on a variety of benchmarks. We find an average performance improvement of over 40% compared to previous analytical models targeting the Halide language and compiler.
{"title":"Loop transformations leveraging hardware prefetching","authors":"Savvas Sioutas, S. Stuijk, H. Corporaal, T. Basten, L. Somers","doi":"10.1145/3168823","DOIUrl":"https://doi.org/10.1145/3168823","url":null,"abstract":"Memory-bound applications heavily depend on the bandwidth of the system in order to achieve high performance. Improving temporal and/or spatial locality through loop transformations is a common way of mitigating this dependency. However, choosing the right combination of optimizations is not a trivial task, due to the fact that most of them alter the memory access pattern of the application and as a result interfere with the efficiency of the hardware prefetching mechanisms present in modern architectures. We propose an optimization algorithm that analytically classifies an algorithmic description of a loop nest in order to decide whether it should be optimized stressing its temporal or spatial locality, while also taking hardware prefetching into account. We implement our technique as a tool to be used with the Halide compiler and test it on a variety of benchmarks. We find an average performance improvement of over 40% compared to previous analytical models targeting the Halide language and compiler.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124735508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many applications provide inherent resilience to some amount of error and can potentially trade accuracy for performance by using approximate computing. Applications running on GPUs often use local memory to minimize the number of global memory accesses and to speed up execution. Local memory can also be very useful for improving the way approximate computation is performed, e.g., by improving the quality of approximation with data reconstruction techniques. This paper introduces local memory-aware perforation techniques specifically designed for the acceleration and approximation of GPU kernels. We propose a local memory-aware kernel perforation technique that first skips the loading of parts of the input data from global memory, and later uses reconstruction techniques on local memory to reach higher accuracy while delivering performance similar to state-of-the-art techniques. Experiments show that our approach is able to accelerate the execution of a variety of applications by 1.6× to 3× while introducing an average error of 6%, which is much smaller than that of other approaches. Results further show how strongly the error depends on the input data and application scenario, as well as the impact of local memory tuning and of different parameter configurations.
{"title":"Local memory-aware kernel perforation","authors":"Daniel Maier, Biagio Cosenza, B. Juurlink","doi":"10.1145/3168814","DOIUrl":"https://doi.org/10.1145/3168814","url":null,"abstract":"Many applications provide inherent resilience to some amount of error and can potentially trade accuracy for performance by using approximate computing. Applications running on GPUs often use local memory to minimize the number of global memory accesses and to speed up execution. Local memory can also be very useful to improve the way approximate computation is performed, e.g., by improving the quality of approximation with data reconstruction techniques. This paper introduces local memory-aware perforation techniques specifically designed for the acceleration and approximation of GPU kernels. We propose a local memory-aware kernel perforation technique that first skips the loading of parts of the input data from global memory, and later uses reconstruction techniques on local memory to reach higher accuracy while having performance similar to state-of-the-art techniques. Experiments show that our approach is able to accelerate the execution of a variety of applications from 1.6× to 3× while introducing an average error of 6%, which is much smaller than that of other approaches. Results further show how much the error depends on the input data and application scenario, the impact of local memory tuning and different parameter configurations.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116480419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
New trends in edge computing encourage pushing more of the compute and analytics to the outer edge and processing most of the data locally. We explore how to transparently provide resiliency for heavy-duty edge applications running on low-power devices that must deal with frequent and unpredictable power disruptions. Complicating this process further are (a) memory usage restrictions in tiny low-power devices, which affect not only performance but also the efficacy of the resiliency techniques, and (b) differing resiliency requirements across deployment environments. Nevertheless, an application developer wants the ability to write an application once and have it be reusable across all low-power platforms and across all deployment settings. In response to these challenges, we have devised a transparent roll-back recovery mechanism that performs incremental checkpoints with minimal execution time overhead and at variable granularities. Our solution includes the co-design of firmware, runtime and compiler transformations for providing seamless fault tolerance, along with an auto-tuning layer that automatically generates multiple resilient variants of an application. Each variant spreads the application's execution over atomic transactional regions of a certain granularity. Variants with smaller regions provide better resiliency but incur higher overhead; thus, there is no single best option, but rather a Pareto-optimal set of configurations. We apply these strategies across a variety of edge device applications and measure the execution time overhead of the framework on a TI MSP430FR6989. When we restrict uninterrupted atomic intervals to 100 ms, our framework keeps the geomean overhead below 2.48×.
{"title":"Automating efficient variable-grained resiliency for low-power IoT systems","authors":"Sara S. Baghsorkhi, Christos Margiolas","doi":"10.1145/3168816","DOIUrl":"https://doi.org/10.1145/3168816","url":null,"abstract":"New trends in edge computing encourage pushing more of the compute and analytics to the outer edge and processing most of the data locally. We explore how to transparently provide resiliency for heavy duty edge applications running on low-power devices that must deal with frequent and unpredictable power disruptions. Complicating this process further are (a) memory usage restrictions in tiny low-power devices, that affect not only performance but efficacy of the resiliency techniques, and (b) differing resiliency requirements across deployment environments. Nevertheless, an application developer wants the ability to write an application once, and have it be reusable across all low-power platforms and across all different deployment settings. In response to these challenges, we have devised a transparent roll-back recovery mechanism that performs incremental checkpoints with minimal execution time overhead and at variable granularities. Our solution includes the co-design of firmware, runtime and compiler transformations for providing seamless fault-tolerance, along with an auto-tuning layer that automatically generates multiple resilient variants of an application. Each variant spreads application’s execution over atomic transactional regions of a certain granularity. Variants with smaller regions provide better resiliency, but incur higher overhead; thus, there is no single best option, but rather a Pareto optimal set of configurations. We apply these strategies across a variety of edge device applications and measure the execution time overhead of the framework on a TI MSP430FR6989. When we restrict unin- terrupted atomic intervals to 100ms, our framework keeps geomean overhead below 2.48x.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"156 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126207344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Stojanov, Ivaylo Toskov, Tiark Rompf, Markus Püschel
Managed language runtimes such as the Java Virtual Machine (JVM) provide adequate performance for a wide range of applications, but at the same time, they lack much of the low-level control that performance-minded programmers appreciate in languages like C/C++. One important example is the intrinsics interface that exposes instructions of SIMD (Single Instruction Multiple Data) vector ISAs (Instruction Set Architectures). In this paper we present an automatic approach for including native intrinsics in the runtime of a managed language. Our implementation consists of two parts. First, for each vector ISA, we automatically generate the intrinsics API from the vendor-provided XML specification. Second, we employ a metaprogramming approach that enables programmers to generate and load native code at runtime. In this setting, programmers can use the entire high-level language as a kind of macro system to define new high-level vector APIs with zero overhead. As an example use case we show a variable precision API. We provide an end-to-end implementation of our approach in the HotSpot VM that supports all 5912 Intel SIMD intrinsics from MMX to AVX-512. Our benchmarks demonstrate that this combination of SIMD and metaprogramming enables developers to write high-performance, vectorized code on an unmodified JVM that outperforms the auto-vectorizing HotSpot just-in-time (JIT) compiler and provides tight integration between vectorized native code and the managed JVM ecosystem.
{"title":"SIMD intrinsics on managed language runtimes","authors":"A. Stojanov, Ivaylo Toskov, Tiark Rompf, Markus Püschel","doi":"10.1145/3168810","DOIUrl":"https://doi.org/10.1145/3168810","url":null,"abstract":"Managed language runtimes such as the Java Virtual Machine (JVM) provide adequate performance for a wide range of applications, but at the same time, they lack much of the low-level control that performance-minded programmers appreciate in languages like C/C++. One important example is the intrinsics interface that exposes instructions of SIMD (Single Instruction Multiple Data) vector ISAs (Instruction Set Architectures). In this paper we present an automatic approach for including native intrinsics in the runtime of a managed language. Our implementation consists of two parts. First, for each vector ISA, we automatically generate the intrinsics API from the vendor-provided XML specification. Second, we employ a metaprogramming approach that enables programmers to generate and load native code at runtime. In this setting, programmers can use the entire high-level language as a kind of macro system to define new high-level vector APIs with zero overhead. As an example use case we show a variable precision API. We provide an end-to-end implementation of our approach in the HotSpot VM that supports all 5912 Intel SIMD intrinsics from MMX to AVX-512. Our benchmarks demonstrate that this combination of SIMD and metaprogramming enables developers to write high-performance, vectorized code on an unmodified JVM that outperforms the auto-vectorizing HotSpot just-in-time (JIT) compiler and provides tight integration between vectorized native code and the managed JVM ecosystem.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134104941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vasileios Porpodas, Rodrigo C. O. Rocha, L. F. Góes
Auto-vectorizing compilers automatically generate vector (SIMD) instructions out of scalar code. The state-of-the-art algorithm for straight-line code vectorization is Superword-Level Parallelism (SLP). In this work we identify a major limitation at the core of the SLP algorithm, in the performance-critical step of collecting the vectorization candidate instructions that form the SLP-graph data structure. SLP lacks global knowledge when building its vectorization graph, which negatively affects its local decisions when it encounters commutative instructions. We propose LSLP, an improved algorithm that can plug into existing SLP implementations and can effectively vectorize code with arbitrarily long chains of commutative operations. LSLP relies on short-depth look-ahead for better-informed local decisions. Our evaluation on a real machine shows that LSLP can significantly improve the performance of real-world code with little compilation-time overhead.
{"title":"Look-ahead SLP: auto-vectorization in the presence of commutative operations","authors":"Vasileios Porpodas, Rodrigo C. O. Rocha, L. F. Góes","doi":"10.1145/3168807","DOIUrl":"https://doi.org/10.1145/3168807","url":null,"abstract":"Auto-vectorizing compilers automatically generate vector (SIMD) instructions out of scalar code. The state-of-the-art algorithm for straight-line code vectorization is Superword-Level Parallelism (SLP). In this work we identify a major limitation at the core of the SLP algorithm, in the performance-critical step of collecting the vectorization candidate instructions that form the SLP-graph data structure. SLP lacks global knowledge when building its vectorization graph, which negatively affects its local decisions when it encounters commutative instructions. We propose LSLP, an improved algorithm that can plug-in to existing SLP implementations, and can effectively vectorize code with arbitrarily long chains of commutative operations. LSLP relies on short-depth look-ahead for better-informed local decisions. Our evaluation on a real machine shows that LSLP can significantly improve the performance of real-world code with little compilation-time overhead.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121526215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Long Zheng, Xiaofei Liao, Hai Jin, Jieshan Zhao, Qinggang Wang
Existing constraint-solving-based techniques enable efficient and high-coverage concurrency debugging. Yet, there remains a significant gap between the state of the art and the state of programming practice when it comes to scaling to long-running program executions. In this paper, we revisit the scalability problem of state-of-the-art constraint-solving-based techniques. Our key insight is that concurrency debugging for many real-world bugs can be turned into a graph traversal problem. We therefore present GraphDebugger, a novel debugging framework that enables scalable concurrency analysis on program graphs via a tailored graph-parallel analysis in a distributed environment. We verify that GraphDebugger is more capable than CLAP in reproducing real-world bugs that involve complex concurrency analysis. Our extensive evaluation on 7 real-world programs shows that GraphDebugger (deployed on an 8-node EC2-like cluster) reproduces concurrency bugs within 1∼8 minutes, whereas CLAP does so in 1∼30 hours or even fails to return a solution.
{"title":"Scalable concurrency debugging with distributed graph processing","authors":"Long Zheng, Xiaofei Liao, Hai Jin, Jieshan Zhao, Qinggang Wang","doi":"10.1145/3168817","DOIUrl":"https://doi.org/10.1145/3168817","url":null,"abstract":"Existing constraint-solving-based technique enables an efficient and high-coverage concurrency debugging. Yet, there remains a significant gap between the state of the art and the state of the programming practices for scaling to handle long-running execution of programs. In this paper, we revisit the scalability problem of state-of-the-art constraint-solving-based technique. Our key insight is that concurrency debugging for many real-world bugs can be turned into a graph traversal problem. We therefore present GraphDebugger, a novel debugging framework to enable the scalable concurrency analysis on program graphs via a tailored graph-parallel analysis in a distributed environment. It is verified that GraphDebugger is more capable than CLAP in reproducing the real-world bugs that involve a complex concurrency analysis. Our extensive evaluation on 7 real-world programs shows that, GraphDebugger (deployed on an 8-node EC2 like cluster) is significantly efficient to reproduce concurrency bugs within 1∼8 minutes while CLAP does so with 1∼30 hours, or even without returning solutions.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117348236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erick Bauman, Huibo Wang, Mingwei Zhang, Zhiqiang Lin
Intel SGX provides a secure enclave in which code and data are hidden from the outside world, including privileged code such as the OS or hypervisor. However, by default, enclave code prior to initialization can be disassembled and therefore no secrets can be embedded in the binary. This is a problem for developers wishing to protect code secrets. This paper introduces SGXElide, a nearly transparent framework that enables enclave code confidentiality. The key idea is to treat program code as data and dynamically restore secrets after an enclave is initialized. SGXElide can be integrated into any enclave, providing a mechanism to securely decrypt or deliver the secret code with the assistance of a developer-controlled trusted remote party. We have implemented SGXElide atop a recently released version of the Linux SGX SDK, and our evaluation with a number of programs shows that SGXElide can be used to protect the code secrecy of practical applications with no overhead after enclave initialization.
{"title":"SGXElide: enabling enclave code secrecy via self-modification","authors":"Erick Bauman, Huibo Wang, Mingwei Zhang, Zhiqiang Lin","doi":"10.1145/3168833","DOIUrl":"https://doi.org/10.1145/3168833","url":null,"abstract":"Intel SGX provides a secure enclave in which code and data are hidden from the outside world, including privileged code such as the OS or hypervisor. However, by default, enclave code prior to initialization can be disassembled and therefore no secrets can be embedded in the binary. This is a problem for developers wishing to protect code secrets. This paper introduces SGXElide, a nearly-transparent framework that enables enclave code confidentiality. The key idea is to treat program code as data and dynamically restore secrets after an enclave is initialized. SGXElide can be integrated into any enclave, providing a mechanism to securely decrypt or deliver the secret code with the assistance of a developer-controlled trusted remote party. We have implemented SGXElide atop a recently released version of the Linux SGX SDK, and our evaluation with a number of programs shows that SGXElide can be used to protect the code secrecy of practical applications with no overhead after enclave initialization.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126687130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in stencil computations. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e., loop chains. The performance of stencil applications can be improved by modifying their execution schedules. However, modern architectures are increasingly constrained by memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs, such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.
{"title":"Transforming loop chains via macro dataflow graphs","authors":"Eddie C. Davis, M. Strout, C. Olschanowsky","doi":"10.1145/3168832","DOIUrl":"https://doi.org/10.1145/3168832","url":null,"abstract":"This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"316 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132162262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programmable microfluidic laboratories-on-a-chip (LoCs) offer the benefits of automation and miniaturization to the life sciences. This paper presents an updated version of the BioCoder language and a fully static (offline) compiler that can target an emerging class of LoCs called Digital Microfluidic Biochips (DMFBs), which manipulate discrete droplets of liquid on a 2D electrode grid. The BioCoder language and runtime execution engine leverage advances in sensor integration to enable specification, compilation, and execution of assays (bio-chemical procedures) that feature online decision-making based on sensory data acquired during assay execution. The compiler features a novel hybrid intermediate representation (IR) that interleaves fluidic operations with computations performed on sensor data. The IR extends the traditional notions of liveness and interference to fluidic variables and operations, as needed to target the DMFB, which itself can be viewed as a spatially reconfigurable array. The code generator converts the IR into the following: (1) a set of electrode activation sequences for each basic block in the control flow graph (CFG); (2) a set of computations performed on sensor data, which dynamically determine the result of each control flow operation; and (3) a set of electrode activation sequences for each control flow transfer operation (CFG edge). The compiler is validated using a software simulator which produces animated videos of realistic bioassay execution on a DMFB.
{"title":"A compiler for cyber-physical digital microfluidic biochips","authors":"C. Curtis, D. Grissom, P. Brisk","doi":"10.1145/3168826","DOIUrl":"https://doi.org/10.1145/3168826","url":null,"abstract":"Programmable microfluidic laboratories-on-a-chip (LoCs) offer the benefits of automation and miniaturization to the life sciences. This paper presents an updated version of the BioCoder language and a fully static (offline) compiler that can target an emerging class of LoCs called Digital Microfluidic Biochips (DMFBs), which manipulate discrete droplets of liquid on a 2D electrode grid. The BioCoder language and runtime execution engine leverage advances in sensor integration to enable specification, compilation, and execution of assays (bio-chemical procedures) that feature online decision-making based on sensory data acquired during assay execution. The compiler features a novel hybrid intermediate representation (IR) that interleaves fluidic operations with computations performed on sensor data. The IR extends the traditional notions of liveness and interference to fluidic variables and operations, as needed to target the DMFB, which itself can be viewed as a spatially reconfigurable array. The code generator converts the IR into the following: (1) a set of electrode activation sequences for each basic block in the control flow graph (CFG); (2) a set of computations performed on sensor data, which dynamically determine the result of each control flow operation; and (3) a set of electrode activation sequences for each control flow transfer operation (CFG edge). The compiler is validated using a software simulator which produces animated videos of realistic bioassay execution on a DMFB.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133629719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Selecting collection data structures for a given application is a crucial aspect of software development. Inefficient usage of collections has been identified as a major cause of performance bloat in applications written in Java, C++ and C#. Furthermore, a single implementation might not be optimal throughout the entire program execution. This demands an adaptive solution that adjusts the collection implementations at runtime to varying workloads. We present CollectionSwitch, an application-level framework for efficient collection adaptation. It selects collection implementations at runtime in order to optimize the execution and memory performance of an application. Unlike previous works, we use workload data at the level of collection allocation sites to guide the optimization process. Our framework identifies allocation sites which instantiate suboptimal collection variants, and selects optimized variants for future instantiations. As a further contribution we propose adaptive collection implementations which switch their underlying data structures according to the size of the collection. We implement this framework in Java, and demonstrate improvements in terms of time and memory behavior across a range of benchmarks. To our knowledge, it is the first approach capable of runtime performance optimization of Java collections with very low overhead.
{"title":"CollectionSwitch: a framework for efficient and dynamic collection selection","authors":"D. Costa, A. Andrzejak","doi":"10.1145/3168825","DOIUrl":"https://doi.org/10.1145/3168825","url":null,"abstract":"Selecting collection data structures for a given application is a crucial aspect of the software development. Inefficient usage of collections has been credited as a major cause of performance bloat in applications written in Java, C++ and C#. Furthermore, a single implementation might not be optimal throughout the entire program execution. This demands an adaptive solution that adjusts at runtime the collection implementations to varying workloads. We present CollectionSwitch, an application-level framework for efficient collection adaptation. It selects at runtime collection implementations in order to optimize the execution and memory performance of an application. Unlike previous works, we use workload data on the level of collection allocation sites to guide the optimization process. Our framework identifies allocation sites which instantiate suboptimal collection variants, and selects optimized variants for future instantiations. As a further contribution we propose adaptive collection implementations which switch their underlying data structures according to the size of the collection. We implement this framework in Java, and demonstrate the improvements in terms of time and memory behavior across a range of benchmarks. To our knowledge, it is the first approach which is capable of runtime performance optimization of Java collections with very low overhead.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132357897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}