In memory hierarchies, caches play an important role in reducing average memory access latency, so minimizing cache misses can yield significant performance gains. Because set-associative caches are widely used in modern architectures, capacity and conflict misses coexist, and the two types require different optimization strategies. Cache misses are commonly studied with cache simulators, but state-of-the-art simulators usually slow a program down by hundreds to thousands of times, and they have difficulty modeling complex real hardware. To overcome these limitations, measurement-based methods monitor program execution directly on real hardware via performance monitoring units. However, existing measurement-based tools either focus on capacity misses or do not distinguish capacity misses from conflict misses. In this paper, we design and implement CCProf, a lightweight measurement-based profiler that identifies conflict cache misses and associates them with program source code and data structures. CCProf incurs moderate runtime overhead that is at least an order of magnitude lower than that of simulators. In an evaluation on a number of representative programs, CCProf guides optimizations of cache conflict misses that obtain nontrivial speedups.
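To make the distinction between capacity and conflict misses concrete, the following is a minimal sketch of an LRU set-associative cache model. It is not CCProf's method (CCProf measures real hardware); the cache geometry (64-byte lines, 64 sets, 8 ways) and the function names are assumptions chosen for illustration.

```python
def cache_set_index(address, line_size=64, num_sets=64):
    """Return the set that a byte address maps to (assumed geometry)."""
    return (address // line_size) % num_sets


def count_misses(addresses, line_size=64, num_sets=64, ways=8):
    """Count misses in a toy LRU set-associative cache."""
    # One list per set; each holds cache-line numbers, most recently used last.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        lru = sets[line % num_sets]
        if line in lru:
            lru.remove(line)      # hit: refresh its LRU position below
        else:
            misses += 1
            if len(lru) == ways:  # set is full: evict the least recently used line
                lru.pop(0)
        lru.append(line)
    return misses


# Nine lines touched twice. The working set (576 bytes) is far below the
# 32 KiB capacity, so any repeated miss is a conflict miss, not a capacity miss.
conflicting = [i * 4096 for i in range(9)] * 2  # stride 4096: all map to set 0
friendly = [i * 64 for i in range(9)] * 2       # stride 64: nine different sets
```

With the 4096-byte stride all nine lines compete for the eight ways of one set, so every access misses; with the 64-byte stride only the first touch of each line misses.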
{"title":"Lightweight detection of cache conflicts","authors":"Probir Roy, S. Song, S. Krishnamoorthy, Xu Liu","doi":"10.1145/3168819","DOIUrl":"https://doi.org/10.1145/3168819","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
We propose POKER, a permutation-based approach for vectorizing multiple queries over B+-trees. Our key insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons needed for a query to a minimum. Implemented as a C++ template library, POKER is a general-purpose solution for vectorizing queries over indexing trees on multi-core processors equipped with SIMD units. For a set of five representative benchmarks, each evaluated with 24 configurations, POKER outperforms the state of the art on average by 2.11x with a single thread and by 2.28x with eight threads on an Intel Broadwell processor that supports 256-bit AVX2.
{"title":"Poker: permutation-based SIMD execution of intensive tree search by path encoding","authors":"Feng Zhang, Jingling Xue","doi":"10.1145/3168808","DOIUrl":"https://doi.org/10.1145/3168808","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
Qing Zhou, Lian Li, Lei Wang, Jingling Xue, Xiaobing Feng
May-Happen-in-Parallel (MHP) analysis computes whether two statements in a multi-threaded program may execute concurrently. It serves as a basis for many analyses and optimization techniques for concurrent programs. This paper proposes a novel approach to MHP analysis that statically computes vector clocks. Static vector clocks extend the classic vector clock algorithm to handle the complex control-flow structures encountered in static analysis, and we have developed an efficient context-sensitive algorithm to compute them. To the best of our knowledge, this is the first attempt to compute vector clocks statically. Using static vector clocks, we can drastically improve the efficiency of existing MHP analyses without loss of precision: the speedup can be up to 1828X, with a much smaller memory footprint (reduced by up to 150X). We have implemented our analysis in a static data race detector, and experimental results show that our MHP analysis can help remove up to 88% of spurious data race pairs.
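As background for the classic (dynamic) vector clocks that the abstract says are extended, here is a minimal sketch of how clocks order events and expose possible parallelism. It does not reproduce the paper's static, context-sensitive algorithm; the helper names and the two-thread fork/join scenario are assumptions made for illustration.

```python
def tick(clock, tid):
    """Advance thread tid's own component by one."""
    updated = list(clock)
    updated[tid] += 1
    return tuple(updated)


def merge(a, b):
    """Component-wise maximum, used when one thread joins another."""
    return tuple(max(x, y) for x, y in zip(a, b))


def happens_before(a, b):
    """a precedes b if a is component-wise <= b and the clocks differ."""
    return a != b and all(x <= y for x, y in zip(a, b))


def may_happen_in_parallel(a, b):
    """Two distinct events are unordered if neither precedes the other."""
    return a != b and not happens_before(a, b) and not happens_before(b, a)


MAIN, CHILD = 0, 1
s1 = tick((0, 0), MAIN)         # main, before the fork: (1, 0)
c1 = tick(s1, CHILD)            # child inherits the clock at fork: (1, 1)
s2 = tick(s1, MAIN)             # main, after the fork: (2, 0)
s3 = tick(merge(s2, c1), MAIN)  # main, after joining the child: (3, 1)
```

Here `s2` and `c1` are unordered and so may run in parallel, while `s1` precedes `c1` (fork) and `c1` precedes `s3` (join).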
{"title":"May-happen-in-parallel analysis with static vector clocks","authors":"Qing Zhou, Lian Li, Lei Wang, Jingling Xue, Xiaobing Feng","doi":"10.1145/3168813","DOIUrl":"https://doi.org/10.1145/3168813","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
David Leopoldseder, Lukas Stadler, Thomas Würthinger, J. Eisl, Doug Simon, H. Mössenböck
Compilers perform a variety of advanced optimizations to improve the quality of the generated machine code. However, optimizations that depend on the data flow of a program are often limited by control-flow merges. Code duplication can solve this problem by hoisting, i.e. duplicating, instructions from merge blocks into their predecessors. However, finding optimization opportunities enabled by duplication is a non-trivial task that requires compile-time-intensive analysis. This poses a challenge for modern (just-in-time) compilers: duplicating instructions tentatively at every control-flow merge is not feasible, because excessive duplication leads to uncontrolled code growth and increased compile time. Therefore, compilers need to determine whether a duplication is beneficial enough to be performed. This paper proposes a novel approach to determine which duplications should be performed to increase performance. The approach is based on a duplication simulation that enables a compiler to evaluate different success metrics for each potential duplication. Using this information, the compiler can then select the most promising candidates for optimization. We show how to map duplication candidates into an optimization cost model that allows us to trade off different success metrics, including peak performance, code size and compile time. We implemented the approach on top of the GraalVM and evaluated it with the Java DaCapo, Scala DaCapo and JavaScript Octane benchmarks and a micro-benchmark suite, in terms of performance, compilation time and code size increase. We show that our optimization can reach peak performance improvements of up to 40%, with a mean peak performance increase of 5.89%, while it causes a mean code size increase of 9.93% and a mean compile time increase of 18.44%.
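The effect of duplicating a merge block into its predecessors can be illustrated at source level. The two functions below are a hand-written sketch, not output of the GraalVM or of the paper's simulation; the function names and the expression `x * 2 + 1` are assumptions chosen for illustration.

```python
def before_duplication(a, flag):
    """The merge block computes on x, whose value depends on the branch taken."""
    if flag:
        x = 0
    else:
        x = a
    # Merge block: x is not a known constant here, so nothing can be folded.
    return x * 2 + 1


def after_duplication(a, flag):
    """The merge block has been copied into both predecessors."""
    if flag:
        # In this copy x is the constant 0, so 0 * 2 + 1 folds to 1.
        return 1
    else:
        return a * 2 + 1
```

Both functions compute the same result, but the duplicated version exposes a constant-folding opportunity on one path at the price of one extra copy of the merge block, which is the trade-off a cost model has to weigh.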
{"title":"Dominance-based duplication simulation (DBDS): code duplication to enable compiler optimizations","authors":"David Leopoldseder, Lukas Stadler, Thomas Würthinger, J. Eisl, Doug Simon, H. Mössenböck","doi":"10.1145/3168811","DOIUrl":"https://doi.org/10.1145/3168811","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
Sparse matrix-vector multiplication (SpMV) is an important computation kernel widely used in HPC and data centers. The irregularity of SpMV is a well-known challenge that limits how well it can be parallelized with vector operations. Existing work achieves limited locality and vectorization efficiency and incurs large preprocessing overheads. To address this issue, we present Compressed Vectorization-oriented sparse Row (CVR), a novel SpMV representation targeting efficient vectorization. CVR processes multiple rows of the input matrix simultaneously to increase cache efficiency and assigns them to separate SIMD lanes to take advantage of the vector processing units in modern processors. Our method is insensitive to the sparsity and irregularity of the matrix and can therefore handle a variety of scale-free and HPC matrices. We implement and evaluate CVR on an Intel Knights Landing processor and compare it with five state-of-the-art approaches using 58 scale-free and HPC sparse matrices. Experimental results show that CVR achieves speedups of up to 1.70× (1.33× on average) and up to 1.57× (1.10× on average) over the best existing approaches for scale-free and HPC sparse matrices, respectively. Moreover, CVR typically incurs the lowest preprocessing overhead among the compared approaches.
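To show where the irregularity comes from, the sketch below implements scalar SpMV over the widely used compressed sparse row (CSR) layout. CSR is an assumption introduced here as a familiar baseline; it is not the CVR format, whose layout the abstract does not spell out.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0
        # Row lengths differ, so the trip count of this inner loop is irregular,
        # and x is read through col_idx (an indirect, gather-style access).
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1, 2, 3, 4, 5]
result = spmv_csr(row_ptr, col_idx, values, [1, 2, 3])
```

The rows hold two, one and two nonzeros, so a SIMD unit that processes one row at a time would leave lanes idle; spreading several rows across lanes is the kind of remedy the abstract describes.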
{"title":"CVR: efficient vectorization of SpMV on x86 processors","authors":"Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, Lixin Zhang","doi":"10.1145/3168818","DOIUrl":"https://doi.org/10.1145/3168818","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
Qiang Zeng, Lannan Luo, Zhiyun Qian, Xiaojiang Du, Zhoujun Li
Application repackaging is a severe threat to Android users and the market. Existing countermeasures mostly detect repackaging by measuring app similarity and rely on a central party to perform the detection, which is unscalable and imprecise. We instead consider building the detection capability into apps, so that user devices detect repackaging in a decentralized fashion. The main challenge is how to protect the repackaging detection code from attacks. We propose a creative use of logic bombs, which are regularly used in malware, to overcome this challenge. We devise a novel bomb structure: the trigger conditions are constructed to exploit the differences between the attacker and users, such that a bomb that lies dormant on the attacker's side will be activated on one of the user devices, while the repackaging detection code, which is packed as the bomb payload, is kept inactive until the trigger conditions are satisfied. Moreover, the repackaging detection code is woven into the original app code and encrypted; thus, attacks that modify or delete suspicious code will corrupt the app itself. We have implemented a prototype, named BombDroid, that builds repackaging detection into apps through bytecode instrumentation, and the evaluation shows that the technique is effective, efficient, and resilient to various adversarial analyses, including symbolic execution, multi-path exploration, and program slicing.
{"title":"Resilient decentralized Android application repackaging detection using logic bombs","authors":"Qiang Zeng, Lannan Luo, Zhiyun Qian, Xiaojiang Du, Zhoujun Li","doi":"10.1145/3168820","DOIUrl":"https://doi.org/10.1145/3168820","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
General-purpose GPUs have been widely used to accelerate parallel applications. Given a relatively complex programming model and fast architecture evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design. However, existing tools either provide insufficient insight or lack support across different GPU architectures and runtime and driver versions. This paper presents CUDAAdvisor, a profiling framework to guide code optimization on modern NVIDIA GPUs. CUDAAdvisor performs various fine-grained analyses based on profiling results from GPU kernels, such as memory-level analysis (e.g., reuse distance and memory divergence), control-flow analysis (e.g., branch divergence) and code-/data-centric debugging. Unlike prior tools, CUDAAdvisor supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and the Pascal architecture. We present several case studies that yield significant insights to guide GPU code optimization for performance improvement.
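Reuse distance, one of the memory-level metrics named above, can be sketched in a few lines. The definition used here (the number of distinct addresses touched between two consecutive accesses to the same address) is the common one and is an assumption about what CUDAAdvisor reports; the function name and the quadratic-time implementation are purely illustrative.

```python
def reuse_distances(trace):
    """Return the reuse distance of each access, or None for a first access."""
    last_seen = {}  # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses accessed strictly between the two uses.
            between = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


trace = ["a", "b", "c", "a", "b", "b"]
result = reuse_distances(trace)
```

For this trace the second access to `a` has distance 2 (it skips over `b` and `c`), and the final back-to-back access to `b` has distance 0. Small distances suggest data that a cache can retain; large ones suggest likely misses.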
{"title":"CUDAAdvisor: LLVM-based runtime profiling for modern GPUs","authors":"Du Shen, S. Song, Ang Li, Xu Liu","doi":"10.1145/3168831","DOIUrl":"https://doi.org/10.1145/3168831","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
Unlike engineered systems, living cells self-generate, self-organise and self-repair; they undertake massively parallel operations with slow and noisy components in a noisy environment; they sense and actuate at molecular scales; and, most intriguingly, they blur the line between software and hardware. Understanding this biological computation presents a huge challenge to the scientific community. Yet the ultimate destination and prize at the culmination of this scientific journey is the promise of revolutionary and transformative technology: the rational design and implementation of biological function, or more succinctly, the ability to program life.
{"title":"Biological computation (keynote)","authors":"Sara-Jane Dunn","doi":"10.1145/3179541.3179542","DOIUrl":"https://doi.org/10.1145/3179541.3179542","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
Software often suffers from performance bottlenecks, e.g., because some code has a higher computational complexity than expected or because a code change introduces a performance regression. Finding such bottlenecks is challenging for developers and for profiling techniques because both rely on performance tests to execute the software, and such tests are often not available in practice. This paper presents PerfSyn, an approach for synthesizing test programs that expose performance bottlenecks in a given method under test. The basic idea is to repeatedly mutate a program that uses the method so as to systematically increase the amount of work done by the method. We formulate the problem of synthesizing a bottleneck-exposing program as a combinatorial search and show that it can be addressed effectively and efficiently using well-known graph search algorithms. We evaluate the approach with 147 methods from seven Java code bases. PerfSyn automatically synthesizes test programs that expose 22 bottlenecks. The bottlenecks are due to unexpectedly high computational complexity and to performance differences between different versions of the same code.
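The mutate-and-measure idea can be sketched on a toy example. Everything below is hypothetical: the `NaiveSet` class under test, the two mutation operators, and the use of a simple greedy search in place of the graph search algorithms the abstract refers to. It is not PerfSyn's implementation, which targets Java methods.

```python
class NaiveSet:
    """Hypothetical class under test: add() scans linearly for duplicates."""

    def __init__(self):
        self.items = []
        self.comparisons = 0

    def add(self, value):
        for item in self.items:
            self.comparisons += 1
            if item == value:
                return
        self.items.append(value)


def cost(program):
    """Replay the setup calls, then measure the work of one final add(-1)."""
    target = NaiveSet()
    for value in program:
        target.add(value)
    target.comparisons = 0
    target.add(-1)
    return target.comparisons


def mutations(program):
    """Candidate programs: append add() with a repeated or a fresh argument."""
    yield program + [0]
    yield program + [len(program)]


def synthesize(steps):
    """Greedy search: keep the mutation that makes the final call do most work."""
    program = []
    for _ in range(steps):
        program = max(mutations(program), key=cost)
    return program
```

Starting from an empty program, the search learns to insert fresh values rather than repeats, so the measured cost of the final call grows with the program length and reveals that `add` is linear in the set size.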
{"title":"Synthesizing programs that expose performance bottlenecks","authors":"Luca Della Toffola, Michael Pradel, T. Gross","doi":"10.1145/3168830","DOIUrl":"https://doi.org/10.1145/3168830","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}
Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, S. Gorlatch, Christophe Dubach
Stencil computations are widely used in domains ranging from physical simulations to machine learning. They are embarrassingly parallel and fit modern hardware such as Graphics Processing Units well. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers, who have to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieving performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is its encoding of optimizations as a system of extensible rewrite rules that are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations, and it remains to be seen whether this approach is applicable to other domains. This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes.
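The idea of building a stencil from small composable primitives can be sketched with plain Python lists. The helper names `pad` and `slide`, the clamping boundary rule and the three-point sum are assumptions chosen for illustration; the code does not use Lift itself and says nothing about its rewrite rules or generated code.

```python
def pad(left, right, xs):
    """Extend a sequence at both ends by repeating its edge elements."""
    return [xs[0]] * left + list(xs) + [xs[-1]] * right


def slide(size, step, xs):
    """Return all windows of the given size, advancing by step each time."""
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]


def stencil_1d(xs):
    """Three-point sum stencil: map a reduction over the sliding windows."""
    return [sum(window) for window in slide(3, 1, pad(1, 1, xs))]


result = stencil_1d([1, 2, 3, 4])
```

The stencil is nothing more than padding, windowing and a mapped reduction. Applying the same helpers along each dimension of a nested list is one plausible way to picture the multidimensional composition the abstract mentions.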
{"title":"High performance stencil code generation with Lift","authors":"Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, S. Gorlatch, Christophe Dubach","doi":"10.1145/3168824","DOIUrl":"https://doi.org/10.1145/3168824","journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization"},"publicationDate":"2018-02-24"}