Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444789
Hugo Reymond, Jean-Luc Béchennec, M. Briday, Sébastien Faucou, Isabelle Puaut, Erven Rohou
Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present Schematic, a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. Schematic ensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. Schematic takes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested Schematic for different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51% compared to related techniques.
{"title":"SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems","authors":"Hugo Reymond, Jean-Luc Béchennec, M. Briday, Sébastien Faucou, Isabelle Puaut, Erven Rohou","doi":"10.1109/CGO57630.2024.10444789","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444789","url":null,"abstract":"Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present Schematic,a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. Schematicensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. Schematictakes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested Schematicfor different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51 % compared to related techniques.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"34 3","pages":"258-269"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140285711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/cgo57630.2024.10444821
{"title":"CGO 2024 Sponsors and Supporters","authors":"","doi":"10.1109/cgo57630.2024.10444821","DOIUrl":"https://doi.org/10.1109/cgo57630.2024.10444821","url":null,"abstract":"","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"54 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444828
Ivan R. Ivanov, O. Zinenko, Jens Domke, Toshio Endo, William S. Moses
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understands the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources such as fast memory and registers, let alone because it does not use newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by performing automatic translation from CUDA, and we simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
{"title":"Retargeting and Respecializing GPU Workloads for Performance Portability","authors":"Ivan R. Ivanov, O. Zinenko, Jens Domke, Toshio Endo, William S. Moses","doi":"10.1109/CGO57630.2024.10444828","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444828","url":null,"abstract":"In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs have led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs by performing automatic translation from CUDA and simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 27% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"53 7","pages":"119-132"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444869
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, Vahl Scott Gordon
In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU.
{"title":"Instruction Scheduling for the GPU on the GPU","authors":"Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, Vahl Scott Gordon","doi":"10.1109/CGO57630.2024.10444869","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444869","url":null,"abstract":"In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"65 1","pages":"435-447"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444817
Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, Simone Campanoni
Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and the layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., through alias analysis, and is further hampered when attempting memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis of the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing the peak memory usage of mcf by 20.8%.
{"title":"Representing Data Collections in an SSA Form","authors":"Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, Simone Campanoni","doi":"10.1109/CGO57630.2024.10444817","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444817","url":null,"abstract":"Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of Memoir is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, Memoir compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our Memoir compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"24 10","pages":"308-321"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140285857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444856
Alexis Engelke, Tobias Schwarz
Low compilation times are highly important in the context of just-in-time (JIT) compilation. This not only applies to language runtimes for Java, WebAssembly, or JavaScript, but is also crucial for database systems that employ query compilation as the primary means of achieving high throughput combined with low query execution time. We present a performance comparison and detailed analysis of the compile times of the JIT compilation back-ends provided by GCC, LLVM, Cranelift, and a single-pass compiler in the context of database queries. Our results show that LLVM achieves the highest execution performance, but can compile substantially faster when tuned for low compilation time. Cranelift achieves run-time performance similar to unoptimized LLVM, but compiles just 20–35% faster and is outperformed by the single-pass compiler, which compiles code 16x faster than Cranelift at similar execution performance.
{"title":"Compile-Time Analysis of Compiler Frameworks for Query Compilation","authors":"Alexis Engelke, Tobias Schwarz","doi":"10.1109/CGO57630.2024.10444856","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444856","url":null,"abstract":"Low compilation times are highly important in contexts of Just-in-time compilation. This not only applies to language runtimes for Java, WebAssembly, or JavaScript, but is also crucial for database systems that employ query compilation as the primary measure for achieving high throughput in combination with low query execution time. We present a performance comparison and detailed analysis of the compile times of the JIT compilation back-ends provided by GCC, LLVM, Cranelift, and a single-pass compiler in the context of database queries. Our results show that LLVM achieves the highest execution performance, but can compile substantially faster when tuning for low compilation time. Cranelift achieves a similar run-time performance to unoptimized LLVM, but compiles just 20–35% faster and is outperformed by the single-pass compiler, which compiles code 16x faster than Cranelift at similar execution performance.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"59 8","pages":"233-244"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/cgo57630.2024.10444881
{"title":"CGO 2024 Organization","authors":"","doi":"10.1109/cgo57630.2024.10444881","DOIUrl":"https://doi.org/10.1109/cgo57630.2024.10444881","url":null,"abstract":"","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"36 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140285710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444884
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, Lin Gao
The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are widely used to solve a broad range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems, which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers: one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, a large number of additional data-flow facts needs to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADROID, and evaluated it on 1,228 open-source Android apps. Evaluation results show that SADROID significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: run-time performance is sped up by up to 17.89x and memory usage is reduced by up to 9x.
{"title":"Boosting the Performance of Multi-Solver IFDS Algorithms with Flow-Sensitivity Optimizations","authors":"Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, Lin Gao","doi":"10.1109/CGO57630.2024.10444884","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444884","url":null,"abstract":"The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are popularly used to solve a wide range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers, one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, large amount of additional data-flow facts need to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADROID, and evaluated it on 1,228 open-source Android APPs. Evaluation results show that SADROID significantly outperforms FLowDROID (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: the run time performance is sped up by up to 17.89X and memory usage is optimized by up to 9X.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"62 3","pages":"296-307"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/cgo57630.2024.10444811
{"title":"Welcome from the General Chairs","authors":"","doi":"10.1109/cgo57630.2024.10444811","DOIUrl":"https://doi.org/10.1109/cgo57630.2024.10444811","url":null,"abstract":"","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"61 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444847
Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, Hugh Leather
Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert-identified heuristics. However, there is an unknown number of heuristics buried, unmarked and unexposed, inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: when modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and use a differential-testing approach to decide whether we have found a heuristic. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use the identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline, we reduce binary sizes by up to 11.6% when considering single heuristics only, and by up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application.
{"title":"Revealing Compiler Heuristics Through Automated Discovery and Optimization","authors":"Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, Hugh Leather","doi":"10.1109/CGO57630.2024.10444847","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444847","url":null,"abstract":"Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: When modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide using a Differential Testing approach if we found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -ozbaseline we reduce binary sizes by up to 11.6% considering single heuristics only and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"60 6","pages":"55-66"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}