Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444789
Hugo Reymond, Jean-Luc Béchennec, M. Briday, Sébastien Faucou, Isabelle Puaut, Erven Rohou
Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present Schematic, a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. Schematic ensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. Schematic takes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested Schematic for different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51% compared to related techniques.
{"title":"SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems","authors":"Hugo Reymond, Jean-Luc Béchennec, M. Briday, Sébastien Faucou, Isabelle Puaut, Erven Rohou","doi":"10.1109/CGO57630.2024.10444789","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444789","url":null,"abstract":"Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present Schematic,a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. Schematicensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. Schematictakes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested Schematicfor different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51 % compared to related techniques.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"34 3","pages":"258-269"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140285711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/cgo57630.2024.10444821
{"title":"CGO 2024 Sponsors and Supporters","authors":"","doi":"10.1109/cgo57630.2024.10444821","DOIUrl":"https://doi.org/10.1109/cgo57630.2024.10444821","url":null,"abstract":"","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"54 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444828
Ivan R. Ivanov, O. Zinenko, Jens Domke, Toshio Endo, William S. Moses
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understands the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources such as fast memory and registers, let alone because it does not use newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs by performing automatic translation from CUDA, and we simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
{"title":"Retargeting and Respecializing GPU Workloads for Performance Portability","authors":"Ivan R. Ivanov, O. Zinenko, Jens Domke, Toshio Endo, William S. Moses","doi":"10.1109/CGO57630.2024.10444828","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444828","url":null,"abstract":"In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs have led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs by performing automatic translation from CUDA and simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 27% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"53 7","pages":"119-132"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444869
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, Vahl Scott Gordon
In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU.
{"title":"Instruction Scheduling for the GPU on the GPU","authors":"Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, Vahl Scott Gordon","doi":"10.1109/CGO57630.2024.10444869","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444869","url":null,"abstract":"In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"65 1","pages":"435-447"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444817
Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, Simone Campanoni
Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and the layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., through alias analysis, and is further hampered when attempting memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis of the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing the peak memory usage of mcf by 20.8%.
{"title":"Representing Data Collections in an SSA Form","authors":"Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, Simone Campanoni","doi":"10.1109/CGO57630.2024.10444817","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444817","url":null,"abstract":"Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of Memoir is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, Memoir compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our Memoir compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"24 10","pages":"308-321"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140285857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444856
Alexis Engelke, Tobias Schwarz
Low compilation times are highly important in the context of just-in-time (JIT) compilation. This not only applies to language runtimes for Java, WebAssembly, or JavaScript, but is also crucial for database systems that employ query compilation as the primary means of achieving high throughput combined with low query execution time. We present a performance comparison and detailed analysis of the compile times of the JIT compilation back-ends provided by GCC, LLVM, Cranelift, and a single-pass compiler in the context of database queries. Our results show that LLVM achieves the highest execution performance, but can compile substantially faster when tuned for low compilation time. Cranelift achieves run-time performance similar to unoptimized LLVM, but compiles just 20–35% faster and is outperformed by the single-pass compiler, which compiles code 16x faster than Cranelift at similar execution performance.
{"title":"Compile-Time Analysis of Compiler Frameworks for Query Compilation","authors":"Alexis Engelke, Tobias Schwarz","doi":"10.1109/CGO57630.2024.10444856","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444856","url":null,"abstract":"Low compilation times are highly important in contexts of Just-in-time compilation. This not only applies to language runtimes for Java, WebAssembly, or JavaScript, but is also crucial for database systems that employ query compilation as the primary measure for achieving high throughput in combination with low query execution time. We present a performance comparison and detailed analysis of the compile times of the JIT compilation back-ends provided by GCC, LLVM, Cranelift, and a single-pass compiler in the context of database queries. Our results show that LLVM achieves the highest execution performance, but can compile substantially faster when tuning for low compilation time. Cranelift achieves a similar run-time performance to unoptimized LLVM, but compiles just 20–35% faster and is outperformed by the single-pass compiler, which compiles code 16x faster than Cranelift at similar execution performance.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"59 8","pages":"233-244"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/cgo57630.2024.10444881
{"title":"CGO 2024 Organization","authors":"","doi":"10.1109/cgo57630.2024.10444881","DOIUrl":"https://doi.org/10.1109/cgo57630.2024.10444881","url":null,"abstract":"","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"36 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140285710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444884
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, Lin Gao
The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are widely used to solve a broad range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems, which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers: one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, a large number of additional data-flow facts needs to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADROID, and evaluated it on 1,228 open-source Android apps. Evaluation results show that SADROID significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: run-time performance is sped up by up to 17.89x and memory usage is reduced by up to 9x.
{"title":"Boosting the Performance of Multi-Solver IFDS Algorithms with Flow-Sensitivity Optimizations","authors":"Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, Lin Gao","doi":"10.1109/CGO57630.2024.10444884","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444884","url":null,"abstract":"The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are popularly used to solve a wide range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers, one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, large amount of additional data-flow facts need to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADROID, and evaluated it on 1,228 open-source Android APPs. Evaluation results show that SADROID significantly outperforms FLowDROID (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: the run time performance is sped up by up to 17.89X and memory usage is optimized by up to 9X.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"62 3","pages":"296-307"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/cgo57630.2024.10444811
{"title":"Welcome from the General Chairs","authors":"","doi":"10.1109/cgo57630.2024.10444811","DOIUrl":"https://doi.org/10.1109/cgo57630.2024.10444811","url":null,"abstract":"","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"61 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444847
Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, Hugh Leather
Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert-identified heuristics. However, there is an unknown number of heuristics buried, unmarked and unexposed, inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: when modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and use a differential-testing approach to decide whether we have found a heuristic. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use the identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline, we reduce binary sizes by up to 11.6% when considering single heuristics only, and by up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application.
{"title":"Revealing Compiler Heuristics Through Automated Discovery and Optimization","authors":"Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, Hugh Leather","doi":"10.1109/CGO57630.2024.10444847","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444847","url":null,"abstract":"Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: When modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide using a Differential Testing approach if we found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -ozbaseline we reduce binary sizes by up to 11.6% considering single heuristics only and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"60 6","pages":"55-66"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}