Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444865
Ruobing Han, Jisheng Zhao, Hyesoon Kim
Incremental builds are commonly employed in software development, where minor changes to existing source code are frequently recompiled. Speeding up incremental builds not only enhances the software development workflow but also improves CI/CD systems by enabling faster verification steps. Current solutions for incremental builds rely primarily on build systems that analyze file dependencies to avoid unnecessary recompilation of unchanged files. However, for the files that do change, these build systems simply invoke compilers to recompile them from scratch. This reveals a fundamental asymmetry: build systems operate in a stateful manner, while compilers are stateless. As a result, incremental builds are applied only at a coarse-grained level, covering entire source files, rather than at a finer-grained level that considers individual code sections. In this paper, we propose an approach for enabling fine-grained incremental builds by introducing statefulness into compilers. Under this paradigm, the compiler leverages its profiling history to expedite the compilation of modified source files, thereby reducing overall build time. Specifically, the stateful compiler records which compiler passes were dormant in previous builds and uses this data to bypass those dormant passes during subsequent incremental compilations. We also outline the essential changes needed to transform conventional stateless compilers into stateful ones. For practical evaluation, we modify the Clang compiler to adopt a stateful architecture and evaluate its performance on real-world C++ projects. Our comparative study indicates that the stateful version outperforms the standard Clang compiler in incremental builds, accelerating the end-to-end build process by an average of 6.72%.
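The pass-skipping idea can be illustrated with a toy pass manager. Everything here is hypothetical (the list-of-strings "IR", the pass names, the dormancy rule): the real system works on Clang's pass pipeline and must also verify that an edit does not invalidate the recorded dormancy before skipping a pass.

```python
# Toy sketch: a pass is "dormant" if it left the IR unchanged in the previous
# build; a stateful compiler can record this and skip such passes next time.
def run_passes(ir, passes, dormant_history=None):
    """Run passes over `ir`; return (new_ir, dormant_set), where dormant_set
    records passes that made no change (candidates to skip next build)."""
    dormant_history = dormant_history or set()
    dormant = set()
    for name, pass_fn in passes:
        if name in dormant_history:   # dormant last build: bypass entirely
            dormant.add(name)
            continue
        new_ir = pass_fn(ir)
        if new_ir == ir:              # pass was a no-op this build
            dormant.add(name)
        ir = new_ir
    return ir, dormant

# Toy passes over a list-of-strings "IR"
def dce(ir):  return [i for i in ir if not i.startswith("dead")]
def fold(ir): return ["r = 3" if i == "r = 1 + 2" else i for i in ir]

passes = [("dce", dce), ("fold", fold)]

# Full build: dce finds nothing to delete, so it is recorded as dormant
ir1, hist = run_passes(["r = 1 + 2", "ret r"], passes)
# Incremental build of the slightly edited file: dce is skipped via `hist`
ir2, _ = run_passes(["r = 1 + 2", "ret r", "s = r"], passes, hist)
```

The interesting part is the asymmetry the abstract names: `hist` is state the compiler carries between invocations, which stateless compilers discard.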
"Enabling Fine-Grained Incremental Builds by Making Compiler Stateful." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 221-232.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444807
Wenlei He, Hongtao Yu, Lei Wang, Taewook Oh
The ever-increasing scale of modern data centers demands more effective optimizations, as even a small percentage of performance improvement can yield a significant reduction in data-center cost and environmental footprint. However, the diverse set of workloads running in data centers also challenges the scalability of optimization solutions. Profile-guided optimization (PGO) is a promising technique for improving application performance. Sampling-based PGO is widely used in data-center applications due to its low operational overhead, but its performance gains are not as substantial as those of its instrumentation-based counterpart. The high operational overhead of instrumentation-based PGO, on the other hand, hinders its large-scale adoption despite its superior performance gains. In this paper, we propose CSSPGO, a context-sensitive sampling-based PGO framework with pseudo-instrumentation. CSSPGO offers a more balanced solution that pushes sampling-based PGO performance closer to instrumentation-based PGO while maintaining minimal operational overhead. It leverages pseudo-instrumentation to improve profile quality without incurring the overhead of traditional instrumentation. It also enriches profiles with context sensitivity, through a novel profiling methodology using synchronized LBR and stack sampling, to aid more effective optimizations. CSSPGO is now used to optimize over 75% of Meta's data-center CPU cycles. Our evaluation with production workloads demonstrates a 1%-5% performance improvement on top of state-of-the-art sampling-based PGO.
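The gain from context sensitivity can be sketched in miniature. The sample format below is invented for illustration (real profiles come from synchronized LBR and stack samples, not tuples): the point is only that keying counts by calling context preserves information that a flat profile merges away.

```python
from collections import Counter

# Hypothetical sample stream: each entry is (call stack, sampled leaf).
def aggregate(samples):
    ctx_counts = Counter()   # context-sensitive: (context, leaf) -> count
    flat_counts = Counter()  # conventional flat profile: leaf -> count
    for stack, leaf in samples:
        ctx_counts[(tuple(stack), leaf)] += 1
        flat_counts[leaf] += 1
    return ctx_counts, flat_counts

samples = [
    (["main", "a"], "memcpy"),
    (["main", "a"], "memcpy"),
    (["main", "b"], "memcpy"),
]
ctx, flat = aggregate(samples)
# The flat profile sees one hot `memcpy`; the context-sensitive profile can
# tell the optimizer that the call through `a` dominates, e.g. to guide
# context-specific inlining decisions.
```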
"Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 322-333.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444787
Amir Shaikhha, Mathieu Huot, Shideh Hashemian
Sparse tensors are prevalent in many data-intensive applications. However, existing automatic differentiation (AD) frameworks are tailored towards dense tensors, which makes it a challenge to efficiently compute gradients through sparse tensor operations. This is due to irregular sparsity patterns that can result in substantial memory and computational overheads. We propose a novel framework that enables the efficient AD of sparse tensors. The key aspects of our work include a compilation pipeline leveraging two intermediate DSLs with AD-agnostic domain-specific optimizations followed by efficient C++ code generation. We showcase the effectiveness of our framework in terms of performance and scalability through extensive experimentation, outperforming state-of-the-art alternatives across a variety of synthetic and real-world datasets.
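A minimal illustration of why sparsity matters for AD (this is not the paper's DSL pipeline, just the underlying principle): for a sparse operand, the gradient inherits the sparsity pattern, so neither the forward value nor the gradient needs dense-sized work.

```python
# For y = sum_i x[i] * w[i] with x stored sparsely as index->value,
# dy/dw[i] = x[i], which is nonzero only where x is. A dense AD framework
# would materialize and differentiate the full index space instead.
def sparse_dot_grad(x, w):
    """x: sparse dict index->value; w: dict index->value.
    Returns (y, grad_w) with grad_w touching only x's nonzeros."""
    y = sum(v * w.get(i, 0.0) for i, v in x.items())
    grad_w = {i: v for i, v in x.items()}   # dy/dw[i] = x[i]
    return y, grad_w

x = {0: 2.0, 5: 3.0}             # two nonzeros over a large index space
w = {i: 1.0 for i in range(10)}
y, g = sparse_dot_grad(x, w)
```

The cost of both the value and the gradient is proportional to `len(x)`, not to the dimension of `w`, which is the behavior a sparsity-aware AD compiler aims to preserve through whole pipelines of tensor operations.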
"A Tensor Algebra Compiler for Sparse Differentiation." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1-12.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444810
L. Cicolini, F. Carloni, Marco D. Santambrogio, Davide Conficconi
Regular expression (RE) matching is crucial for identifying strings that exhibit certain morphological properties in a data stream, making it paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, due to their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency from different directions. However, most of them lack a comprehensive ruleset optimization approach that truly pushes pattern-matching performance when considering multiple REs together. Exploiting the morphological similarities among REs within the same dataset reduces memory when storing the patterns and drastically improves dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automaton (MFSA), which extends the finite state automaton (FSA) model to improve RE parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework that manages RE merging and optimization to produce MFSAs. Furthermore, we extend the iNFAnt algorithm for MFSA execution with the novel iMFAnt engine. Our evaluation investigates the impact of MFSA size reduction and compares execution throughput with that of multiple FSAs in both single- and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states, introducing limited compilation-time overhead. Besides, the best iMFAnt configuration achieves a geomean 5.99× throughput improvement and a 4.05× speedup against single and multiple parallel FSAs.
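The state-sharing effect behind the MFSA can be shown in a deliberately simplified setting: literal patterns and a trie stand in for full RE-to-FSA merging, but the arithmetic already demonstrates why merging similar rules shrinks the combined automaton.

```python
# Count states in a trie (a prefix-sharing automaton for literal patterns).
def trie_states(patterns):
    root = {}
    count = 1                       # the root state
    for p in patterns:
        node = root
        for ch in p:
            if ch not in node:      # new transition -> new state
                node[ch] = {}
                count += 1
            node = node[ch]
    return count

# Three rules with heavy morphological similarity (shared "GET /i" prefix)
patterns = ["GET /index", "GET /image", "GET /info"]
merged = trie_states(patterns)                        # one shared automaton
separate = sum(trie_states([p]) for p in patterns)    # one automaton each
```

Here the merged automaton needs 17 states against 32 for three separate ones; real rulesets with thousands of similar REs are where the paper's 71.95% average state compression comes from (though merging general REs requires genuine FSA construction, not a trie).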
"One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 193-206.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444805
Milad Hakimi, Arrvindh Shriraman
Computing gradients is a crucial task in many domains, including machine learning, physics simulations, and scientific computing. Automatic differentiation (AD) computes gradients for arbitrary imperative code. In reverse-mode AD, an auxiliary structure, the tape, is used to transfer the intermediary values required for gradient computation. The challenge is how to organize the tape in the memory hierarchy, since it has a high reuse distance, lacks temporal locality, and inflates the working set by 2-4×. We introduce Tapeflow, a compiler framework to orchestrate and manage the gradient tape. We make three key contributions. i) We introduce the concept of regions, which transforms the tape layout into an array-of-structs format to improve spatial reuse. ii) We schedule the execution into layers and explicitly orchestrate the tape operands using a scratchpad. This reduces the required cache size and on-chip energy. iii) Finally, we stream the tape from DRAM by organizing it into a FIFO of tiles, so that tape operands arrive just in time for each layer. Tapeflow, running on the same hardware, outperforms Enzyme, the state-of-the-art compiler, by 1.3-2.5×, reduces on-chip SRAM usage by 5-40×, and saves 8× on-chip energy. We demonstrate Tapeflow on a wide range of algorithms written in a general-purpose language.
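The tile-streaming idea can be mimicked in a toy reverse-mode pass. The tape format, tile size, and multiply-chain example below are all illustrative; Tapeflow performs this orchestration at the compiler and memory-hierarchy level, not in Python.

```python
TILE = 2   # illustrative tile size; only one tile is "resident" at a time

def forward(x, factors):
    """Forward pass of y = x * f1 * f2 * ...; record the tape as it goes."""
    tape = []
    for f in factors:
        tape.append(f)     # d(out)/d(in) of a multiply is the factor itself
        x = x * f
    return x, tape

def backward(tape, seed=1.0):
    """Consume the tape in reverse, one fixed-size tile at a time, modeling
    a FIFO of tiles streamed from DRAM rather than a random-access tape."""
    grad = seed
    for start in range(len(tape) - TILE, -TILE, -TILE):
        tile = tape[max(start, 0):start + TILE]   # fetch one tile
        for f in reversed(tile):
            grad *= f
    return grad

y, tape = forward(5.0, [2.0, 3.0, 4.0])
g = backward(tape)   # dy/dx = 2 * 3 * 4
```

Because the backward pass touches the tape strictly in reverse tile order, the working set at any moment is one tile, which is the property that lets the real system replace a large cache footprint with a small scratchpad plus streaming.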
"TapeFlow: Streaming Gradient Tapes in Automatic Differentiation." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 81-92.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444819
Alnis Murtovi, G. Georgakoudis, K. Parasyris, Chunhua Liao, Ignacio Laguna, Bernhard Steffen
Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform their transformations. However, whenever a control-flow merge is performed, information is lost, as it is no longer possible to reason precisely about the program. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks into their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging, that enables subsequent optimizations which cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM and evaluated its performance on a collection of GPU benchmarks written in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%.
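A tiny source-level example of what unmerging buys (the real transformation operates on LLVM IR blocks, not Python, and is shown here only to convey the information-loss argument): once the tail after a merge is duplicated into each predecessor, each copy sees a path-specific constant and folds.

```python
def merged(cond):
    # Control-flow merge: after this point the compiler only knows
    # "a is 1 or 2", so the tail expression cannot be folded.
    a = 1 if cond else 2
    return a * 10 + a

def unmerged(cond):
    # Tail duplicated into each predecessor ("unmerged"): every copy has a
    # known constant, so `a * 10 + a` folds per path.
    if cond:
        return 11    # folded from 1 * 10 + 1
    return 22        # folded from 2 * 10 + 2
```

Both functions compute the same values; the duplicated form simply never loses the per-path fact that the merge destroyed, which is what enables the follow-on optimizations the paper measures.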
"Enhancing Performance Through Control-Flow Unmerging and Loop Unrolling on GPUs." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 106-118.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444823
Théo Barollet, C. Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, Fabrice Rastello
Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a well-known example, teachers often represent references/pointers as arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires considerable effort and deep debugging expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways to control the execution and inspect the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written in Python, C, or assembly.
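EasyTracker's actual API is not given in the abstract, so the sketch below does not use it; it only demonstrates the underlying CPython hook (`sys.settrace`) on which a library for controlling execution and inspecting program state can be built for the Python case.

```python
import sys

def trace_locals(fn, *args):
    """Run fn(*args), snapshotting its local variables at every line event --
    the raw material a visualization tool would render as diagrams."""
    snapshots = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            snapshots.append(dict(frame.f_locals))
        return tracer        # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)   # always unhook, even if fn raises
    return result, snapshots

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, snaps = trace_locals(demo, 3)
```

For C or assembly targets the same role would be played by a debugger back end (e.g. driving GDB), which is presumably why a language-agnostic, serializable state representation is needed on top.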
"EasyTracker: A Python Library for Controlling and Inspecting Program Execution." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 359-372.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444862
Jinku Cui, Qidong Zhao, Yueming Hao, Xu Liu
Python has become an increasingly popular programming language, especially in the areas of data analytics and machine learning. Many modern Python packages employ a multi-layer design: the Python layer manages various packages and expresses high-level algorithms; the native layer is written in C/C++/Fortran/CUDA for efficient computation. Typically, each layer manages its own computation and memory and exposes APIs for cross-layer interactions. Without holistic optimization, performance inefficiencies can exist at the boundary between layers. In this paper, we develop DrPy, a novel profiler that pinpoints such memory inefficiencies across layers in Python applications. Unlike existing tools, DrPy takes a hybrid and fine-grained approach to track memory objects and their usage in both Python and native layers. DrPy correlates the behavior of memory objects across layers and builds an object flow graph to pinpoint memory inefficiencies. In addition, DrPy captures rich information associated with object flow graphs, such as call paths and source code attribution to guide intuitive code optimization. Guided by DrPy, we are able to optimize many Python applications with non-trivial performance improvement. Many optimization patches have been validated by application developers and committed to application repositories.
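The object flow graph can be pictured with a toy structure. The node/edge layout and the round-trip check below are assumptions made for illustration, not DrPy's implementation: the idea is only that tagging each allocation with its layer and recording copy edges makes cross-layer inefficiencies queryable.

```python
class ObjectFlowGraph:
    """Toy object flow graph: nodes are allocations tagged with a layer,
    edges record copies/derivations between objects."""

    def __init__(self):
        self.layer = {}   # object name -> "python" | "native"
        self.edges = []   # (src, dst) copy edges

    def alloc(self, name, layer):
        self.layer[name] = layer

    def copy(self, src, dst, layer):
        self.alloc(dst, layer)
        self.edges.append((src, dst))

    def round_trips(self):
        """Copy chains that cross the layer boundary and come straight
        back -- a typical redundant-copy inefficiency at the boundary."""
        found = []
        for a, b in self.edges:
            for b2, c in self.edges:
                if b == b2 and self.layer[a] == self.layer[c] != self.layer[b]:
                    found.append((a, b, c))
        return found

g = ObjectFlowGraph()
g.alloc("arr", "python")
g.copy("arr", "buf", "native")     # handed down to a C extension
g.copy("buf", "arr2", "python")    # copied straight back up: round trip
```

A real profiler would additionally attach call paths and source attribution to each node, which is what turns a detected round trip into an actionable optimization report.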
"DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 245-257.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444771
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, Timothy M. Jones
Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements.
{"title":"OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis","authors":"Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, Timothy M. Jones","doi":"10.1109/CGO57630.2024.10444771","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444771","url":null,"abstract":"Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"57 9","pages":"373-385"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
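The OptiWISE abstract describes its key combination step: a per-instruction CPI obtained as the ratio of sampled cycles to instrumented execution counts. A toy sketch of that arithmetic follows; the addresses, counts, and sample period are invented for illustration and do not come from the paper:

```python
# Per-instruction CPI from (a) sampled cycle counts and (b) instrumented
# execution counts, combining the two profiles as the abstract describes.
cycle_samples = {0x400a10: 900, 0x400a14: 50, 0x400a18: 2050}
exec_counts   = {0x400a10: 1000, 0x400a14: 1000, 0x400a18: 1000}

SAMPLE_PERIOD = 10  # cycles represented by each sample (assumed)

def per_insn_cpi(samples, counts, period):
    # CPI(addr) = (samples * period) / executions; instructions that were
    # never executed are skipped to avoid division by zero.
    return {addr: samples[addr] * period / counts[addr]
            for addr in samples if counts.get(addr, 0) > 0}

cpi = per_insn_cpi(cycle_samples, exec_counts, SAMPLE_PERIOD)
# An instruction whose CPI is far above its neighbors' flags a stall,
# e.g. a cache miss or a mispredicted branch.
hottest = max(cpi, key=cpi.get)
print(hex(hottest), cpi[hottest])  # 0x400a18 20.5
```

Aggregating these per-address values over a loop's or function's address range yields the per-loop and per-line costs the tool also reports.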
Pub Date : 2024-03-02DOI: 10.1109/CGO57630.2024.10444795
Malith Jayaweera, Martin Kong, Yanzhi Wang, D. Kaeli
Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5× and 1.2× improvement and obtain up to 6.3× improvement on non-Polybench high-dimensional affine kernels.
{"title":"Energy-Aware Tile Size Selection for Affine Programs on GPUs","authors":"Malith Jayaweera, Martin Kong, Yanzhi Wang, D. Kaeli","doi":"10.1109/CGO57630.2024.10444795","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444795","url":null,"abstract":"Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5× and 1.2× improvement and obtain up to 6.3× improvement on non-Polybench high-dimensional affine kernels.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"62 11","pages":"13-27"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
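The EATSS abstract describes searching for tile sizes that satisfy architectural resource constraints while minimizing the liveness of in-cache data. The paper uses Z3 on non-linear integer formulations; the sketch below substitutes a brute-force search over a small candidate set so it stays self-contained. The resource limits, shared-memory model, and liveness cost are assumed stand-ins, not EATSS's actual formulation or real GPU parameters:

```python
from itertools import product

# Brute-force stand-in for a solver-based tile size search: pick a 2-D
# tile (Ti, Tj) that fits assumed GPU resource limits while minimizing
# the live in-cache footprint per tile element.
SHARED_MEM_BYTES = 48 * 1024  # assumed shared-memory budget per block
MAX_THREADS      = 1024       # assumed thread limit per block
ELEM_BYTES       = 4          # fp32

def feasible(ti, tj):
    threads = ti * tj                        # one thread per tile element
    smem = (ti * tj + ti + tj) * ELEM_BYTES  # tile plus halo row/column
    return threads <= MAX_THREADS and smem <= SHARED_MEM_BYTES

def liveness_cost(ti, tj):
    # Footprint per unit of reuse: 1 + 1/Ti + 1/Tj. Smaller means data
    # dies sooner in cache -- the property the abstract links to energy
    # efficiency.
    return (ti * tj + ti + tj) / (ti * tj)

candidates = [(ti, tj) for ti, tj in product([8, 16, 32, 64], repeat=2)
              if feasible(ti, tj)]
best = min(candidates, key=lambda t: liveness_cost(*t))
print(best)  # (32, 32)
```

A real solver-based version would encode `feasible` and `liveness_cost` as symbolic constraints and let the solver optimize, which scales to the high-dimensional kernels the abstract mentions far better than enumeration.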