Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444865
Ruobing Han, Jisheng Zhao, Hyesoon Kim
Incremental builds are commonly employed in software development, where minor changes to existing source code are frequently recompiled. Speeding up incremental builds not only enhances the software development workflow but also improves CI/CD systems by enabling faster verification steps. Current solutions for incremental builds rely primarily on build systems that analyze file dependencies to avoid unnecessary recompilation of unchanged files. However, for the files that do change, these build systems simply invoke compilers to recompile them from scratch. This reveals a fundamental asymmetry: build systems operate in a stateful manner, while compilers are stateless. As a result, incremental builds are applied only at a coarse-grained level, covering entire source files, rather than at a finer-grained level that considers individual code sections. In this paper, we propose an approach for enabling fine-grained incremental builds by introducing statefulness into compilers. Under this paradigm, the compiler leverages its profiling history to expedite the compilation of modified source files, thereby reducing overall build time. Specifically, the stateful compiler records which compiler passes were dormant in previous builds and uses this data to bypass those dormant passes during subsequent incremental compilations. We also outline the essential changes needed to transform conventional stateless compilers into stateful ones. For practical evaluation, we modify the Clang compiler to adopt a stateful architecture and evaluate its performance on real-world C++ projects. Our comparative study indicates that the stateful version outperforms the standard Clang compiler in incremental builds, accelerating the end-to-end build process by an average of 6.72%.
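The pass-skipping idea can be illustrated with a toy pass manager. Everything here is hypothetical (the list-of-strings "IR", the pass names, the dormancy rule): the real system works on Clang's pass pipeline and must also verify that an edit does not invalidate the recorded dormancy before skipping a pass.

```python
# Toy sketch: a pass is "dormant" if it left the IR unchanged in the previous
# build; a stateful compiler can record this and skip such passes next time.
def run_passes(ir, passes, dormant_history=None):
    """Run passes over `ir`; return (new_ir, dormant_set), where dormant_set
    records passes that made no change (candidates to skip next build)."""
    dormant_history = dormant_history or set()
    dormant = set()
    for name, pass_fn in passes:
        if name in dormant_history:   # dormant last build: bypass entirely
            dormant.add(name)
            continue
        new_ir = pass_fn(ir)
        if new_ir == ir:              # pass was a no-op this build
            dormant.add(name)
        ir = new_ir
    return ir, dormant

# Toy passes over a list-of-strings "IR"
def dce(ir):  return [i for i in ir if not i.startswith("dead")]
def fold(ir): return ["r = 3" if i == "r = 1 + 2" else i for i in ir]

passes = [("dce", dce), ("fold", fold)]

# Full build: dce finds nothing to delete, so it is recorded as dormant
ir1, hist = run_passes(["r = 1 + 2", "ret r"], passes)
# Incremental build of the slightly edited file: dce is skipped via `hist`
ir2, _ = run_passes(["r = 1 + 2", "ret r", "s = r"], passes, hist)
```

The interesting part is the asymmetry the abstract names: `hist` is state the compiler carries between invocations, which stateless compilers discard.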
"Enabling Fine-Grained Incremental Builds by Making Compiler Stateful." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 221-232.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444807
Wenlei He, Hongtao Yu, Lei Wang, Taewook Oh
The ever-increasing scale of modern data centers demands more effective optimizations, as even a small percentage of performance improvement can yield a significant reduction in data-center cost and environmental footprint. However, the diverse set of workloads running in data centers also challenges the scalability of optimization solutions. Profile-guided optimization (PGO) is a promising technique for improving application performance. Sampling-based PGO is widely used in data-center applications due to its low operational overhead, but its performance gains are not as substantial as those of its instrumentation-based counterpart. The high operational overhead of instrumentation-based PGO, on the other hand, hinders its large-scale adoption despite its superior performance gains. In this paper, we propose CSSPGO, a context-sensitive sampling-based PGO framework with pseudo-instrumentation. CSSPGO offers a more balanced solution that pushes sampling-based PGO performance closer to instrumentation-based PGO while maintaining minimal operational overhead. It leverages pseudo-instrumentation to improve profile quality without incurring the overhead of traditional instrumentation. It also enriches profiles with context sensitivity, through a novel profiling methodology using synchronized LBR and stack sampling, to aid more effective optimizations. CSSPGO is now used to optimize over 75% of Meta's data-center CPU cycles. Our evaluation with production workloads demonstrates a 1%-5% performance improvement on top of state-of-the-art sampling-based PGO.
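The gain from context sensitivity can be sketched in miniature. The sample format below is invented for illustration (real profiles come from synchronized LBR and stack samples, not tuples): the point is only that keying counts by calling context preserves information that a flat profile merges away.

```python
from collections import Counter

# Hypothetical sample stream: each entry is (call stack, sampled leaf).
def aggregate(samples):
    ctx_counts = Counter()   # context-sensitive: (context, leaf) -> count
    flat_counts = Counter()  # conventional flat profile: leaf -> count
    for stack, leaf in samples:
        ctx_counts[(tuple(stack), leaf)] += 1
        flat_counts[leaf] += 1
    return ctx_counts, flat_counts

samples = [
    (["main", "a"], "memcpy"),
    (["main", "a"], "memcpy"),
    (["main", "b"], "memcpy"),
]
ctx, flat = aggregate(samples)
# The flat profile sees one hot `memcpy`; the context-sensitive profile can
# tell the optimizer that the call through `a` dominates, e.g. to guide
# context-specific inlining decisions.
```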
"Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 322-333.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444787
Amir Shaikhha, Mathieu Huot, Shideh Hashemian
Sparse tensors are prevalent in many data-intensive applications. However, existing automatic differentiation (AD) frameworks are tailored towards dense tensors, which makes it a challenge to efficiently compute gradients through sparse tensor operations. This is due to irregular sparsity patterns that can result in substantial memory and computational overheads. We propose a novel framework that enables the efficient AD of sparse tensors. The key aspects of our work include a compilation pipeline leveraging two intermediate DSLs with AD-agnostic domain-specific optimizations followed by efficient C++ code generation. We showcase the effectiveness of our framework in terms of performance and scalability through extensive experimentation, outperforming state-of-the-art alternatives across a variety of synthetic and real-world datasets.
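A minimal illustration of why sparsity matters for AD (this is not the paper's DSL pipeline, just the underlying principle): for a sparse operand, the gradient inherits the sparsity pattern, so neither the forward value nor the gradient needs dense-sized work.

```python
# For y = sum_i x[i] * w[i] with x stored sparsely as index->value,
# dy/dw[i] = x[i], which is nonzero only where x is. A dense AD framework
# would materialize and differentiate the full index space instead.
def sparse_dot_grad(x, w):
    """x: sparse dict index->value; w: dict index->value.
    Returns (y, grad_w) with grad_w touching only x's nonzeros."""
    y = sum(v * w.get(i, 0.0) for i, v in x.items())
    grad_w = {i: v for i, v in x.items()}   # dy/dw[i] = x[i]
    return y, grad_w

x = {0: 2.0, 5: 3.0}             # two nonzeros over a large index space
w = {i: 1.0 for i in range(10)}
y, g = sparse_dot_grad(x, w)
```

The cost of both the value and the gradient is proportional to `len(x)`, not to the dimension of `w`, which is the behavior a sparsity-aware AD compiler aims to preserve through whole pipelines of tensor operations.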
"A Tensor Algebra Compiler for Sparse Differentiation." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1-12.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444810
L. Cicolini, F. Carloni, Marco D. Santambrogio, Davide Conficconi
Regular expression (RE) matching is crucial for identifying strings that exhibit certain morphological properties in a data stream, making it paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, due to their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency from different directions. However, most of them lack a comprehensive ruleset optimization approach that truly pushes pattern-matching performance when considering multiple REs together. Exploiting the morphological similarities among REs within the same dataset reduces memory when storing the patterns and drastically improves dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automaton (MFSA), which extends the finite state automaton (FSA) model to improve RE parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework that manages RE merging and optimization to produce MFSAs. Furthermore, we extend the iNFAnt algorithm for MFSA execution with the novel iMFAnt engine. Our evaluation investigates the impact of MFSA size reduction and compares execution throughput with that of multiple FSAs in both single- and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states, introducing limited compilation-time overhead. Besides, the best iMFAnt configuration achieves a geomean 5.99× throughput improvement and a 4.05× speedup against single and multiple parallel FSAs.
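The state-sharing effect behind the MFSA can be shown in a deliberately simplified setting: literal patterns and a trie stand in for full RE-to-FSA merging, but the arithmetic already demonstrates why merging similar rules shrinks the combined automaton.

```python
# Count states in a trie (a prefix-sharing automaton for literal patterns).
def trie_states(patterns):
    root = {}
    count = 1                       # the root state
    for p in patterns:
        node = root
        for ch in p:
            if ch not in node:      # new transition -> new state
                node[ch] = {}
                count += 1
            node = node[ch]
    return count

# Three rules with heavy morphological similarity (shared "GET /i" prefix)
patterns = ["GET /index", "GET /image", "GET /info"]
merged = trie_states(patterns)                        # one shared automaton
separate = sum(trie_states([p]) for p in patterns)    # one automaton each
```

Here the merged automaton needs 17 states against 32 for three separate ones; real rulesets with thousands of similar REs are where the paper's 71.95% average state compression comes from (though merging general REs requires genuine FSA construction, not a trie).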
"One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 193-206.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444805
Milad Hakimi, Arrvindh Shriraman
Computing gradients is a crucial task in many domains, including machine learning, physics simulations, and scientific computing. Automatic differentiation (AD) computes gradients for arbitrary imperative code. In reverse-mode AD, an auxiliary structure, the tape, is used to transfer the intermediary values required for gradient computation. The challenge is how to organize the tape in the memory hierarchy, since it has a high reuse distance, lacks temporal locality, and inflates the working set by 2-4×. We introduce Tapeflow, a compiler framework to orchestrate and manage the gradient tape. We make three key contributions. i) We introduce the concept of regions, which transforms the tape layout into an array-of-structs format to improve spatial reuse. ii) We schedule the execution into layers and explicitly orchestrate the tape operands using a scratchpad. This reduces the required cache size and on-chip energy. iii) Finally, we stream the tape from DRAM by organizing it into a FIFO of tiles, so that tape operands arrive just in time for each layer. Tapeflow, running on the same hardware, outperforms Enzyme, the state-of-the-art compiler, by 1.3-2.5×, reduces on-chip SRAM usage by 5-40×, and saves 8× on-chip energy. We demonstrate Tapeflow on a wide range of algorithms written in a general-purpose language.
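The tile-streaming idea can be mimicked in a toy reverse-mode pass. The tape format, tile size, and multiply-chain example below are all illustrative; Tapeflow performs this orchestration at the compiler and memory-hierarchy level, not in Python.

```python
TILE = 2   # illustrative tile size; only one tile is "resident" at a time

def forward(x, factors):
    """Forward pass of y = x * f1 * f2 * ...; record the tape as it goes."""
    tape = []
    for f in factors:
        tape.append(f)     # d(out)/d(in) of a multiply is the factor itself
        x = x * f
    return x, tape

def backward(tape, seed=1.0):
    """Consume the tape in reverse, one fixed-size tile at a time, modeling
    a FIFO of tiles streamed from DRAM rather than a random-access tape."""
    grad = seed
    for start in range(len(tape) - TILE, -TILE, -TILE):
        tile = tape[max(start, 0):start + TILE]   # fetch one tile
        for f in reversed(tile):
            grad *= f
    return grad

y, tape = forward(5.0, [2.0, 3.0, 4.0])
g = backward(tape)   # dy/dx = 2 * 3 * 4
```

Because the backward pass touches the tape strictly in reverse tile order, the working set at any moment is one tile, which is the property that lets the real system replace a large cache footprint with a small scratchpad plus streaming.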
"TapeFlow: Streaming Gradient Tapes in Automatic Differentiation." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 81-92.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444819
Alnis Murtovi, G. Georgakoudis, K. Parasyris, Chunhua Liao, Ignacio Laguna, Bernhard Steffen
Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform their transformations. However, whenever a control-flow merge is performed, information is lost, as it is no longer possible to reason precisely about the program. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks into their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging, that enables subsequent optimizations which cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM and evaluated its performance on a collection of GPU benchmarks written in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%.
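A tiny source-level example of what unmerging buys (the real transformation operates on LLVM IR blocks, not Python, and is shown here only to convey the information-loss argument): once the tail after a merge is duplicated into each predecessor, each copy sees a path-specific constant and folds.

```python
def merged(cond):
    # Control-flow merge: after this point the compiler only knows
    # "a is 1 or 2", so the tail expression cannot be folded.
    a = 1 if cond else 2
    return a * 10 + a

def unmerged(cond):
    # Tail duplicated into each predecessor ("unmerged"): every copy has a
    # known constant, so `a * 10 + a` folds per path.
    if cond:
        return 11    # folded from 1 * 10 + 1
    return 22        # folded from 2 * 10 + 2
```

Both functions compute the same values; the duplicated form simply never loses the per-path fact that the merge destroyed, which is what enables the follow-on optimizations the paper measures.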
"Enhancing Performance Through Control-Flow Unmerging and Loop Unrolling on GPUs." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 106-118.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444823
Théo Barollet, C. Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, Fabrice Rastello
Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a well-known example, teachers often represent references/pointers as arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires considerable effort and deep debugging expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways to control the execution and inspect the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written in Python, C, or assembly.
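EasyTracker's actual API is not given in the abstract, so the sketch below does not use it; it only demonstrates the underlying CPython hook (`sys.settrace`) on which a library for controlling execution and inspecting program state can be built for the Python case.

```python
import sys

def trace_locals(fn, *args):
    """Run fn(*args), snapshotting its local variables at every line event --
    the raw material a visualization tool would render as diagrams."""
    snapshots = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            snapshots.append(dict(frame.f_locals))
        return tracer        # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)   # always unhook, even if fn raises
    return result, snapshots

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, snaps = trace_locals(demo, 3)
```

For C or assembly targets the same role would be played by a debugger back end (e.g. driving GDB), which is presumably why a language-agnostic, serializable state representation is needed on top.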
"EasyTracker: A Python Library for Controlling and Inspecting Program Execution." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 359-372.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444862
Jinku Cui, Qidong Zhao, Yueming Hao, Xu Liu
Python has become an increasingly popular programming language, especially in the areas of data analytics and machine learning. Many modern Python packages employ a multi-layer design: the Python layer manages various packages and expresses high-level algorithms; the native layer is written in C/C++/Fortran/CUDA for efficient computation. Typically, each layer manages its own computation and memory and exposes APIs for cross-layer interactions. Without holistic optimization, performance inefficiencies can exist at the boundary between layers. In this paper, we develop DrPy, a novel profiler that pinpoints such memory inefficiencies across layers in Python applications. Unlike existing tools, DrPy takes a hybrid and fine-grained approach to track memory objects and their usage in both Python and native layers. DrPy correlates the behavior of memory objects across layers and builds an object flow graph to pinpoint memory inefficiencies. In addition, DrPy captures rich information associated with object flow graphs, such as call paths and source code attribution to guide intuitive code optimization. Guided by DrPy, we are able to optimize many Python applications with non-trivial performance improvement. Many optimization patches have been validated by application developers and committed to application repositories.
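The object flow graph can be pictured with a toy structure. The node/edge layout and the round-trip check below are assumptions made for illustration, not DrPy's implementation: the idea is only that tagging each allocation with its layer and recording copy edges makes cross-layer inefficiencies queryable.

```python
class ObjectFlowGraph:
    """Toy object flow graph: nodes are allocations tagged with a layer,
    edges record copies/derivations between objects."""

    def __init__(self):
        self.layer = {}   # object name -> "python" | "native"
        self.edges = []   # (src, dst) copy edges

    def alloc(self, name, layer):
        self.layer[name] = layer

    def copy(self, src, dst, layer):
        self.alloc(dst, layer)
        self.edges.append((src, dst))

    def round_trips(self):
        """Copy chains that cross the layer boundary and come straight
        back -- a typical redundant-copy inefficiency at the boundary."""
        found = []
        for a, b in self.edges:
            for b2, c in self.edges:
                if b == b2 and self.layer[a] == self.layer[c] != self.layer[b]:
                    found.append((a, b, c))
        return found

g = ObjectFlowGraph()
g.alloc("arr", "python")
g.copy("arr", "buf", "native")     # handed down to a C extension
g.copy("buf", "arr2", "python")    # copied straight back up: round trip
```

A real profiler would additionally attach call paths and source attribution to each node, which is what turns a detected round trip into an actionable optimization report.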
"DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications." 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 245-257.
Pub Date: 2024-03-02 | DOI: 10.1109/CGO57630.2024.10444771
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, Timothy M. Jones
Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements.
{"title":"OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis","authors":"Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, Timothy M. Jones","doi":"10.1109/CGO57630.2024.10444771","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444771","url":null,"abstract":"Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"57 9","pages":"373-385"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
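The OptiWISE abstract describes its key combination step: a per-instruction CPI obtained as the ratio of sampled cycles to instrumented execution counts. A toy sketch of that arithmetic follows; the addresses, counts, and sample period are invented for illustration and do not come from the paper:

```python
# Per-instruction CPI from (a) sampled cycle counts and (b) instrumented
# execution counts, combining the two profiles as the abstract describes.
cycle_samples = {0x400a10: 900, 0x400a14: 50, 0x400a18: 2050}
exec_counts   = {0x400a10: 1000, 0x400a14: 1000, 0x400a18: 1000}

SAMPLE_PERIOD = 10  # cycles represented by each sample (assumed)

def per_insn_cpi(samples, counts, period):
    # CPI(addr) = (samples * period) / executions; instructions that were
    # never executed are skipped to avoid division by zero.
    return {addr: samples[addr] * period / counts[addr]
            for addr in samples if counts.get(addr, 0) > 0}

cpi = per_insn_cpi(cycle_samples, exec_counts, SAMPLE_PERIOD)
# An instruction whose CPI is far above its neighbors' flags a stall,
# e.g. a cache miss or a mispredicted branch.
hottest = max(cpi, key=cpi.get)
print(hex(hottest), cpi[hottest])  # 0x400a18 20.5
```

Aggregating these per-address values over a loop's or function's address range yields the per-loop and per-line costs the tool also reports.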
Pub Date : 2024-03-02DOI: 10.1109/CGO57630.2024.10444795
Malith Jayaweera, Martin Kong, Yanzhi Wang, D. Kaeli
Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5× and 1.2× improvement and obtain up to 6.3× improvement on non-Polybench high-dimensional affine kernels.
{"title":"Energy-Aware Tile Size Selection for Affine Programs on GPUs","authors":"Malith Jayaweera, Martin Kong, Yanzhi Wang, D. Kaeli","doi":"10.1109/CGO57630.2024.10444795","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444795","url":null,"abstract":"Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5× and 1.2× improvement and obtain up to 6.3× improvement on non-Polybench high-dimensional affine kernels.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"62 11","pages":"13-27"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
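The EATSS abstract describes searching for tile sizes that satisfy architectural resource constraints while minimizing the liveness of in-cache data. The paper uses Z3 on non-linear integer formulations; the sketch below substitutes a brute-force search over a small candidate set so it stays self-contained. The resource limits, shared-memory model, and liveness cost are assumed stand-ins, not EATSS's actual formulation or real GPU parameters:

```python
from itertools import product

# Brute-force stand-in for a solver-based tile size search: pick a 2-D
# tile (Ti, Tj) that fits assumed GPU resource limits while minimizing
# the live in-cache footprint per tile element.
SHARED_MEM_BYTES = 48 * 1024  # assumed shared-memory budget per block
MAX_THREADS      = 1024       # assumed thread limit per block
ELEM_BYTES       = 4          # fp32

def feasible(ti, tj):
    threads = ti * tj                        # one thread per tile element
    smem = (ti * tj + ti + tj) * ELEM_BYTES  # tile plus halo row/column
    return threads <= MAX_THREADS and smem <= SHARED_MEM_BYTES

def liveness_cost(ti, tj):
    # Footprint per unit of reuse: 1 + 1/Ti + 1/Tj. Smaller means data
    # dies sooner in cache -- the property the abstract links to energy
    # efficiency.
    return (ti * tj + ti + tj) / (ti * tj)

candidates = [(ti, tj) for ti, tj in product([8, 16, 32, 64], repeat=2)
              if feasible(ti, tj)]
best = min(candidates, key=lambda t: liveness_cost(*t))
print(best)  # (32, 32)
```

A real solver-based version would encode `feasible` and `liveness_cost` as symbolic constraints and let the solver optimize, which scales to the high-dimensional kernels the abstract mentions far better than enumeration.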