A DSL for Performance Orchestration

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). Pub Date: 2017-09-01. DOI: 10.1109/PACT.2017.50

Thiago Teixeira, D. Padua, W. Gropp



Abstract

The complexity and diversity of today's computer architectures demand more attention from software developers who want to harness all the available computing power. Furthermore, each modern architecture requires a potentially non-overlapping set of optimizations to attain a higher fraction of its nominal peak speed. This creates challenges for performance portability and code maintainability: in particular, how to manage different optimized versions of the same code tailored to different architectures, and how to keep them up to date as new algorithmic features are added. The increasing complexity of the architectures and the growth of the optimization space tend to make compilers deliver unsatisfactory performance, and the gap between the performance of hand-tuned and compiler-generated code has grown dramatically. Even advanced optimization flags are not enough to narrow this gap. On the other hand, optimizing applications manually is very time-consuming, and the developer needs to understand and interact with many different hardware features for each architecture. Successful research has been carried out to assist the programmer in this painful and error-prone process of implementing, optimizing, and porting applications to different architectures. Nonetheless, the adoption of this work has been mostly restricted to specific domains, such as dense linear algebra, Fourier transforms, and signal processing. We have developed ICE, a framework that decouples the performance expert role from the application expert role (separation of concerns). It allows the use of architecture-specific optimizations while keeping the code maintainable in the long term. It orchestrates the application of multiple optimization tools to the application's baseline version and performs an empirical search to find the best sequence of optimizations and their parameters. 
The baseline version is assumed to have no architecture- or compiler-specific optimizations. The optimizations and the empirical search are directed by a domain-specific language (DSL) in an external file. Application code is often dramatically altered by adding multiple optimization cases for each architecture used; this DSL allows the performance expert to apply optimizations without disarranging the original code. The DSL has constructs that expose the options of the optimizations and generate a search space that can be traversed by different search tools. For instance, it has conditional statements that can be used to specify which optimizations should be carried out for each compiler. The DSL is not only the input of the empirical search but also its output: it can be used to save the best sequence of transformations found in previous searches. The application's code is annotated with unique identifiers that are referenced in the DSL. Currently, source-to-source loop optimizations, algorithm selection, and pragma selection are supported. The framework interface is flexible enough to integrate new optimization and search tools, and in case of any failure it falls back to the baseline version. We have applied the framework to linear algebra problems, stencil computations, and a production code for the simulation of plasma-coupled combustion [xpacc], achieving up to 3x speedup. Other works have tried to ease the optimization of applications, but they lack important features that ICE provides. CHiLL, Orio, and X Language simplify the generation of optimized code. Of these, only CHiLL takes the optimization instructions from an external file, but it references loops by their position in the source, so modifications to the source require modifications to the external file, which restricts its use in large production codes. Only Orio empirically evaluates variants of the annotated code. 
In summary, the contributions of the framework are: separation of concerns, incremental adoption, a DSL to specify the optimization space, an interface to plug in and compare different optimization and search tools, and the combination of empirical search with expert knowledge.
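
To make the idea of an externally specified optimization space concrete, here is a minimal Python sketch. It is not ICE's DSL or its API, neither of which is shown in this abstract. Every name in it (`SPACE`, `matmul_loop`, the `tile` and `unroll` parameters, the compilers `cc_a` and `cc_b`, and the `toy_measure` cost model) is a hypothetical stand-in. It illustrates three ideas from the abstract under those assumptions: options keyed by a unique loop identifier, a compiler-specific condition that prunes the space, and an exhaustive empirical search that keeps the baseline when a variant fails.

```python
from itertools import product

# Hypothetical optimization space, keyed by the unique identifier that
# would annotate a loop in the application's source code.
SPACE = {
    "matmul_loop": {
        "tile": [16, 32, 64],
        "unroll": [1, 2, 4],
    }
}


def candidates(space, compiler):
    """Enumerate every parameter combination allowed for this compiler."""
    for loop_id, params in space.items():
        names = sorted(params)
        for values in product(*(params[name] for name in names)):
            config = dict(zip(names, values))
            # Conditional rule (assumed for illustration): the hypothetical
            # compiler "cc_b" is never asked to unroll.
            if compiler == "cc_b" and config["unroll"] != 1:
                continue
            yield loop_id, config


def search(space, compiler, measure, baseline_time):
    """Exhaustive empirical search; returns (best_variant, best_time).

    best_variant is None when no variant beats the baseline or when
    every variant fails, which means the baseline version is kept.
    """
    best_variant, best_time = None, baseline_time
    for loop_id, config in candidates(space, compiler):
        try:
            elapsed = measure(loop_id, config)
        except Exception:
            continue  # a failed variant never replaces the baseline
        if elapsed < best_time:
            best_variant, best_time = (loop_id, config), elapsed
    return best_variant, best_time


def toy_measure(loop_id, config):
    """Stand-in for building and timing a variant (made-up cost model)."""
    if config["tile"] == 64:
        raise RuntimeError("variant failed to build")
    return 8.0 / config["unroll"] + 32.0 / config["tile"]


if __name__ == "__main__":
    print(search(SPACE, "cc_a", toy_measure, baseline_time=10.0))
    print(search(SPACE, "cc_b", toy_measure, baseline_time=10.0))
```

The returned variant could be written back to a file and reused as the starting point of a later run, which loosely mirrors the abstract's point that the DSL serves as both input and output of the search. A real search tool would replace the exhaustive loop with a smarter strategy.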


