Inter-loop optimization in RAJA using loop chains

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing Pub Date : 2021-06-03 DOI:10.1145/3447818.3461665

Brandon Neth, T. Scogland, B. Supinski, M. Strout

{"title":"Inter-loop optimization in RAJA using loop chains","authors":"Brandon Neth, T. Scogland, B. Supinski, M. Strout","doi":"10.1145/3447818.3461665","DOIUrl":null,"url":null,"abstract":"Typical parallelization approaches such as OpenMP and CUDA provide constructs for parallelizing and blocking for data locality for individual loops. By focusing on each loop separately, these approaches fail to leverage sources of data locality possible due to inter-loop data reuse. The loop chain abstraction provides a framework for reasoning about and applying inter-loop optimizations. In this work, we incorporate the loop chain abstraction into RAJA, a performance portability layer for high-performance computing applications. Using the loop-chain-extended RAJA, or RAJALC, developers can have the RAJA library apply loop transformations like loop fusion and overlapped tiling while maintaining the original structure of their programs. By introducing targeted symbolic execution capabilities, we can collect and cache data access information required to verify loop transformations. We evaluate the performance improvement and refactoring costs of our extension. Overall, our results demonstrate 85-98\\% of the performance improvements of hand-optimized kernels with dramatically fewer code changes.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"22 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447818.3461665","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

Abstract

Typical parallelization approaches such as OpenMP and CUDA provide constructs for parallelizing and blocking for data locality for individual loops. By focusing on each loop separately, these approaches fail to leverage sources of data locality possible due to inter-loop data reuse. The loop chain abstraction provides a framework for reasoning about and applying inter-loop optimizations. In this work, we incorporate the loop chain abstraction into RAJA, a performance portability layer for high-performance computing applications. Using the loop-chain-extended RAJA, or RAJALC, developers can have the RAJA library apply loop transformations like loop fusion and overlapped tiling while maintaining the original structure of their programs. By introducing targeted symbolic execution capabilities, we can collect and cache data access information required to verify loop transformations. We evaluate the performance improvement and refactoring costs of our extension. Overall, our results demonstrate 85-98\% of the performance improvements of hand-optimized kernels with dramatically fewer code changes.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于循环链的RAJA循环间优化

典型的并行化方法，如OpenMP和CUDA，为单个循环的数据局部性提供了并行化和阻塞结构。由于分别关注每个循环，这些方法无法利用由于循环间数据重用而可能产生的数据源局部性。循环链抽象为推理和应用循环间优化提供了一个框架。在这项工作中，我们将循环链抽象合并到RAJA中，RAJA是高性能计算应用程序的性能可移植性层。使用环链扩展的RAJA或rajarc，开发人员可以让RAJA库应用循环转换，如循环融合和重叠平铺，同时保持程序的原始结构。通过引入目标符号执行功能，我们可以收集和缓存验证循环转换所需的数据访问信息。我们评估了扩展的性能改进和重构成本。总的来说，我们的结果表明，手工优化内核的性能提高了85- 98%，而代码更改却少得多。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

自引率

0.00%

发文量

期刊最新文献

Accelerating BWA-MEM Read Mapping on GPUs. Dynamic Memory Management in Massively Parallel Systems: A Case on GPUs. Priority Algorithms with Advice for Disjoint Path Allocation Problems From Data of Internet of Things to Domain Knowledge: A Case Study of Exploration in Smart Agriculture On Two Variants of Induced Matchings