(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms

Impact Factor: 1.5 · Zone 2 (Computer Science) · JCR Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · ACM Transactions on Programming Languages and Systems · Pub Date: 2024-05-22 · DOI: 10.1145/3665643

Ari Rasch


Citations: 0

Abstract

Data-parallel computations, such as linear algebra routines (BLAS) and stencil computations, constitute one of the most relevant classes in parallel computing, e.g., due to their importance for deep learning. Efficiently de-composing such computations for the memory and core hierarchies of modern architectures and re-composing the computed intermediate results back into the final result – we say (de/re)-composition for short – is key to achieving high performance for these computations on, e.g., GPU and CPU. Current high-level approaches to generating data-parallel code are often restricted to a particular subclass of data-parallel computations and architectures (e.g., only linear algebra routines on only GPU, or only stencil computations), and/or they rely on a user-guided optimization process to obtain a well-performing (de/re)-composition of computations, which is complex and error-prone for the user. We formally introduce a systematic (de/re)-composition approach, based on the algebraic formalism of Multi-Dimensional Homomorphisms (MDHs). Our approach is designed to be general enough to apply to a wide range of data-parallel computations and to various kinds of target parallel architectures. To efficiently target the deep and complex memory and core hierarchies of contemporary architectures, we exploit our (de/re)-composition approach for a correct-by-construction, parametrized cache blocking and parallelization strategy. We show that our approach is powerful enough to express, in the same formalism, the (de/re)-composition strategies of different classes of state-of-the-art approaches (scheduling-based, polyhedral, etc.), and we demonstrate that the parameters of our strategies enable systematically generating code that can be fully automatically optimized (auto-tuned) for the particular target architecture and the characteristics of the input and output data (e.g., their sizes and memory layouts). In particular, our experiments confirm that via auto-tuning, we achieve higher performance than state-of-the-art approaches, including hand-optimized solutions provided by vendors (such as NVIDIA cuBLAS/cuDNN and Intel oneMKL/oneDNN), on real-world data sets and for a variety of data-parallel computations, including: linear algebra routines, stencil and quantum chemistry computations, data mining algorithms, and computations that have recently gained considerable attention due to their relevance for deep learning.
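The (de/re)-composition idea in the abstract can be illustrated with a small sketch. This is not the paper's formalism or generated code; it is a minimal pure-Python example that assumes matrix-vector multiplication as the computation, and the names `matvec`, `tiles`, `matvec_tiled`, `tile_i`, and `tile_k` are hypothetical. The input is de-composed into blocks along both dimensions, each block is computed independently, and the partial results are re-composed with one combine operator per dimension: concatenation along the row dimension `i` and element-wise addition along the reduction dimension `k`. The tile sizes play the role of tunable parameters.

```python
# Illustrative sketch only: names and tiling scheme are assumptions,
# not the API or code generator described in the paper.

def matvec(M, v):
    """Reference computation: w[i] = sum over k of M[i][k] * v[k]."""
    return [sum(m_ik * v_k for m_ik, v_k in zip(row, v)) for row in M]

def tiles(n, size):
    """Split range(n) into consecutive (start, stop) chunks of at most `size`."""
    return [(s, min(s + size, n)) for s in range(0, n, size)]

def matvec_tiled(M, v, tile_i, tile_k):
    """De-compose into (tile_i x tile_k) blocks, compute each, then re-compose.

    Re-composition uses one combine operator per dimension:
      - i dimension: concatenation of partial output vectors
      - k dimension: element-wise addition of partial output vectors
    """
    num_rows, num_cols = len(M), len(v)
    result = []
    for i0, i1 in tiles(num_rows, tile_i):
        acc = [0] * (i1 - i0)
        for k0, k1 in tiles(num_cols, tile_k):
            block = [row[k0:k1] for row in M[i0:i1]]
            partial = matvec(block, v[k0:k1])
            acc = [a + p for a, p in zip(acc, partial)]  # combine in k: addition
        result += acc                                    # combine in i: concatenation
    return result

M = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
v = [1, 0, 2, 1]

# Every tile-size choice yields the same result; on real hardware only the
# performance would differ, which is what an auto-tuner would search over.
candidates = [(ti, tk) for ti in (1, 2, 3) for tk in (1, 2, 4)]
assert all(matvec_tiled(M, v, ti, tk) == matvec(M, v) for ti, tk in candidates)
```

Because the result does not depend on the tile sizes, a search over `candidates` can pick the fastest configuration for a given machine and input shape without affecting correctness. This is the sense in which a parametrized blocking strategy can be auto-tuned.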




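Source journal: ACM Transactions on Programming Languages and Systems (Engineering and Technology: Computer Science, Software Engineering)

CiteScore: 3.10

Self-citation rate: 7.70%

Publication volume: not listed

Review time: >12 weeks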

Journal description: ACM Transactions on Programming Languages and Systems (TOPLAS) is the premier journal for reporting recent research advances in the areas of programming languages and systems to assist the task of programming. Papers can be either theoretical or experimental in style, but in either case they must contain innovative and novel content that advances the state of the art of programming languages and systems. We also invite strictly experimental papers that compare existing approaches, as well as tutorial and survey papers. The scope of TOPLAS includes, but is not limited to, the following subjects: language design for sequential and parallel programming; programming language implementation; programming language semantics; compilers and interpreters; runtime systems for program execution; storage allocation and garbage collection; languages and methods for writing program specifications; languages and methods for secure and reliable programs; testing and verification of programs.