E. Tiotto;B. Mahjour;W. Tsang;X. Xue;T. Islam;W. Chen
{"title":"OpenMP 4.5 compiler optimization for GPU offloading","authors":"E. Tiotto;B. Mahjour;W. Tsang;X. Xue;T. Islam;W. Chen","doi":"10.1147/JRD.2019.2962428","DOIUrl":null,"url":null,"abstract":"Ability to efficiently offload computational workloads to graphic processing units (GPUs) is critical for the success of hybrid CPU–GPU architectures, such as the Summit and Sierra supercomputing systems. OpenMP 4.5 is a high-level programming model that enables the development of architecture- and accelerator-independent applications. This article describes aspects of the OpenMP implementation in the IBM XL C/C++ and XL Fortran OpenMP compilers that aid programmers to achieve performance objectives. This includes an interprocedural static analysis the XL optimizer uses to specialize code generation of the OpenMP \n<italic>distribute parallel do</i>\n loop within the dynamic context of a target region, and other compiler optimizations designed to reduce the overhead of data transferred to an offloaded target region. We introduce the heuristic used at runtime to select optimal grid sizes for offloaded target team constructs. These tuned heuristics lead to an average improvement of 2× in the runtime of several target regions in the SPEC ACCEL V1.2 benchmark suite. In addition to performance enhancement, this article also presents an advanced diagnostic feature implemented in the XL Fortran compiler to aid in debugging OpenMP applications offloaded to accelerators.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3000,"publicationDate":"2019-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2962428","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IBM Journal of Research and Development","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/8943339/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 14
Abstract
Ability to efficiently offload computational workloads to graphic processing units (GPUs) is critical for the success of hybrid CPU–GPU architectures, such as the Summit and Sierra supercomputing systems. OpenMP 4.5 is a high-level programming model that enables the development of architecture- and accelerator-independent applications. This article describes aspects of the OpenMP implementation in the IBM XL C/C++ and XL Fortran OpenMP compilers that aid programmers to achieve performance objectives. This includes an interprocedural static analysis the XL optimizer uses to specialize code generation of the OpenMP
distribute parallel do
loop within the dynamic context of a target region, and other compiler optimizations designed to reduce the overhead of data transferred to an offloaded target region. We introduce the heuristic used at runtime to select optimal grid sizes for offloaded target team constructs. These tuned heuristics lead to an average improvement of 2× in the runtime of several target regions in the SPEC ACCEL V1.2 benchmark suite. In addition to performance enhancement, this article also presents an advanced diagnostic feature implemented in the XL Fortran compiler to aid in debugging OpenMP applications offloaded to accelerators.
期刊介绍:
The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, which features the work of authors in the science, technology and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals.
Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.