G. T. Bercea;A. Bataev;A. E. Eichenberger;C. Bertolli;J. K. O'Brien
{"title":"An open-source solution to performance portability for Summit and Sierra supercomputers","authors":"G. T. Bercea;A. Bataev;A. E. Eichenberger;C. Bertolli;J. K. O'Brien","doi":"10.1147/JRD.2019.2955944","DOIUrl":null,"url":null,"abstract":"Programming models that use a higher level of abstraction to express parallelism can target both CPUs and any attached devices, alleviating the maintainability and portability concerns facing today's heterogenous systems. This article describes the design, implementation, and delivery of a compliant OpenMP device offloading implementation for IBM-NVIDIA heterogeneous servers composing the Summit and Sierra supercomputers in the mainline open-source Clang/LLVM compiler and OpenMP runtime projects. From a performance perspective, reconciling the GPU programming model, best suited for massively parallel workloads, with the generality of the OpenMP model was a significant challenge. To achieve both high performance and full portability, we map high-level programming patterns to fine-tuned code generation schemes and customized runtimes that preserve the OpenMP semantics. In the compiler, we implement a low-overhead single-program multiple-data scheme that leverages the GPU native execution model and a fallback scheme to support the generality of OpenMP. Modular design enables the implementation to be extended with new schemes for frequently occurring patterns. Our implementation relies on key optimizations: sharing data among threads, leveraging unified memory, aggressive inlining of runtime calls, memory coalescing, and runtime simplification. We show that for commonly used patterns, performance on the Summit and Sierra GPUs matches that of hand-written native CUDA code.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"12:1-12:23"},"PeriodicalIF":1.3000,"publicationDate":"2019-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2955944","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IBM Journal of Research and Development","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/8928619/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 2
Abstract
Programming models that use a higher level of abstraction to express parallelism can target both CPUs and any attached devices, alleviating the maintainability and portability concerns facing today's heterogenous systems. This article describes the design, implementation, and delivery of a compliant OpenMP device offloading implementation for IBM-NVIDIA heterogeneous servers composing the Summit and Sierra supercomputers in the mainline open-source Clang/LLVM compiler and OpenMP runtime projects. From a performance perspective, reconciling the GPU programming model, best suited for massively parallel workloads, with the generality of the OpenMP model was a significant challenge. To achieve both high performance and full portability, we map high-level programming patterns to fine-tuned code generation schemes and customized runtimes that preserve the OpenMP semantics. In the compiler, we implement a low-overhead single-program multiple-data scheme that leverages the GPU native execution model and a fallback scheme to support the generality of OpenMP. Modular design enables the implementation to be extended with new schemes for frequently occurring patterns. Our implementation relies on key optimizations: sharing data among threads, leveraging unified memory, aggressive inlining of runtime calls, memory coalescing, and runtime simplification. We show that for commonly used patterns, performance on the Summit and Sierra GPUs matches that of hand-written native CUDA code.
期刊介绍:
The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, which features the work of authors in the science, technology and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals.
Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.