An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes
Robert Lyerly, Carlos Bilbao, Changwoo Min, Christopher J. Rossbach, Binoy Ravindran
ACM Transactions on Computer Systems. Published 2022-07-05. DOI: 10.1145/3505224 (https://dl.acm.org/doi/full/10.1145/3505224)
Abstract
In this work, we present libHetMP, an OpenMP runtime for automatically and transparently distributing parallel computation across heterogeneous nodes. libHetMP targets platforms comprising CPUs with different instruction set architectures (ISAs) coupled by a high-speed memory interconnect, where cross-ISA binary incompatibility and non-coherent caches require that application data be marshaled before it can be shared across CPUs. Work distribution decisions must therefore account for both the relative compute performance of the asymmetric CPUs and communication overheads. libHetMP makes workload distribution decisions without programmer intervention by measuring performance characteristics during cross-node execution. A novel loop iteration scheduler, HetProbe, decides whether cross-node execution is beneficial: when it is, HetProbe distributes work according to the relative performance of the CPUs; when it is not, it places all work on the set of homogeneous CPUs that provides the best performance. We evaluate libHetMP using compute kernels from several OpenMP benchmark suites and show a geometric mean 41% speedup in execution time across asymmetric CPUs. Because some workloads exhibit irregular behavior across iterations, we extend libHetMP with a second scheduler, HetProbe-I. Our evaluation shows that HetProbe-I, by triggering periodic distribution decisions, can further improve speedup for irregular computation, in some cases by up to 24%.
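The scheduling decision described in the abstract can be illustrated with a small sketch. This is not the libHetMP implementation: the `ProbeResult` type, the `distribute` function, the two per-iteration timings, and the simple cost model (total time equals iterations divided by aggregate throughput) are all assumptions made for illustration. The sketch only shows the shape of the decision: compare the estimated cross-node time against the best single-node time, then either split iterations in proportion to measured throughput or keep all work on the fastest node.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical per-node measurement from a short probing phase."""
    node: str
    local_time_per_iter: float   # seconds/iteration when the node runs alone
    shared_time_per_iter: float  # seconds/iteration during cross-node execution,
                                 # including data marshaling and communication


def distribute(total_iters, probes):
    """Return {node: iteration count} under an assumed, simplified cost model."""
    # Option A: run everything on the single fastest node.
    best = min(probes, key=lambda p: p.local_time_per_iter)
    single_time = total_iters * best.local_time_per_iter

    # Option B: run on all nodes at the throughput observed while sharing data.
    rates = {p.node: 1.0 / p.shared_time_per_iter for p in probes}
    total_rate = sum(rates.values())
    cross_time = total_iters / total_rate

    # Cross-node execution is not beneficial: keep all work on the best node.
    if cross_time >= single_time:
        return {p.node: (total_iters if p is best else 0) for p in probes}

    # Beneficial: split iterations in proportion to measured throughput.
    shares = {n: int(total_iters * r / total_rate) for n, r in rates.items()}
    # Hand iterations lost to rounding to the highest-throughput node.
    fastest = max(rates, key=rates.get)
    shares[fastest] += total_iters - sum(shares.values())
    return shares
```

Under these assumptions, `distribute(1000, [ProbeResult("x86", 1.0, 1.0), ProbeResult("arm", 4.0, 4.0)])` splits the loop 800/200, while a workload whose per-iteration cost doubles once data must be shared (for example, shared times of 2.0 and 4.0 against local times of 1.0 and 2.0) stays entirely on the faster node. A scheduler in the spirit of HetProbe-I would re-run such a decision periodically with fresh measurements instead of deciding once.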
About the journal:
ACM Transactions on Computer Systems (TOCS) presents research and development results on the design, implementation, analysis, evaluation, and use of computer systems and systems software. The term "computer systems" is interpreted broadly and includes operating systems, systems architecture and hardware, distributed systems, optimizing compilers, and the interaction between systems and computer networks. Articles appearing in TOCS will tend either to present new techniques and concepts, or to report on experiences and experiments with actual systems. Insights useful to system designers, builders, and users will be emphasized.
TOCS publishes research and technical papers, both short and long. It includes technical correspondence to permit commentary on technical topics and on previously published papers.