OpenMP运行时跨缓存不一致异构节点的透明工作共享

IF 2 4区 计算机科学 Q2 COMPUTER SCIENCE, THEORY & METHODS ACM Transactions on Computer Systems Pub Date : 2022-07-05 DOI:https://dl.acm.org/doi/full/10.1145/3505224
Robert Lyerly, Carlos Bilbao, Changwoo Min, Christopher J. Rossbach, Binoy Ravindran
{"title":"OpenMP运行时跨缓存不一致异构节点的透明工作共享","authors":"Robert Lyerly, Carlos Bilbao, Changwoo Min, Christopher J. Rossbach, Binoy Ravindran","doi":"https://dl.acm.org/doi/full/10.1145/3505224","DOIUrl":null,"url":null,"abstract":"<p>In this work, we present <monospace>libHetMP</monospace>, an OpenMP runtime for automatically and transparently distributing parallel computation across heterogeneous nodes. <monospace>libHetMP</monospace> targets platforms comprising CPUs with different instruction set architectures (ISA) coupled by a high-speed memory interconnect, where cross-ISA binary incompatibility and non-coherent caches require application data be marshaled to be shared across CPUs. Because of this, work distribution decisions must take into account both relative compute performance of asymmetric CPUs and communication overheads. <monospace>libHetMP</monospace> drives workload distribution decisions without programmer intervention by measuring performance characteristics during cross-node execution. A novel HetProbe loop iteration scheduler decides if cross-node execution is beneficial and either distributes work according to the relative performance of CPUs when it is or places all work on the set of homogeneous CPUs providing the best performance when it is not. We evaluate <monospace>libHetMP</monospace> using compute kernels from several OpenMP benchmark suites and show a geometric mean 41% speedup in execution time across asymmetric CPUs. Because some workloads may showcase irregular behavior among iterations, we extend <monospace>libHetMP</monospace> with a second scheduler called HetProbe-I. The evaluation of HetProbe-I shows it can further improve speedup for irregular computation, in some cases up to a 24%, by triggering periodic distribution decisions.</p>","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"7 1","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2022-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes\",\"authors\":\"Robert Lyerly, Carlos Bilbao, Changwoo Min, Christopher J. Rossbach, Binoy Ravindran\",\"doi\":\"https://dl.acm.org/doi/full/10.1145/3505224\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>In this work, we present <monospace>libHetMP</monospace>, an OpenMP runtime for automatically and transparently distributing parallel computation across heterogeneous nodes. <monospace>libHetMP</monospace> targets platforms comprising CPUs with different instruction set architectures (ISA) coupled by a high-speed memory interconnect, where cross-ISA binary incompatibility and non-coherent caches require application data be marshaled to be shared across CPUs. Because of this, work distribution decisions must take into account both relative compute performance of asymmetric CPUs and communication overheads. <monospace>libHetMP</monospace> drives workload distribution decisions without programmer intervention by measuring performance characteristics during cross-node execution. A novel HetProbe loop iteration scheduler decides if cross-node execution is beneficial and either distributes work according to the relative performance of CPUs when it is or places all work on the set of homogeneous CPUs providing the best performance when it is not. We evaluate <monospace>libHetMP</monospace> using compute kernels from several OpenMP benchmark suites and show a geometric mean 41% speedup in execution time across asymmetric CPUs. Because some workloads may showcase irregular behavior among iterations, we extend <monospace>libHetMP</monospace> with a second scheduler called HetProbe-I. The evaluation of HetProbe-I shows it can further improve speedup for irregular computation, in some cases up to a 24%, by triggering periodic distribution decisions.</p>\",\"PeriodicalId\":50918,\"journal\":{\"name\":\"ACM Transactions on Computer Systems\",\"volume\":\"7 1\",\"pages\":\"\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2022-07-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Computer Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/https://dl.acm.org/doi/full/10.1145/3505224\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Computer Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/full/10.1145/3505224","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 0

摘要

在这项工作中,我们提出了libHetMP,一个OpenMP运行时,用于在异构节点上自动透明地分布并行计算。libHetMP的目标平台包括具有不同指令集架构(ISA)的cpu,通过高速内存互连,其中跨ISA二进制不兼容性和非相干缓存要求应用程序数据被封送以跨cpu共享。因此,工作分配决策必须同时考虑非对称cpu的相对计算性能和通信开销。libHetMP通过测量跨节点执行期间的性能特征,在没有程序员干预的情况下驱动工作负载分配决策。一个新颖的HetProbe循环迭代调度器决定跨节点执行是否有益,并根据cpu的相对性能分配工作,或者将所有工作放在提供最佳性能的同构cpu集上。我们使用来自几个OpenMP基准套件的计算内核来评估libHetMP,并显示在非对称cpu上执行时间的几何平均加速为41%。由于某些工作负载可能在迭代中表现出不规则的行为,因此我们使用第二个称为HetProbe-I的调度器扩展libHetMP。对HetProbe-I的评估表明,通过触发周期性分配决策,它可以进一步提高不规则计算的加速速度,在某些情况下高达24%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes

In this work, we present libHetMP, an OpenMP runtime for automatically and transparently distributing parallel computation across heterogeneous nodes. libHetMP targets platforms comprising CPUs with different instruction set architectures (ISA) coupled by a high-speed memory interconnect, where cross-ISA binary incompatibility and non-coherent caches require application data be marshaled to be shared across CPUs. Because of this, work distribution decisions must take into account both relative compute performance of asymmetric CPUs and communication overheads. libHetMP drives workload distribution decisions without programmer intervention by measuring performance characteristics during cross-node execution. A novel HetProbe loop iteration scheduler decides if cross-node execution is beneficial and either distributes work according to the relative performance of CPUs when it is or places all work on the set of homogeneous CPUs providing the best performance when it is not. We evaluate libHetMP using compute kernels from several OpenMP benchmark suites and show a geometric mean 41% speedup in execution time across asymmetric CPUs. Because some workloads may showcase irregular behavior among iterations, we extend libHetMP with a second scheduler called HetProbe-I. The evaluation of HetProbe-I shows it can further improve speedup for irregular computation, in some cases up to a 24%, by triggering periodic distribution decisions.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
ACM Transactions on Computer Systems
ACM Transactions on Computer Systems 工程技术-计算机:理论方法
CiteScore
4.00
自引率
0.00%
发文量
7
审稿时长
1 months
期刊介绍: ACM Transactions on Computer Systems (TOCS) presents research and development results on the design, implementation, analysis, evaluation, and use of computer systems and systems software. The term "computer systems" is interpreted broadly and includes operating systems, systems architecture and hardware, distributed systems, optimizing compilers, and the interaction between systems and computer networks. Articles appearing in TOCS will tend either to present new techniques and concepts, or to report on experiences and experiments with actual systems. Insights useful to system designers, builders, and users will be emphasized. TOCS publishes research and technical papers, both short and long. It includes technical correspondence to permit commentary on technical topics and on previously published papers.
期刊最新文献
PMAlloc: A Holistic Approach to Improving Persistent Memory Allocation Trinity: High-Performance and Reliable Mobile Emulation through Graphics Projection Hardware-software Collaborative Tiered-memory Management Framework for Virtualization Diciclo: Flexible User-level Services for Efficient Multitenant Isolation Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1