Communication and Timing Issues with MPI Virtualization

Alexandr Nigay, L. Mosimann, Timo Schneider, T. Hoefler
{"title":"MPI虚拟化的通信和定时问题","authors":"Alexandr Nigay, L. Mosimann, Timo Schneider, T. Hoefler","doi":"10.1145/3416315.3416317","DOIUrl":null,"url":null,"abstract":"Computation–communication overlap and good load balance are features central to high performance of parallel programs. Unfortunately, achieving them with MPI requires considerably increasing the complexity of user code. Our work contributes to the alternative solution to this problem: using a virtualized MPI implementation. Virtualized MPI implementations diverge from traditional MPI implementations in that they map MPI processes to user-level threads instead of operating-system processes and launch more of them than there are CPU cores in the system. They are capable of providing automatic computation–communication overlap and load balance with little to no changes to pre-existing MPI user code. Our work has uncovered new insights into MPI virtualization: Two new kinds of timers are needed: an MPI-process timer and a CPU-core timer, the same discussion also applies to performance counters and the MPI profiling interface. We also observe an interplay between the degree of CPU oversubscription and the rendezvous communication protocol: we find that the intuitive expectation of only two MPI processes per CPU core being enough to achieve full computation–communication overlap is wrong for the rendezvous protocol—instead, three MPI processes per CPU core are required in that case. Our findings are expected to be applicable to all virtualized MPI implementations as well as to general tasking runtime systems.","PeriodicalId":176723,"journal":{"name":"Proceedings of the 27th European MPI Users' Group Meeting","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Communication and Timing Issues with MPI Virtualization\",\"authors\":\"Alexandr Nigay, L. Mosimann, Timo Schneider, T. Hoefler\",\"doi\":\"10.1145/3416315.3416317\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Computation–communication overlap and good load balance are features central to high performance of parallel programs. Unfortunately, achieving them with MPI requires considerably increasing the complexity of user code. Our work contributes to the alternative solution to this problem: using a virtualized MPI implementation. Virtualized MPI implementations diverge from traditional MPI implementations in that they map MPI processes to user-level threads instead of operating-system processes and launch more of them than there are CPU cores in the system. They are capable of providing automatic computation–communication overlap and load balance with little to no changes to pre-existing MPI user code. Our work has uncovered new insights into MPI virtualization: Two new kinds of timers are needed: an MPI-process timer and a CPU-core timer, the same discussion also applies to performance counters and the MPI profiling interface. We also observe an interplay between the degree of CPU oversubscription and the rendezvous communication protocol: we find that the intuitive expectation of only two MPI processes per CPU core being enough to achieve full computation–communication overlap is wrong for the rendezvous protocol—instead, three MPI processes per CPU core are required in that case. 
Our findings are expected to be applicable to all virtualized MPI implementations as well as to general tasking runtime systems.\",\"PeriodicalId\":176723,\"journal\":{\"name\":\"Proceedings of the 27th European MPI Users' Group Meeting\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 27th European MPI Users' Group Meeting\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3416315.3416317\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th European MPI Users' Group Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3416315.3416317","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Computation–communication overlap and good load balance are features central to high performance of parallel programs. Unfortunately, achieving them with MPI requires considerably increasing the complexity of user code. Our work contributes to an alternative solution to this problem: using a virtualized MPI implementation. Virtualized MPI implementations diverge from traditional MPI implementations in that they map MPI processes to user-level threads instead of operating-system processes and launch more of them than there are CPU cores in the system. They are capable of providing automatic computation–communication overlap and load balance with little to no changes to pre-existing MPI user code. Our work has uncovered new insights into MPI virtualization: two new kinds of timers are needed, an MPI-process timer and a CPU-core timer; the same discussion also applies to performance counters and the MPI profiling interface. We also observe an interplay between the degree of CPU oversubscription and the rendezvous communication protocol: the intuitive expectation that two MPI processes per CPU core are enough to achieve full computation–communication overlap turns out to be wrong for the rendezvous protocol; instead, three MPI processes per CPU core are required in that case. Our findings are expected to be applicable to all virtualized MPI implementations as well as to general tasking runtime systems.
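The distinction between the two timer kinds can be made concrete with a small sketch. This is a minimal illustration under stated assumptions, not the paper's implementation; the names now, vrank_clock, vrank_resume, vrank_suspend, mpi_process_time, and cpu_core_time are hypothetical and belong to no real MPI library.

```c
/* Hypothetical sketch: how a virtualized MPI runtime might keep the two timer
 * kinds the abstract calls for, when several "virtual ranks" (user-level
 * threads) are multiplexed onto one CPU core. */
#include <time.h>

/* Wall-clock time, in seconds. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Book-keeping attached to each virtual MPI process (names are hypothetical). */
typedef struct {
    double accumulated;  /* total time this virtual rank has actually run */
    double resumed_at;   /* wall time when the scheduler last resumed it */
    int    running;      /* 1 while the rank currently occupies the core */
} vrank_clock;

/* The user-level scheduler would call these around every context switch. */
static void vrank_resume(vrank_clock *c)  { c->resumed_at = now(); c->running = 1; }
static void vrank_suspend(vrank_clock *c) { c->accumulated += now() - c->resumed_at; c->running = 0; }

/* "MPI-process timer": elapsed time of one virtual rank, excluding the
 * intervals during which other virtual ranks occupied the core. */
static double mpi_process_time(const vrank_clock *c)
{
    return c->accumulated + (c->running ? now() - c->resumed_at : 0.0);
}

/* "CPU-core timer": plain wall-clock time of the core since some epoch,
 * shared by all virtual ranks multiplexed onto it. */
static double cpu_core_time(double core_epoch)
{
    return now() - core_epoch;
}
```

Under such a scheme, timing a compute kernel with the MPI-process timer reports only the time the measuring rank actually ran, whereas the CPU-core timer also includes the time the core spent on the other oversubscribed ranks; as the abstract notes, the same split would apply to performance counters and to measurements taken through the MPI profiling interface.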