Communication and Timing Issues with MPI Virtualization

Alexandr Nigay, L. Mosimann, Timo Schneider, T. Hoefler
{"title":"MPI虚拟化的通信和定时问题","authors":"Alexandr Nigay, L. Mosimann, Timo Schneider, T. Hoefler","doi":"10.1145/3416315.3416317","DOIUrl":null,"url":null,"abstract":"Computation–communication overlap and good load balance are features central to high performance of parallel programs. Unfortunately, achieving them with MPI requires considerably increasing the complexity of user code. Our work contributes to the alternative solution to this problem: using a virtualized MPI implementation. Virtualized MPI implementations diverge from traditional MPI implementations in that they map MPI processes to user-level threads instead of operating-system processes and launch more of them than there are CPU cores in the system. They are capable of providing automatic computation–communication overlap and load balance with little to no changes to pre-existing MPI user code. Our work has uncovered new insights into MPI virtualization: Two new kinds of timers are needed: an MPI-process timer and a CPU-core timer, the same discussion also applies to performance counters and the MPI profiling interface. We also observe an interplay between the degree of CPU oversubscription and the rendezvous communication protocol: we find that the intuitive expectation of only two MPI processes per CPU core being enough to achieve full computation–communication overlap is wrong for the rendezvous protocol—instead, three MPI processes per CPU core are required in that case. Our findings are expected to be applicable to all virtualized MPI implementations as well as to general tasking runtime systems.","PeriodicalId":176723,"journal":{"name":"Proceedings of the 27th European MPI Users' Group Meeting","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Communication and Timing Issues with MPI Virtualization\",\"authors\":\"Alexandr Nigay, L. Mosimann, Timo Schneider, T. Hoefler\",\"doi\":\"10.1145/3416315.3416317\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Computation–communication overlap and good load balance are features central to high performance of parallel programs. Unfortunately, achieving them with MPI requires considerably increasing the complexity of user code. Our work contributes to the alternative solution to this problem: using a virtualized MPI implementation. Virtualized MPI implementations diverge from traditional MPI implementations in that they map MPI processes to user-level threads instead of operating-system processes and launch more of them than there are CPU cores in the system. They are capable of providing automatic computation–communication overlap and load balance with little to no changes to pre-existing MPI user code. Our work has uncovered new insights into MPI virtualization: Two new kinds of timers are needed: an MPI-process timer and a CPU-core timer, the same discussion also applies to performance counters and the MPI profiling interface. We also observe an interplay between the degree of CPU oversubscription and the rendezvous communication protocol: we find that the intuitive expectation of only two MPI processes per CPU core being enough to achieve full computation–communication overlap is wrong for the rendezvous protocol—instead, three MPI processes per CPU core are required in that case. 
Our findings are expected to be applicable to all virtualized MPI implementations as well as to general tasking runtime systems.\",\"PeriodicalId\":176723,\"journal\":{\"name\":\"Proceedings of the 27th European MPI Users' Group Meeting\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 27th European MPI Users' Group Meeting\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3416315.3416317\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th European MPI Users' Group Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3416315.3416317","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Computation–communication overlap and good load balance are features central to high performance of parallel programs. Unfortunately, achieving them with MPI requires considerably increasing the complexity of user code. Our work contributes to an alternative solution to this problem: using a virtualized MPI implementation. Virtualized MPI implementations diverge from traditional MPI implementations in that they map MPI processes to user-level threads instead of operating-system processes and launch more of them than there are CPU cores in the system. They are capable of providing automatic computation–communication overlap and load balance with little to no changes to pre-existing MPI user code. Our work has uncovered new insights into MPI virtualization: two new kinds of timers are needed, an MPI-process timer and a CPU-core timer; the same discussion also applies to performance counters and the MPI profiling interface. We also observe an interplay between the degree of CPU oversubscription and the rendezvous communication protocol: the intuitive expectation that two MPI processes per CPU core are enough to achieve full computation–communication overlap turns out to be wrong for the rendezvous protocol; instead, three MPI processes per CPU core are required in that case. Our findings are expected to be applicable to all virtualized MPI implementations as well as to general tasking runtime systems.
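The distinction between the two timer kinds can be made concrete with a small sketch. This is a minimal illustration under stated assumptions, not the paper's implementation; the names now, vrank_clock, vrank_resume, vrank_suspend, mpi_process_time, and cpu_core_time are hypothetical and belong to no real MPI library.

```c
/* Hypothetical sketch: how a virtualized MPI runtime might keep the two timer
 * kinds the abstract calls for, when several "virtual ranks" (user-level
 * threads) are multiplexed onto one CPU core. */
#include <time.h>

/* Wall-clock time, in seconds. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Book-keeping attached to each virtual MPI process (names are hypothetical). */
typedef struct {
    double accumulated;  /* total time this virtual rank has actually run */
    double resumed_at;   /* wall time when the scheduler last resumed it */
    int    running;      /* 1 while the rank currently occupies the core */
} vrank_clock;

/* The user-level scheduler would call these around every context switch. */
static void vrank_resume(vrank_clock *c)  { c->resumed_at = now(); c->running = 1; }
static void vrank_suspend(vrank_clock *c) { c->accumulated += now() - c->resumed_at; c->running = 0; }

/* "MPI-process timer": elapsed time of one virtual rank, excluding the
 * intervals during which other virtual ranks occupied the core. */
static double mpi_process_time(const vrank_clock *c)
{
    return c->accumulated + (c->running ? now() - c->resumed_at : 0.0);
}

/* "CPU-core timer": plain wall-clock time of the core since some epoch,
 * shared by all virtual ranks multiplexed onto it. */
static double cpu_core_time(double core_epoch)
{
    return now() - core_epoch;
}
```

Under such a scheme, timing a compute kernel with the MPI-process timer reports only the time the measuring rank actually ran, whereas the CPU-core timer also includes the time the core spent on the other oversubscribed ranks; as the abstract notes, the same split would apply to performance counters and to measurements taken through the MPI profiling interface.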