Efficient Process Arrival Pattern Aware Collective Communication for Deep Learning
Pedram Alizadeh, A. Sojoodi, Yiltan Hassan Temuçin, A. Afsahi
Proceedings of the 29th European MPI Users' Group Meeting, September 2022. DOI: 10.1145/3555819.3555857
MPI collective communication operations are used extensively in parallel applications, so researchers have long investigated how to improve their performance and scalability, which directly impacts application performance. Unfortunately, most of these studies assume that all processes arrive at the collective call simultaneously. A few studies, however, have shown that an imbalanced Process Arrival Pattern (PAP) is ubiquitous in real environments and significantly degrades collective performance. Devising PAP-aware collective algorithms that improve performance, while challenging, is therefore highly desirable. This paper pursues that goal in the context of Deep Learning (DL) workloads, which have become mainstream. It first presents a brief characterization of collective communications, in particular MPI_Allreduce, in the Horovod distributed deep learning framework and shows that the arrival pattern of MPI processes is indeed imbalanced. It then proposes an intra-node shared-memory PAP-aware MPI_Allreduce algorithm for small to medium messages, in which the leader process is chosen dynamically, based on process arrival times, at each invocation of the collective call. For large messages, we propose an intra-node PAP-aware algorithm that dynamically constructs the reduction schedule at each MPI_Allreduce invocation. Finally, we extend our intra-node designs into a PAP-aware, cluster-wide hierarchical algorithm that, owing to its hierarchical nature, imposes less data dependency among processes than flat algorithms. The proposed algorithms deliver up to 58% improvement in micro-benchmarks and up to 17% improvement in Horovod with TensorFlow over the native algorithms.
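To make the dynamic-leader idea concrete, the following is a minimal sketch, not the authors' implementation, of how arrival-time-based leader election could be done inside a node: the first process to reach the collective atomically takes "ticket 0" from a shared counter and acts as leader for that invocation. The use of MPI_Win_allocate_shared and MPI_Fetch_and_op here is an assumed mechanism chosen for illustration; the paper does not specify these primitives, and the actual reduction step is elided.

```c
/*
 * Illustrative sketch (assumed mechanism, not the paper's code) of
 * PAP-aware leader election for an intra-node collective: the first
 * arriver at each invocation becomes the leader.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Restrict to ranks sharing a node, as an intra-node design requires. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int rank;
    MPI_Comm_rank(node_comm, &rank);

    /* One shared "arrival ticket" counter, hosted by node rank 0. */
    MPI_Win win;
    int *counter;
    MPI_Aint size = (rank == 0) ? sizeof(int) : 0;
    MPI_Win_allocate_shared(size, sizeof(int), MPI_INFO_NULL,
                            node_comm, &counter, &win);
    if (rank == 0) *counter = 0;
    MPI_Barrier(node_comm);  /* counter initialized before any ticket is taken */

    /* On arrival at the collective, atomically take a ticket; ticket 0
     * means "arrived first", so this rank leads the current invocation. */
    int one = 1, ticket;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Fetch_and_op(&one, &ticket, MPI_INT, 0, 0, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    int is_leader = (ticket == 0);
    printf("rank %d arrived with ticket %d%s\n", rank, ticket,
           is_leader ? " -> leader for this invocation" : "");

    /* A full PAP-aware MPI_Allreduce would now have the leader reduce
     * from shared buffers and publish the result; that part is elided. */

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Note that a complete implementation would also reset the counter between invocations (e.g., double-buffered counters keyed by a call sequence number) so that each MPI_Allreduce call elects its leader independently, which is what makes the scheme adaptive to per-invocation arrival patterns.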