Efficient Process Arrival Pattern Aware Collective Communication for Deep Learning
Pedram Alizadeh, A. Sojoodi, Yiltan Hassan Temuçin, A. Afsahi
Proceedings of the 29th European MPI Users' Group Meeting, September 2022. DOI: 10.1145/3555819.3555857
MPI collective communication operations are used extensively in parallel applications, so researchers have long investigated how to improve their performance and scalability, which directly impacts application performance. Unfortunately, most of these studies assume that all processes arrive at the collective call simultaneously. A few studies, however, have shown that an imbalanced Process Arrival Pattern (PAP) is ubiquitous in real environments and significantly degrades collective performance. Devising PAP-aware collective algorithms that improve performance, while challenging, is therefore highly desirable. This paper pursues that goal in the context of Deep Learning (DL) workloads, which have become mainstream. It first presents a brief characterization of collective communications, in particular MPI_Allreduce, in the Horovod distributed deep learning framework and shows that the arrival pattern of MPI processes is indeed imbalanced. It then proposes an intra-node shared-memory PAP-aware MPI_Allreduce algorithm for small to medium messages, in which the leader process is chosen dynamically, based on process arrival times, at each invocation of the collective call. For large messages, we propose an intra-node PAP-aware algorithm that dynamically constructs the reduction schedule at each MPI_Allreduce invocation. Finally, we extend our intra-node designs into a PAP-aware, cluster-wide hierarchical algorithm that, owing to its hierarchical nature, imposes less data dependency among processes than flat algorithms. The proposed algorithms deliver up to 58% improvement in micro-benchmarks and up to 17% improvement in Horovod with TensorFlow over the native algorithms.
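To make the dynamic-leader idea concrete, the following is a minimal sketch, not the authors' implementation, of how arrival-time-based leader election could be done inside a node: the first process to reach the collective atomically takes "ticket 0" from a shared counter and acts as leader for that invocation. The use of MPI_Win_allocate_shared and MPI_Fetch_and_op here is an assumed mechanism chosen for illustration; the paper does not specify these primitives, and the actual reduction step is elided.

```c
/*
 * Illustrative sketch (assumed mechanism, not the paper's code) of
 * PAP-aware leader election for an intra-node collective: the first
 * arriver at each invocation becomes the leader.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Restrict to ranks sharing a node, as an intra-node design requires. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int rank;
    MPI_Comm_rank(node_comm, &rank);

    /* One shared "arrival ticket" counter, hosted by node rank 0. */
    MPI_Win win;
    int *counter;
    MPI_Aint size = (rank == 0) ? sizeof(int) : 0;
    MPI_Win_allocate_shared(size, sizeof(int), MPI_INFO_NULL,
                            node_comm, &counter, &win);
    if (rank == 0) *counter = 0;
    MPI_Barrier(node_comm);  /* counter initialized before any ticket is taken */

    /* On arrival at the collective, atomically take a ticket; ticket 0
     * means "arrived first", so this rank leads the current invocation. */
    int one = 1, ticket;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Fetch_and_op(&one, &ticket, MPI_INT, 0, 0, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    int is_leader = (ticket == 0);
    printf("rank %d arrived with ticket %d%s\n", rank, ticket,
           is_leader ? " -> leader for this invocation" : "");

    /* A full PAP-aware MPI_Allreduce would now have the leader reduce
     * from shared buffers and publish the result; that part is elided. */

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Note that a complete implementation would also reset the counter between invocations (e.g., double-buffered counters keyed by a call sequence number) so that each MPI_Allreduce call elects its leader independently, which is what makes the scheme adaptive to per-invocation arrival patterns.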