Huansong Fu, S. Pophale, Manjunath Gorentla Venkata, Weikuan Yu
DISP: Optimizations towards Scalable MPI Startup
2016 First International Workshop on Communication Optimizations in HPC (COMHPC)
DOI: 10.1109/COM-HPC.2016.11
Published: 2016-11-13
Citations: 1
Abstract
Despite the popularity of MPI for high performance computing, the startup of MPI programs faces a scalability challenge as both the execution time and memory consumption increase drastically at scale. We have examined this problem using the collective modules of Cheetah and Tuned in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware off-load. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques, namely Delayed Initialization, Module Sharing and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module. At the same time, it helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on Titan supercomputer at ORNL with up to 4096 processes. The results show that our delayed initialization can speed up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively, our module sharing can reduce the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively, and our prediction-based topology setup can speed up the startup of Cheetah by up to 80%.