DISP: Optimizations towards Scalable MPI Startup

Huansong Fu, S. Pophale, Manjunath Gorentla Venkata, Weikuan Yu
Published in: 2016 First International Workshop on Communication Optimizations in HPC (COMHPC), 2016-11-13
DOI: 10.1109/COM-HPC.2016.11 (https://doi.org/10.1109/COM-HPC.2016.11)
Citations: 1

Abstract

Despite the popularity of MPI for high performance computing, the startup of MPI programs faces a scalability challenge, as both execution time and memory consumption increase drastically at scale. We have examined this problem using the Cheetah and Tuned collective modules in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware offload. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques, namely Delayed Initialization, Module Sharing and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module. At the same time, it helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on the Titan supercomputer at ORNL with up to 4096 processes. The results show that delayed initialization speeds up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively. Module sharing reduces the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively. Prediction-based topology setup speeds up the startup of Cheetah by up to 80%.