Optimizing Non-commutative Allreduce Over Virtualized, Migratable MPI Ranks

Sam White, L. Kalé
Published in: 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2022-05-01
DOI: 10.1109/IPDPSW55747.2022.00085 (https://doi.org/10.1109/IPDPSW55747.2022.00085)

Abstract

Dynamic load balancing can be difficult for MPI-based applications. Application logic and algorithms are often rewritten to enable dynamic repartitioning of the domain. An alternative approach is to virtualize the MPI ranks as threads, instead of operating system processes, and to migrate threads around the system to balance the computational load. Adaptive MPI is one such implementation. It supports virtualization of MPI ranks as migratable user-level threads. However, this migratability itself can introduce new performance overheads to applications. In this paper, we identify non-commutative reduction operations as problematic for any runtime supporting either user-defined initial mapping of ranks or dynamic migration of ranks among the cores or nodes of a machine. We investigate the challenges associated with supporting efficient non-commutative reduction operations, and explore algorithmic alternatives such as recursive doubling and halving in combination with a novel adaptive message combining technique. We explore tradeoffs in the different algorithms for various message sizes and mappings of ranks to cores, demonstrating our performance improvements using microbenchmarks.
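
To illustrate why operand order matters for a non-commutative reduction, the following is a minimal sketch that simulates the textbook recursive doubling exchange pattern in plain Python, without MPI. It is not the paper's implementation and does not model the adaptive message combining technique. The function name, the `respect_rank_order` flag, and the use of string concatenation as the non-commutative operator are all assumptions made for illustration, and the sketch assumes a power-of-two number of ranks.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def recursive_doubling_allreduce(
    values: List[T],
    op: Callable[[T, T], T],
    respect_rank_order: bool = True,
) -> List[T]:
    """Simulate recursive doubling over len(values) ranks.

    values[r] is the contribution of rank r. The returned list holds the
    final value seen by each rank. Hypothetical helper for illustration.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")

    current = list(values)
    distance = 1
    while distance < p:
        nxt = []
        for rank in range(p):
            partner = rank ^ distance  # exchange partner at this step
            if respect_rank_order and partner < rank:
                # Partner holds the block of lower ranks, so its partial
                # result must be the left operand.
                nxt.append(op(current[partner], current[rank]))
            else:
                # Own partial result is the left operand. Without the
                # ordering check this is wrong whenever partner < rank.
                nxt.append(op(current[rank], current[partner]))
        current = nxt
        distance *= 2
    return current


def concat(a: str, b: str) -> str:
    """String concatenation: associative but not commutative."""
    return a + b


ordered = recursive_doubling_allreduce(["a", "b", "c", "d"], concat)
unordered = recursive_doubling_allreduce(
    ["a", "b", "c", "d"], concat, respect_rank_order=False
)
print(ordered)
print(unordered)
```

With the ordering check, every simulated rank ends with the same rank-ordered result. Without it, ranks disagree, which is the correctness constraint that ties a non-commutative reduction to rank numbering rather than to wherever a rank happens to be placed.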