Improving communication performance in dense linear algebra via topology aware collectives

Edgar Solomonik, A. Bhatele, J. Demmel
{"title":"Improving communication performance in dense linear algebra via topology aware collectives","authors":"Edgar Solomonik, A. Bhatele, J. Demmel","doi":"10.1145/2063384.2063487","DOIUrl":null,"url":null,"abstract":"Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are sig- nificantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive LogP- based novel performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"57","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2063384.2063487","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 57

Abstract

Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are sig- nificantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive LogP- based novel performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
利用拓扑感知集体改进密集线性代数中的通信性能
最近的研究结果表明,拓扑感知映射减少了大规模并行机器上通信密集型内核中的网络争用。我们证明了在网状互连上,拓扑感知映射也允许利用高效的拓扑感知集合。我们映射了新的2.5D密集线性代数算法来利用由Blue Gene/P超级计算机分配的长方体分区上的矩形集体。我们的映射允许算法利用优化的线路多播和减少。常用的2D算法不能以这种方式进行映射。在Blue Gene/P的16,384个节点(65,536个内核)上,利用矩形集合的2.5D算法明显快于2D矩阵乘法(MM)和LU分解,分别高达8.7倍和2.1倍。这些加速是由于通信减少(相对于2D MM, 2.5D MM高达95.6%)。我们还推导了基于LogP的矩形广播和约简的新性能模型。利用这些,我们在假设的百亿亿级架构上对矩阵乘法和LU分解的性能进行了建模。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Challenges of HPC monitoring Scalable fast multipole methods on distributed heterogeneous architectures Hadoop acceleration through network levitated merge Scalable stochastic optimization of complex energy systems How to measure useful, sustained performance
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1