MVAPICH-PRISM: A proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters

S. Potluri, Devendar Bureddy, Khaled Hamidouche, Akshay Venkatesh, K. Kandalla, H. Subramoni, D. Panda
{"title":"MVAPICH-PRISM: A proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters","authors":"S. Potluri, Devendar Bureddy, Khaled Hamidouche, Akshay Venkatesh, K. Kandalla, H. Subramoni, D. Panda","doi":"10.1145/2503210.2503288","DOIUrl":null,"url":null,"abstract":"Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1TFLOPs of performance on a single chip while providing x86_64 compatibility. On the other hand, InfiniBand is one of the most popular choices of interconnect for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an InfiniBand HCA on the node and thus, provides a low latency path for internode communication. However, drawbacks in the state-of-the-art chipsets like Sandy Bridge limit the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize the communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. Our designs improve the performance of MPI_Alltoall operation by up to 65%, with 256 processes. They improve the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2503210.2503288","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 38

Abstract

Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOP/s of performance on a single chip while providing x86_64 compatibility. InfiniBand, meanwhile, is one of the most popular interconnects for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an InfiniBand HCA on the node and thus provides a low-latency path for internode communication. However, limitations in state-of-the-art chipsets like Sandy Bridge restrict the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. They improve the performance of the MPI_Alltoall operation by up to 65% with 256 processes, and they improve the performance of a 3D stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.
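The proxy idea in the abstract can be illustrated with a short, hypothetical sketch of the Phi-side hop: a process on the Xeon Phi pushes an outgoing message over SCIF (Intel's Symmetric Communications Interface, shipped with MPSS) to a proxy on the host, which would then forward it over the InfiniBand HCA. This is not the authors' implementation; `PROXY_PORT` and the payload are made up, the host-side proxy and its InfiniBand verbs forwarding are omitted, and building it requires the MPSS SCIF stack (`scif.h`, link with `-lscif`).

```c
/*
 * Hypothetical sketch of the Phi-side of a proxy hop: a Xeon Phi
 * process sends a message to a host-resident proxy over SCIF; the
 * proxy (not shown) would forward it over the InfiniBand HCA.
 * Assumes the Intel MPSS SCIF stack; compile on the Phi, link -lscif.
 */
#include <scif.h>
#include <stdio.h>

#define PROXY_PORT 2050   /* hypothetical port the host proxy listens on */

int main(void)
{
    scif_epd_t epd = scif_open();
    if (epd == SCIF_OPEN_FAILED) { perror("scif_open"); return 1; }

    /* SCIF node 0 is always the host; connect to the proxy there.
     * scif_connect() implicitly binds the endpoint to a local port. */
    struct scif_portID dst = { .node = 0, .port = PROXY_PORT };
    if (scif_connect(epd, &dst) < 0) { perror("scif_connect"); return 1; }

    char msg[] = "payload destined for a remote node";
    /* blocking send across PCIe to the host-side proxy */
    if (scif_send(epd, msg, sizeof(msg), SCIF_SEND_BLOCK) < 0) {
        perror("scif_send");
        return 1;
    }
    scif_close(epd);
    return 0;
}
```

Staging through a host-side proxy in this way lets the HCA source data from host memory rather than reading directly from Xeon Phi memory across the chipset, which is the transfer path whose bandwidth the abstract identifies as limited.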