Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms

S. Alam, R. Barrett, J. Kuehn, S. Poole
{"title":"Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms","authors":"S. Alam, R. Barrett, J. Kuehn, Steve Poole","doi":"10.1109/ICPP.2009.51","DOIUrl":null,"url":null,"abstract":"The building blocks of emerging Petascale massively parallel processing (MPP) systems are multi-core processors with four or more cores as a single processing element and a customized network interface. The resulting memory and communication hierarchy of these platforms are now exposed to application developers and end users by creating a hierarchical or multi-core aware message-passing (MPI) programming interface and by providing a handful of runtime, tunable parameters that allows mapping and control of MPI tasks and message handling. We characterize performance of MPI communication patterns and present strategies for optimizing applications performance on Cray XT series systems that are composed of contemporary AMD processors and a proprietary network infrastructure. We highlight dependencies in its memory and network subsystems, which could influence production-level applications performance. We demonstrate that MPI micro-benchmarks could mislead an application developer or end user since these benchmarks often do not expose the interplay between memory allocation and usage in the user space, which depends on the number of tasks or cores and workload characteristics. Our studies show performance improvements compared to the default options for our target scientific benchmarks and production-level applications.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"61 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2009.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The building blocks of emerging petascale massively parallel processing (MPP) systems are multi-core processors, with four or more cores forming a single processing element, coupled with a customized network interface. The resulting memory and communication hierarchy of these platforms is now exposed to application developers and end users through a hierarchical, multi-core-aware message-passing (MPI) programming interface and through a handful of runtime tunable parameters that allow mapping and control of MPI tasks and message handling. We characterize the performance of MPI communication patterns and present strategies for optimizing application performance on Cray XT series systems, which are composed of contemporary AMD processors and a proprietary network infrastructure. We highlight dependencies in the memory and network subsystems that could influence the performance of production-level applications. We demonstrate that MPI micro-benchmarks can mislead an application developer or end user, since these benchmarks often do not expose the interplay between memory allocation and usage in user space, which depends on the number of tasks or cores and on workload characteristics. Our studies show performance improvements over the default options for our target scientific benchmarks and production-level applications.
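To illustrate the kind of point-to-point micro-benchmark the abstract refers to, the sketch below measures ping-pong latency and bandwidth between two MPI ranks. This is a generic example, not code from the paper; the message sizes, repetition count, and output format are arbitrary choices made for illustration. On a multi-core Cray XT node, whether the two ranks share a node (and thus a shared-memory path) or communicate over the network interface changes the measured numbers substantially, which is the interplay between task placement, memory hierarchy, and network path that the authors argue a single micro-benchmark figure can hide.

```c
/*
 * Minimal MPI ping-pong micro-benchmark sketch (illustrative only).
 * Measures round-trip time between rank 0 and rank 1 for a range of
 * message sizes and reports one-way latency and effective bandwidth.
 * Results depend strongly on where the two ranks are placed
 * (same node/socket vs. across the network).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BYTES (1 << 20)   /* largest message: 1 MiB */
#define REPS      100         /* iterations per message size */

int main(int argc, char **argv)
{
    int rank, size;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Run with at least 2 ranks.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = malloc(MAX_BYTES);
    memset(buf, 0, MAX_BYTES);

    for (int bytes = 1; bytes <= MAX_BYTES; bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0) {
            double latency_us = (t1 - t0) / (2.0 * REPS) * 1e6; /* one-way */
            double bw_mbs = bytes / (latency_us * 1e-6) / 1e6;
            printf("%8d bytes  %10.2f us  %10.2f MB/s\n",
                   bytes, latency_us, bw_mbs);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Running the same binary twice, once with both ranks mapped to one node and once with the ranks on separate nodes, is a simple way to see the placement sensitivity discussed in the paper; the mapping itself is controlled through the platform's launcher and runtime tunables rather than the benchmark code.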