Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms

S. Alam, R. Barrett, J. Kuehn, S. Poole
{"title":"Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms","authors":"S. Alam, R. Barrett, J. Kuehn, Steve Poole","doi":"10.1109/ICPP.2009.51","DOIUrl":null,"url":null,"abstract":"The building blocks of emerging Petascale massively parallel processing (MPP) systems are multi-core processors with four or more cores as a single processing element and a customized network interface. The resulting memory and communication hierarchy of these platforms are now exposed to application developers and end users by creating a hierarchical or multi-core aware message-passing (MPI) programming interface and by providing a handful of runtime, tunable parameters that allows mapping and control of MPI tasks and message handling. We characterize performance of MPI communication patterns and present strategies for optimizing applications performance on Cray XT series systems that are composed of contemporary AMD processors and a proprietary network infrastructure. We highlight dependencies in its memory and network subsystems, which could influence production-level applications performance. We demonstrate that MPI micro-benchmarks could mislead an application developer or end user since these benchmarks often do not expose the interplay between memory allocation and usage in the user space, which depends on the number of tasks or cores and workload characteristics. Our studies show performance improvements compared to the default options for our target scientific benchmarks and production-level applications.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"61 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2009.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The building blocks of emerging petascale massively parallel processing (MPP) systems are multi-core processors, with four or more cores forming a single processing element, coupled with a customized network interface. The resulting memory and communication hierarchy of these platforms is now exposed to application developers and end users through a hierarchical, multi-core-aware message-passing (MPI) programming interface and through a handful of runtime tunable parameters that allow mapping and control of MPI tasks and message handling. We characterize the performance of MPI communication patterns and present strategies for optimizing application performance on Cray XT series systems, which are composed of contemporary AMD processors and a proprietary network infrastructure. We highlight dependencies in the memory and network subsystems that could influence the performance of production-level applications. We demonstrate that MPI micro-benchmarks can mislead an application developer or end user, since these benchmarks often do not expose the interplay between memory allocation and usage in user space, which depends on the number of tasks or cores and on workload characteristics. Our studies show performance improvements over the default options for our target scientific benchmarks and production-level applications.
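To illustrate the kind of point-to-point micro-benchmark the abstract refers to, the sketch below measures ping-pong latency and bandwidth between two MPI ranks. This is a generic example, not code from the paper; the message sizes, repetition count, and output format are arbitrary choices made for illustration. On a multi-core Cray XT node, whether the two ranks share a node (and thus a shared-memory path) or communicate over the network interface changes the measured numbers substantially, which is the interplay between task placement, memory hierarchy, and network path that the authors argue a single micro-benchmark figure can hide.

```c
/*
 * Minimal MPI ping-pong micro-benchmark sketch (illustrative only).
 * Measures round-trip time between rank 0 and rank 1 for a range of
 * message sizes and reports one-way latency and effective bandwidth.
 * Results depend strongly on where the two ranks are placed
 * (same node/socket vs. across the network).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BYTES (1 << 20)   /* largest message: 1 MiB */
#define REPS      100         /* iterations per message size */

int main(int argc, char **argv)
{
    int rank, size;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Run with at least 2 ranks.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = malloc(MAX_BYTES);
    memset(buf, 0, MAX_BYTES);

    for (int bytes = 1; bytes <= MAX_BYTES; bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0) {
            double latency_us = (t1 - t0) / (2.0 * REPS) * 1e6; /* one-way */
            double bw_mbs = bytes / (latency_us * 1e-6) / 1e6;
            printf("%8d bytes  %10.2f us  %10.2f MB/s\n",
                   bytes, latency_us, bw_mbs);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Running the same binary twice, once with both ranks mapped to one node and once with the ranks on separate nodes, is a simple way to see the placement sensitivity discussed in the paper; the mapping itself is controlled through the platform's launcher and runtime tunables rather than the benchmark code.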