Many of the proposed algorithms for allocating processors to jobs in supercomputers choose arbitrarily among potential allocations that are "equally good" according to the allocation algorithm. In this paper, we add a parametrized tie-breaking strategy to the MC1x1 allocation algorithm for mesh supercomputers. This strategy attempts to favor allocations that preserve large regions of free processors, benefiting future allocations and improving machine performance. Trace-based simulations show the promise of our strategy; with good parameter choices, most jobs benefit and no class of jobs is harmed significantly.
"A Tie-Breaking Strategy for Processor Allocation in Meshes." Christopher R. Johnson, David P. Bunde, V. Leung. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.50
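The tie-breaking idea can be sketched in a few lines: among candidate allocations the allocator considers equally good, prefer the one that leaves the largest connected region of free processors. This is an illustrative sketch, not the actual MC1x1 code; the function names and the single-region scoring rule are assumptions.

```python
# Hypothetical sketch: score each candidate allocation by the size of the
# largest 4-connected free region it would leave behind, and pick the best.

def largest_free_region(grid):
    """Size of the largest 4-connected region of free (True) cells."""
    rows, cols = len(grid), len(grid[0])
    seen, best = set(), 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and (r, c) not in seen:
                stack, size = [(r, c)], 0
                seen.add((r, c))
                while stack:
                    cr, cc = stack.pop()
                    size += 1
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = cr + dr, cc + dc
                        if 0 <= nr < rows and 0 <= nc < cols \
                           and grid[nr][nc] and (nr, nc) not in seen:
                            seen.add((nr, nc))
                            stack.append((nr, nc))
                best = max(best, size)
    return best

def break_tie(grid, candidates):
    """Pick the candidate allocation (a set of (row, col) cells) that
    preserves the largest connected free region after placement."""
    def score(cells):
        after = [[grid[r][c] and (r, c) not in cells
                  for c in range(len(grid[0]))] for r in range(len(grid))]
        return largest_free_region(after)
    return max(candidates, key=score)
```

On a free 1x4 row, for example, allocating the corner cell leaves a free region of 3, while allocating an interior cell splits the row into regions of size 1 and 2, so the corner wins the tie.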
Emerging high performance computing (HPC) applications must cope with scalability across an increasing number of nodes as well as the programming of special accelerator hardware. The hybrid composition of large computing systems adds a new dimension of complexity to software development. This paper presents a novel approach to gain insight into accelerator interaction and utilization without any changes to the application. It extends well-established performance analysis methods to accelerator hardware, allowing a holistic view of the performance bottlenecks of hybrid applications. A general strategy is presented for obtaining dynamic runtime information about hybrid program execution with minimal impact on the program flow. The achievable level of detail is studied using the CUDA environment and the OpenCL framework as examples. Combined with existing performance analysis techniques, this facilitates exploiting the full potential of hybrid computing power.
"Non-intrusive Performance Analysis of Parallel Hardware Accelerated Applications on Hybrid Architectures." R. Dietrich, T. Ilsche, G. Juckeland. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.30
The current trend towards multi-core/many-core and accelerated architectures presents challenges in both portability and in the choices developers must make about how to use the resources these architectures provide. This paper explores some of the possibilities enabled by the Open Computing Language (OpenCL), and proposes a programming model that allows developers and scientists to more fully subscribe hybrid compute nodes while, at the same time, reducing the impact of system failure.
"A Hybrid Programming Model for Compressible Gas Dynamics Using OpenCL." B. Bergen, Marcus G. Daniels, Paul M. Weber. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.60
Irregular scientific applications are difficult to parallelize in an efficient and scalable fashion due to indirect memory references (e.g., A[B[i]]), irregular communication patterns, and load balancing issues. In this paper, we present our experience parallelizing an irregular scientific application written in Java. The application is an N-body molecular dynamics simulation that is the main component of a Java application called the Molecular Workbench (MW). We parallelized MW to run on multicore hardware using Java's java.util.concurrent library. Speedup was found to vary greatly depending on which type of force computation dominated the simulation. To understand the cause of this appreciable difference in scalability, various performance analysis tools were deployed, including Intel's VTune, Apple's Shark, the Java Application Monitor (JaMON), and Sun's VisualVM. Virtual machine instrumentation as well as hardware performance monitors were used. To our knowledge this is the first such performance analysis of an irregular scientific application parallelized using Java threads. In the course of this investigation, a number of challenges were encountered, which generally stemmed from a mismatch between the nature of our application and either Java itself or the performance tools we used. This paper aims to share our real-world experience with Java threading and today's parallel performance tools in an effort to influence future directions for the Java virtual machine, for the Java concurrency library, and for tools for multicore parallel software development.
"Performance Evaluation of an Irregular Application Parallelized in Java." Christopher D. Krieger, M. Strout. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.40
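The indirect-update pattern that makes such codes hard to parallelize can be illustrated with a toy analogue (in Python here, not the paper's Java code): several workers accumulate contributions into A[B[i]], so two iterations may target the same element. One standard remedy, sketched below under that assumption, is to privatize a partial array per worker and reduce afterwards, avoiding locks on the shared array.

```python
# Toy analogue of the irregular A[B[i]] accumulation pattern: each worker
# writes into a private partial array, then the partials are reduced into
# the shared array A, so no two workers ever race on the same element.
from concurrent.futures import ThreadPoolExecutor

def accumulate(A, B, contrib, workers=4):
    n = len(A)
    # Deal iterations round-robin to workers; any partition works.
    chunks = [range(start, len(B), workers) for start in range(workers)]

    def worker(idx):
        partial = [0.0] * n          # private copy: no races on A
        for i in idx:
            partial[B[i]] += contrib[i]
        return partial

    with ThreadPoolExecutor(workers) as pool:
        for partial in pool.map(worker, chunks):
            for j in range(n):       # sequential reduction into A
                A[j] += partial[j]
    return A
```

The trade-off is the O(workers * n) reduction and the extra memory for partials, which is why speedup for such codes depends heavily on how dense and contended the indirect updates are.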
Large-scale Service Oriented Architecture (SOA) developments are becoming increasingly reliant on registry services that manage Web Services using taxonomic attributes. At present a registry stores a Web Service's interface definition and protocol bindings in WSDL, along with one or more XML schema files that define the structure of the SOAP messages exchanged between Web Service operations and client processes, and other static metadata. During Web Service discovery an ebXML registry returns the access URI associated with the service binding to allow dynamic discovery and invocation. This usually restricts a calling process to a Web Service invocation on one host. This work explores a mechanism to manage service bindings for a Web Service that has been deployed across multiple hosts, such that a URI returned by a registry can resolve to a host that satisfies system constraints such as current CPU load, physical memory, swap memory, and time of day. This paper discusses the design and development of a new scheme for ebXML registries that facilitates periodic collection and management of dynamic system properties for registry clients and enforces constraints during service discovery and query operations.
"A Load Balancing Scheme for ebXML Registries." Sadhana Sahasrabudhe, C. Paolini. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.12
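The constraint-based resolution step can be sketched as follows. This is an illustrative model, not the ebXML registry API: the function name, the dict fields, and the thresholds are all assumptions standing in for the periodically collected system properties the abstract describes.

```python
# Hypothetical sketch: resolve a logical service to the deployment host
# that satisfies the registry's dynamic constraints (CPU load, free
# memory), preferring the least-loaded eligible host.
def resolve_binding(hosts, max_load=0.75, min_free_mem_mb=512):
    """hosts: list of dicts with 'uri', 'cpu_load', 'free_mem_mb',
    refreshed periodically from each deployment host."""
    eligible = [h for h in hosts
                if h["cpu_load"] <= max_load
                and h["free_mem_mb"] >= min_free_mem_mb]
    if not eligible:
        raise LookupError("no host satisfies the constraints")
    # Among eligible hosts, return the URI of the least loaded one.
    return min(eligible, key=lambda h: h["cpu_load"])["uri"]
```

A time-of-day constraint, which the paper also mentions, would simply be one more predicate in the eligibility filter.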
P. Andrews, P. Kovatch, Victor Hazlewood, Troy Baer
In late 2009, the National Institute for Computational Sciences placed in production the world's fastest academic supercomputer (third overall), a Cray XT5 named Kraken, with almost 100,000 compute cores and a peak speed in excess of one Petaflop. Delivering over 50% of the total cycles available to the National Science Foundation users via the TeraGrid, Kraken has two missions that have historically proven difficult to simultaneously reconcile: providing the maximum number of total cycles to the community, while enabling full machine runs for "hero" users. Historically, this has been attempted by allowing schedulers to choose the correct time for the beginning of large jobs, with a concomitant reduction in utilization. At NICS, we used the results of a previous theoretical investigation to adopt a different approach, where the "clearing out" of the system is forced on a weekly basis, followed by consecutive full machine runs. As our previous simulation results suggested, this led to a significant improvement in utilization, to over 90%. The difference in utilization between the traditional and adopted scheduling policies was the equivalent of a 300+ Teraflop supercomputer, or several million dollars of compute time per year.
"Scheduling a 100,000 Core Supercomputer for Maximum Utilization and Capability." P. Andrews, P. Kovatch, Victor Hazlewood, Troy Baer. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.63
David Castells-Rufas, Jaume Joven, Sergi Risueño, Eduard Fernandez-Alonso, J. Carrabina, T. William, H. Mix
There is some consensus that the embedded and HPC domains have to create synergies to face the challenges of creating, maintaining and optimizing software for future many-core platforms. In this work we show how some HPC performance analysis methods can be successfully adapted to the embedded domain. We propose using virtual prototypes based on instruction set simulators (ISS) to produce trace files by transparent instrumentation that can be used for post-mortem performance analysis. Transparent instrumentation on an ISS kills two birds with one stone: it adds no overhead for trace generation and it solves the problem of trace storage. A virtual prototype is built to generate OTF traces that are later analyzed with Vampir. We show how performance analysis of the virtual prototype is valuable for optimizing a parallel embedded test application, allowing an acceptable speedup factor on 4 processors to be obtained.
"MPSoC Performance Analysis with Virtual Prototyping Platforms." David Castells-Rufas, Jaume Joven, Sergi Risueño, Eduard Fernandez-Alonso, J. Carrabina, T. William, H. Mix. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.32
In wireless sensor networks, collection of raw sensor data at a base station provides the flexibility to perform detailed offline analysis on the data, which may not be possible with in-network data aggregation. However, lossless data collection consumes a considerable amount of energy for communication, while sensors usually have limited energy. In this paper, we propose a Distributed and Energy-efficient algorithm for Collection Of Raw data in sensor networks called DECOR. DECOR exploits spatial correlation to reduce the communication energy in sensor networks with highly correlated data. In our approach, at each neighborhood, one sensor shares its raw data as a reference with the rest of the sensors without any suppression or compression. The other sensors use this reference data to compress their observations by representing them in the form of mutual differences. In a highly correlated network, transmission of reference data consumes significantly more energy than transmission of compressed data. Thus, we first attempt to minimize the number of reference transmissions, and then we try to minimize the size of the mutual differences. We derive analytical lower bounds for both phases and, based on our theoretical results, propose a two-step distributed data collection algorithm that reduces the communication energy significantly compared to existing methods. In addition, we modify our algorithm for lossy communication channels and evaluate its performance through simulation.
"A Distributed and Energy Efficient Algorithm for Data Collection in Sensor Networks." Sarah Sharafkandi, D. Du, Alireza Razavi. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.84
Data prefetching is an effective way to accelerate data access in high-end computing systems and to bridge the increasing performance gap between processor and memory. In recent years, context-based data prefetching has received intensive attention because of its general applicability. In this study, we provide a preliminary analysis of the impact of context order on the effectiveness of context-based prefetching. Motivated by observations from the analytical results, we propose a new context-based prefetching method named Multi-Order Context-based (MOC) prefetching, which adopts multi-order context analysis to increase prefetching effectiveness. We carried out simulation testing with the SPEC CPU2006 benchmarks via an enhanced CMP$im simulator. The simulation results show that the proposed MOC prefetching method outperforms existing single-order prefetching and reduces data-access latency effectively.
"Improving the Effectiveness of Context-Based Prefetching with Multi-order Analysis." Yong Chen, Huaiyu Zhu, Hui Jin, Xian-He Sun. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.64
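A single-order context predictor, the baseline MOC builds on, can be modeled in a few lines. This is a toy model, not the paper's implementation: an order-k predictor keys a table on the last k addresses and predicts the address that followed that context before; a multi-order scheme would consult several such predictors of different k and arbitrate among them.

```python
# Toy model of a single-order context-based prefetcher: learn which
# address followed each length-k context, and predict from the current
# context on every access.
class ContextPrefetcher:
    def __init__(self, order):
        self.order = order
        self.table = {}        # context tuple -> address that followed it
        self.history = []

    def access(self, addr):
        """Record an access; return a prefetch prediction, or None."""
        ctx = tuple(self.history[-self.order:])
        if len(ctx) == self.order:
            self.table[ctx] = addr      # learn: this context led to addr
        self.history.append(addr)
        new_ctx = tuple(self.history[-self.order:])
        if len(new_ctx) == self.order:
            return self.table.get(new_ctx)
        return None
```

The order trade-off the paper analyzes is visible even here: a larger k makes predictions more specific but leaves the table cold for longer, which motivates combining several orders.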
Miao Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, D. Panda
High End Computing (HEC) systems are being deployed with eight to sixteen compute cores per node, with 64 to 128 cores/node envisioned for exascale systems. MVAPICH2 is a popular implementation of MPI-2 specifically designed and optimized for InfiniBand, iWARP and RDMA over Converged Ethernet (RoCE). MVAPICH2 is based on MPICH2 from ANL. Recently MPICH2 has been redesigned in an effort to optimize intra-node communication for future many-core systems. The new communication layer in MPICH2, called Nemesis, is well optimized for shared memory message passing, with a modular design for various high-performance interconnects. In this paper we explore the challenges involved in designing the next-generation MVAPICH2 stack, leveraging the Nemesis communication layer. We observe that Nemesis does not provide abstractions for one-sided communication. We propose an extended Nemesis interface for optimized one-sided communication and provide design details. Our experimental evaluation shows that the proposed one-sided interface extensions provide significantly better performance than the basic Nemesis interface. For example, inter-node MPI_Put bandwidth increased from 1,800 MB/s to 3,000 MB/s and latency for small messages went down by 13%. Additionally, with our proposed designs, we demonstrate performance gains with small messages when compared to the existing MVAPICH2 CH3 implementation. The designs proposed in this paper are a superset of the options currently available to MVAPICH2 users and provide the best combination of performance and modularity.
"High Performance Design and Implementation of Nemesis Communication Layer for Two-Sided and One-Sided MPI Semantics in MVAPICH2." Miao Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, D. Panda. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.58