Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137776
E. Lusk
Summary form only given. In April of 1992, a group of parallel computing vendors, computer science researchers, and application scientists met at a one-day workshop and agreed to cooperate on the development of a community standard for the message-passing model of parallel computing. The MPI Forum that eventually emerged from that workshop became a model of how a broad community could work together to improve an important component of the high performance computing environment. The Message Passing Interface (MPI) definition that resulted from this effort has been widely adopted and implemented, and is now virtually synonymous with the message-passing model itself. MPI not only standardized existing practice in the service of making applications portable in the rapidly changing world of parallel computing, but also consolidated research advances into novel features that extended existing practice and have proven useful in developing a new generation of applications. This talk will discuss some of the procedures and approaches of the MPI Forum that led to MPI's early adoption, and then describe some of the features that have led to its persistence as a reference model for parallel computing. Although clusters were only just emerging as a significant parallel computing production platform as MPI was being defined, MPI has proven to be a useful way of programming them for high performance, and we will discuss the current situation in MPI implementations for clusters. MPI was deliberately designed to grant considerable flexibility to implementors, and thus provides a useful framework for implementation research. Successful implementation techniques within the MPI standard can be utilized immediately by applications already using MPI, thus providing an unusually fast path from research results to their application.
At Argonne National Laboratory we have been developing and distributing MPICH, a portable, high performance implementation of MPI, from the very beginning of the MPI effort. We will describe MPICH-2, a completely new version of MPICH just being released. We will present some of its novel design features that we hope will stimulate both further research and a new generation of complete MPI-2 implementations, along with some early performance results. We will conclude with a speculative look at the future of MPI, including its role in other programming approaches, fault tolerance, and its applicability to advanced architectures.
Title : MPI in 2002: has it been ten years already?
Published in : Proceedings. IEEE International Conference on Cluster Computing
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137751
E. Cecchet
The performance of Distributed Shared Memories (DSM) has always suffered from high network latencies and software communication layers with a large overhead. Memory-mapped networks such as the Scalable Coherent Interface (SCI) allow remote memory to be accessed reliably without involving the operating system. To show how DSM systems can benefit from this technology, we have developed SciFS, a DSM tightly integrated with the operating system that exploits the high performance and the remote memory access capabilities of SCI. We first show the respective advantages of two communication techniques with SCI: programmed IO (PIO) and remote DMA (RDMA). Then, we describe how to build a scalable page transfer mechanism by mixing PIO and RDMA. Despite the lack of a broadcast mechanism in SCI, we demonstrate that it is possible to build scalable synchronization primitives using PIO. Finally, we evaluate various consistency models with scientific computing applications from the Splash benchmark. We observe that, even if the raw network performance is good, it is not sufficient to obtain acceptable results with applications that require fine-grained parallelism. However, we show that memory-mapped networks provide efficient hardware support for implementing software DSM systems without requiring complex relaxed consistency models. DSM design can thus be greatly simplified using this technology.
Title : Memory mapped networks: a new deal for distributed shared memories? The SciFS experience
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137732
Rinku Gupta, V. Tipparaju, J. Nieplocha, D. Panda
Most high performance scientific applications require efficient support for collective communication. Point-to-point message-passing communication in current generation clusters is based on the Send/Recv communication model. Collective communication operations built on top of such point-to-point message-passing operations might achieve suboptimal performance. VIA and the emerging InfiniBand architecture support remote DMA (RDMA) operations, which allow data to be moved between the nodes with low overhead; they also make it possible to create a logical shared memory address space across the nodes. In this paper we focus on barrier, a frequently used collective operation. We demonstrate how RDMA write operations can be used to support an inter-node barrier in a cluster with SMP nodes. Combining this with a scheme to exploit shared memory within an SMP node, we develop a fast barrier algorithm for a cluster of SMP nodes with a cLAN VIA interconnect. Compared to current barrier algorithms using the Send/Recv communication model, the new approach is shown to reduce barrier latency on a 64 processor (32 dual nodes) system by up to 66%. These results demonstrate that high performance and scalable barrier implementations can be delivered on current and next generation VIA/InfiniBand-based clusters with RDMA support.
Title : Efficient barrier using remote memory operations on VIA-based clusters
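The barrier idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes a dissemination-style exchange (the abstract does not say which pattern the paper uses), uses Python threads to stand in for nodes, and models an RDMA write as a plain store into a flag slot that belongs to the peer. The names `FlagBarrier` and `run_demo` are hypothetical.

```python
import threading
import time


class FlagBarrier:
    """Dissemination barrier driven only by writes into peers' flag slots.

    Assumption: a list store stands in for an RDMA write. The target never
    posts a receive; it only polls its own memory.
    """

    def __init__(self, n):
        self.n = n
        # ceil(log2(n)) rounds, with at least one round so that n == 1 works
        self.rounds = max(1, (n - 1).bit_length())
        # flags[i][k]: highest barrier episode signalled to node i in round k
        self.flags = [[0] * self.rounds for _ in range(n)]
        # episode[i]: how many times node i has entered the barrier
        self.episode = [0] * n

    def wait(self, rank):
        self.episode[rank] += 1
        ep = self.episode[rank]
        for k in range(self.rounds):
            peer = (rank + (1 << k)) % self.n
            self.flags[peer][k] = ep          # simulated remote write
            while self.flags[rank][k] < ep:   # poll local memory
                time.sleep(0)                 # yield to the other threads


def run_demo(n):
    """Return, per rank, whether it saw every rank arrive before leaving."""
    barrier = FlagBarrier(n)
    arrived = [False] * n
    saw_all = [False] * n

    def worker(rank):
        arrived[rank] = True
        barrier.wait(rank)
        saw_all[rank] = all(arrived)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saw_all


if __name__ == "__main__":
    print(run_demo(4))
```

Each flag slot has exactly one writer, and the stored episode numbers only grow, so the barrier object can be reused without resetting the flags. A real RDMA version would additionally need registered memory and ordering guarantees from the interconnect, which this sketch does not model.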
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137741
Yu Chen, Xiaoge Wang, Z. Jiao, Jun Xie, Zhihui Du, Sanli Li
Virtual Interface Architecture (VIA) established a communication model with low latency and high bandwidth, and defined a standard specification for user-level high-performance communication in cluster systems. This paper analyzes the current development, principles, and implementations of VIA, and presents MyVIA, user-level high-performance communication software based on Myrinet that conforms to the VIA specification. The paper first describes the design principles and framework of MyVIA, then proposes new techniques used in MyVIA, including User TLB, continued host physical memory and varied NIC buffer, pipelined communication based on resource and DMA chain, and physical descriptor ring. Experimental results of performance comparisons and analysis are presented; the one-way bandwidth of MyVIA for a 4 KB message is 250 MB/s, and the lowest one-way latency is 8.46 μs, which shows that MyVIA outperforms other implementations of VIA.
Title : MyVIA: a design and implementation of the high performance Virtual Interface Architecture
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137753
Wu-chun Feng, Michael S. Warren, E. Weigle
We present a new twist to the Beowulf cluster - the Bladed Beowulf. In contrast to traditional Beowulfs, which typically use Intel or AMD processors, our Bladed Beowulf uses Transmeta processors in order to keep thermal power dissipation low and reliability and density high while still achieving performance comparable to Intel- and AMD-based clusters. Given the ever increasing complexity of traditional supercomputers and Beowulf clusters, the issues of size, reliability, power consumption, and ease of administration and use will be "the" issues of this decade for high-performance computing. Bigger and faster machines are simply not good enough anymore. To illustrate, we present the results of performance benchmarks on our Bladed Beowulf and introduce two performance metrics that contribute to the total cost of ownership (TCO) of a computing system - performance/power and performance/space.
Title : The Bladed Beowulf: a cost-effective alternative to traditional Beowulfs
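The two metrics named in the abstract above, performance/power and performance/space, are simple ratios, and a short sketch may make them concrete. The units (Gflops, kilowatts, square feet), the function names, and the numbers in the usage lines are all assumptions chosen for illustration; none of them come from the paper.

```python
def performance_per_power(gflops, kilowatts):
    """Sustained performance divided by power draw (Gflops per kW)."""
    if kilowatts <= 0:
        raise ValueError("power must be positive")
    return gflops / kilowatts


def performance_per_space(gflops, square_feet):
    """Sustained performance divided by floor space (Gflops per sq ft)."""
    if square_feet <= 0:
        raise ValueError("floor space must be positive")
    return gflops / square_feet


if __name__ == "__main__":
    # Placeholder figures for a hypothetical cluster, not measured data.
    gflops, kilowatts, square_feet = 20.0, 2.0, 5.0
    print(performance_per_power(gflops, kilowatts))    # 10.0
    print(performance_per_space(gflops, square_feet))  # 4.0
```

Comparing two systems on these ratios rather than on raw Gflops is what lets a slower but cooler and denser machine come out ahead.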
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137782
R. Diaconescu, R. Conradi
This paper proposes a data parallel programming model suitable for loosely synchronous, irregular applications. At the core of the model are distributed objects that express non-trivial data parallelism. Sequential objects express independent computations. The goal is to use objects to fold synchronization into data accesses and thus free the user from concurrency aspects. Distributed objects encapsulate large data partitioned across multiple address spaces. The system classifies accesses to distributed objects as read and write. Furthermore, it uses the access patterns to maintain information about dependences across partitions. The system guarantees inter-object consistency using a relaxed update scheme. Typical access patterns uncover dependences for data on the border between partitions. Experimental results show that this approach is highly usable and efficient.
Title : A data parallel programming model based on distributed objects
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137723
R. Prodan, T. Fahringer
The need to conduct and manage large sets of experiments for scientific applications has increased dramatically over the last decade. However, there is still very little tool support for this complex and tedious process. We introduce the ZENTURIO experiment management system for parameter studies, performance analysis, and software testing for cluster and Grid architectures. ZENTURIO uses the ZEN directive-based language to specify arbitrarily complex program executions. ZENTURIO is designed as a collection of Grid services that comprise: (1) a registry service which supports registering and locating Grid services; (2) an experiment generator that parses files with ZEN directives and instruments applications for performance analysis and parameter studies; (3) an experiment executor that compiles and controls the execution of experiments on the target machine. A graphical user portal allows the user to control and monitor the experiments and to automatically visualise performance and output data across multiple experiments. ZENTURIO has been implemented based on Java/Jini distributed technology. It supports experiment management on cluster architectures via PBS and on Grid infrastructures through GRAM. We report results of using ZENTURIO for performance analysis of an ocean simulation application and a parameter study of a computational finance code.
Title : ZENTURIO: an experiment management system for cluster and Grid computing
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137736
Jianwei Li, W. Liao, A. Choudhary, V. Taylor
In this paper we investigate the data access patterns and file I/O behaviors of a production cosmology application that uses the adaptive mesh refinement (AMR) technique for its domain decomposition. This application was originally developed using the Hierarchical Data Format version 4 (HDF4) I/O library. Since HDF4 does not provide parallel I/O facilities, the global file I/O operations were carried out by one of the allocated processors. When the number of processors becomes large, the I/O performance of this design degrades significantly due to the high communication cost and sequential file access. In this work, we present two additional I/O implementations, using MPI-IO and parallel HDF version 5, and analyze their impact on the I/O performance of this typical AMR application. Based on the I/O patterns discovered in this application, we also discuss the interaction between user-level parallel I/O operations and different parallel file systems and point out the advantages and disadvantages. The performance results presented in this work are obtained from an SGI Origin2000 using XFS, an IBM SP using GPFS, and a Linux cluster using PVFS.
Title : I/O analysis and optimization for an AMR cosmology application
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137785
W. Lawry, Christopher Wilson, A. Maccabe, R. Brightwell
This paper describes a portable benchmark suite that assesses the ability of cluster networking hardware and software to overlap MPI communication and computation. The Communication Offload MPI-based Benchmark, or COMB, uses two methods to characterize the ability of messages to make progress concurrently with computational processing on the host processor(s). COMB measures the relationship between MPI communication bandwidth and host CPU availability.
Title : COMB: a portable benchmark suite for assessing MPI overlap
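The relationship between bandwidth and CPU availability mentioned in the abstract above can be illustrated with a small calculation. This is a hedged sketch only: the abstract does not define either quantity, so the code assumes that availability is the time a fixed amount of work takes with no messaging divided by the time the same work takes while messages are in flight, and that bandwidth is reported in units of 10^6 bytes per second. The function names and the timings in the usage lines are hypothetical.

```python
def cpu_availability(work_alone_s, work_with_messaging_s):
    """Fraction of the host CPU left for computation while messaging.

    Assumed definition: time for the work alone divided by the time for
    the same work when communication is in progress.
    """
    if work_alone_s <= 0 or work_with_messaging_s <= 0:
        raise ValueError("timings must be positive")
    return work_alone_s / work_with_messaging_s


def bandwidth_mb_per_s(bytes_moved, elapsed_s):
    """Achieved bandwidth in MB/s, taking 1 MB as 10**6 bytes."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_moved / elapsed_s / 1e6


if __name__ == "__main__":
    # Placeholder timings, not measurements from any system.
    print(cpu_availability(1.0, 2.0))            # 0.5
    print(bandwidth_mb_per_s(1_000_000, 0.5))    # 2.0
```

Plotting one quantity against the other over a range of message sizes or work intervals gives the kind of curve the abstract alludes to: a system that offloads communication well keeps availability high as bandwidth rises.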
Pub Date : 2002-09-23 DOI: 10.1109/CLUSTR.2002.1137779
Liang Peng, W. Wong, C. Yuen
A parallel programming paradigm dictates the way in which an application is to be expressed. It also restricts the algorithms that may be used in the application. Unfortunately, runtime systems for parallel computing often impose a particular programming paradigm. For a wider choice of algorithms, it is desirable to support more than one paradigm. In this paper we consider SilkRoad II, a variant of the Cilk runtime system for cluster computing. What is unique about SilkRoad II is its memory model which supports multiple paradigms with the underlying software distributed shared memory. The RC-dag memory consistency model of SilkRoad II is introduced. Our experimental results show that the stronger RC-dag can achieve performance comparable to LC of Cilk while supporting a bigger set of paradigms with rather good performance.
Title : SilkRoad II: a multi-paradigm runtime system for cluster computing