Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2
Shulei Xu, J. Hashmi, S. Chakraborty, H. Subramoni, D. Panda
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00010
Recent advances in processor technologies have led to highly multi-threaded, dense multi- and many-core HPC systems, and such processors are now widespread in the Top500 systems. The Message Passing Interface (MPI) has been widely used to scale out scientific applications, and MPI intra-node communication is mainly based on shared memory. The increased core density of modern processors warrants efficient shared memory communication designs to achieve optimal performance. While the literature proposes various algorithms and data structures for producer-consumer scenarios, they need to be revisited in the context of MPI communication to find the solutions that work best on modern architectures. In this paper, we first propose a set of low-level benchmarks to evaluate data structures for shared memory communication, such as Lamport queues, Fast-Forward queues, and Fastboxes (FB). We then bring these designs into the MVAPICH2 MPI library and measure their impact on MPI intra-node communication for a wide variety of communication patterns. The benchmarks are carried out on modern multi-/many-core architectures, including Intel Xeon Cascade Lake and Intel Knights Landing.

Assessing the Performance Impact of using an Active Global Address Space in HPX: A Case for AGAS
Parsa Amini, H. Kaiser
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00008
In this research, we describe the functionality of AGAS (Active Global Address Space), a subsystem of the HPX runtime system designed to handle data locality at runtime, independent of the hardware and architecture configuration. AGAS enables transparent runtime global data access and data migration, but incurs an overhead at runtime. We present a method to assess the performance of AGAS and its impact on the execution time of the Octo-Tiger application. With our assessment method we identify the four most expensive AGAS operations in HPX and demonstrate that the overhead caused by AGAS is negligible.

Sequential Codelet Model of Program Execution: A Super-Codelet Model Based on the Hierarchical Turing Machine
Jose M Monsalve, K. Harms, Kalyan Kumaran, G. Gao
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00005
The Sequential Codelet Model is a program execution model that aims to achieve parallel execution of programs expressed sequentially and hierarchically. It borrows heavily from decades of successful experience with sequential program execution, in particular the use of instruction-level parallelism optimizations for implicitly parallel execution of code. We revisit and redefine the Universal Turing Machine and the Von Neumann architecture to account for the hierarchical organization of the whole computation system and its resources (i.e., memory, computational capabilities, and interconnection networks), and to consider program complexity and structure in relation to execution. This work defines the Sequential Codelet Model (SCM), the Hierarchical Turing Machine (HTM), and the Hierarchical Von Neumann architecture, and explains how implicitly parallel execution of programs can be achieved using these definitions.

Leveraging Network-level Parallelism with Multiple Process-Endpoints for MPI Broadcast
Amit Ruhela, B. Ramesh, S. Chakraborty, H. Subramoni, J. Hashmi, D. Panda
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00009
The Message Passing Interface has been the dominant programming model for developing scalable, high-performance parallel applications. Collective operations provide group communication in a portable and efficient manner and are used by a large number of applications across different domains; optimizing them is key to achieving good speed-ups and portability. Broadcast, or one-to-all communication, is one of the most commonly used collectives in MPI applications. However, existing broadcast algorithms do not effectively utilize the high degree of parallelism and the increased message-rate capabilities offered by modern architectures. In this paper, we address these challenges and propose a scalable multi-endpoint broadcast algorithm that combines hierarchical communication with multiple endpoints per node for high performance and scalability. We evaluate the proposed algorithm against state-of-the-art designs in other MPI libraries, including MVAPICH2, Intel MPI, and Spectrum MPI. We demonstrate its benefits at the benchmark and application levels at scale on four hardware architectures (Intel Cascade Lake, Intel Skylake, AMD EPYC, and IBM POWER9) with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our design shows up to 2.5x performance improvement at the microbenchmark level on 128 nodes. We also observe up to 37% improvement in broadcast communication latency for the SPECMPI scientific applications.

Advert: An Asynchronous Runtime for Fine-Grained Network Systems
Ryan D. Friese, Antonino Tumeo, R. Gioiosa, Mark Raugas, T. Warfel
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00006
The Data Vortex Network is a novel high-radix, congestion-free interconnect able to cope with the fine-grained, unpredictable communication patterns of irregular applications. This paper presents ADVERT, an asynchronous runtime system that provides performance and productivity for the Data Vortex Network. ADVERT integrates a lightweight memory manager (DVMem) for the user-accessible SRAM integrated in the network interface, and a communication library (DVComm) that implements active-messaging primitives (get, put, and remote execution). ADVERT hides the complexity of controlling the network hardware features through the low-level Data Vortex programming interface while providing comparable performance. We discuss ADVERT's design and present microbenchmarks that examine different runtime features. ADVERT provides the functionality for building higher-level asynchronous many-task runtimes and partitioned global address space (PGAS) libraries on top of the Data Vortex Network.

Characterizing the Performance of Executing Many-tasks on Summit
M. Turilli, André Merzky, T. Naughton, W. Elwasif, S. Jha
Pub Date: 2019-09-08. DOI: 10.1109/IPDRM49579.2019.00007
Many scientific workloads comprise large numbers of tasks, where each task is an independent simulation or data analysis. Executing millions of tasks on heterogeneous HPC platforms requires scalable dynamic resource management and multi-level scheduling. RADICAL-Pilot (RP), an implementation of the Pilot abstraction, addresses these challenges and serves as an effective runtime system for executing workloads comprised of many tasks. In this paper, we characterize the performance of executing many tasks with RP when interfaced with JSM and PRRTE on Summit: RP is responsible for resource management and task scheduling on acquired resources; JSM or PRRTE enact the placement and launching of scheduled tasks. Our experiments provide lower bounds on the performance of RP when integrated with JSM and PRRTE. Specifically, for workloads comprised of homogeneous single-core, 15-minute tasks, we find that PRRTE scales better than JSM beyond O(1000) tasks; that PRRTE overheads are negligible; and that PRRTE supports optimizations that lower the impact of overheads and enable 63% resource utilization when executing O(16K) single-core tasks across 404 compute nodes.