Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2
Shulei Xu, J. Hashmi, S. Chakraborty, H. Subramoni, D. Panda
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00010
Recent advances in processor technologies have led to highly multi-threaded, dense multi- and many-core HPC systems, and such processors are now widespread in the Top500 systems. The Message Passing Interface (MPI) has been widely used to scale out scientific applications, and MPI intra-node communication is mainly based on shared memory. The increased core density of modern processors warrants efficient shared memory communication designs to achieve optimal performance. While the literature proposes various algorithms and data structures for producer-consumer scenarios, they need to be revisited in the context of MPI communication to find the solutions that work best on modern architectures. In this paper, we first propose a set of low-level benchmarks to evaluate data structures for shared memory communication, such as Lamport queues, Fast-Forward queues, and Fastboxes (FB). We then bring these designs into the MVAPICH2 MPI library and measure their impact on MPI intra-node communication for a wide variety of communication patterns. The benchmarks are carried out on modern multi-/many-core architectures, including Intel Xeon Cascade Lake and Intel Knights Landing.

Assessing the Performance Impact of using an Active Global Address Space in HPX: A Case for AGAS
Parsa Amini, H. Kaiser
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00008
In this research, we describe the functionality of AGAS (Active Global Address Space), a subsystem of the HPX runtime system designed to handle data locality at runtime, independent of the hardware and architecture configuration. AGAS enables transparent runtime global data access and data migration, but incurs an overhead at runtime. We present a method to assess the performance of AGAS and its impact on the execution time of the Octo-Tiger application. With our assessment method we identify the four most expensive AGAS operations in HPX and demonstrate that the overhead caused by AGAS is negligible.

Sequential Codelet Model of Program Execution: A Super-Codelet Model Based on the Hierarchical Turing Machine
Jose M Monsalve, K. Harms, Kalyan Kumaran, G. Gao
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00005
The Sequential Codelet Model is a program execution model that aims to achieve parallel execution of programs expressed sequentially and hierarchically. It borrows heavily from decades of successful experience with sequential program execution, in particular the use of instruction-level parallelism optimizations for implicitly parallel execution of code. We revisit and redefine the Universal Turing Machine and the Von Neumann architecture to account for the hierarchical organization of the whole computation system and its resources (i.e., memory, computational capabilities, and interconnection networks), and to consider program complexity and structure in relation to execution. This work defines the Sequential Codelet Model (SCM), the Hierarchical Turing Machine (HTM), and the Hierarchical Von Neumann architecture, and explains how implicitly parallel execution of programs can be achieved using these definitions.

Leveraging Network-level Parallelism with Multiple Process-Endpoints for MPI Broadcast
Amit Ruhela, B. Ramesh, S. Chakraborty, H. Subramoni, J. Hashmi, D. Panda
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00009
The Message Passing Interface has been the dominant programming model for developing scalable, high-performance parallel applications. Collective operations provide group communication in a portable and efficient manner and are used by a large number of applications across different domains; optimizing them is key to achieving good speed-ups and portability. Broadcast, or one-to-all communication, is one of the most commonly used collectives in MPI applications. However, existing broadcast algorithms do not effectively utilize the high degree of parallelism and the increased message-rate capabilities offered by modern architectures. In this paper, we address these challenges and propose a scalable multi-endpoint broadcast algorithm that combines hierarchical communication with multiple endpoints per node for high performance and scalability. We evaluate the proposed algorithm against state-of-the-art designs in other MPI libraries, including MVAPICH2, Intel MPI, and Spectrum MPI. We demonstrate its benefits at the benchmark and application levels at scale on four hardware architectures (Intel Cascade Lake, Intel Skylake, AMD EPYC, and IBM POWER9) with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our design shows up to 2.5x performance improvement at the microbenchmark level on 128 nodes. We also observe up to 37% improvement in broadcast communication latency for the SPECMPI scientific applications.

Advert: An Asynchronous Runtime for Fine-Grained Network Systems
Ryan D. Friese, Antonino Tumeo, R. Gioiosa, Mark Raugas, T. Warfel
Pub Date: 2019-11-01. DOI: 10.1109/IPDRM49579.2019.00006
The Data Vortex Network is a novel high-radix, congestion-free interconnect able to cope with the fine-grained, unpredictable communication patterns of irregular applications. This paper presents ADVERT, an asynchronous runtime system that provides performance and productivity for the Data Vortex Network. ADVERT integrates a lightweight memory manager (DVMem) for the user-accessible SRAM integrated in the network interface, and a communication library (DVComm) that implements active-messaging primitives (get, put, and remote execution). ADVERT hides the complexity of controlling the network hardware features through the low-level Data Vortex programming interface while providing comparable performance. We discuss ADVERT's design and present microbenchmarks that examine different runtime features. ADVERT provides the functionality for building higher-level asynchronous many-task runtimes and partitioned global address space (PGAS) libraries on top of the Data Vortex Network.

Characterizing the Performance of Executing Many-tasks on Summit
M. Turilli, André Merzky, T. Naughton, W. Elwasif, S. Jha
Pub Date: 2019-09-08. DOI: 10.1109/IPDRM49579.2019.00007
Many scientific workloads comprise large numbers of tasks, where each task is an independent simulation or data analysis. Executing millions of tasks on heterogeneous HPC platforms requires scalable dynamic resource management and multi-level scheduling. RADICAL-Pilot (RP), an implementation of the Pilot abstraction, addresses these challenges and serves as an effective runtime system for executing workloads comprised of many tasks. In this paper, we characterize the performance of executing many tasks with RP when interfaced with JSM and PRRTE on Summit: RP is responsible for resource management and task scheduling on acquired resources; JSM or PRRTE enact the placement and launching of scheduled tasks. Our experiments provide lower bounds on the performance of RP when integrated with JSM and PRRTE. Specifically, for workloads comprised of homogeneous single-core, 15-minute tasks, we find that PRRTE scales better than JSM beyond O(1000) tasks; that PRRTE overheads are negligible; and that PRRTE supports optimizations that lower the impact of overheads and enable 63% resource utilization when executing O(16K) single-core tasks across 404 compute nodes.