Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00013
Autonomous Task Dropping Mechanism to Achieve Robustness in Heterogeneous Computing Systems
Ali Mokhtari, Chavit Denninnart, M. Salehi
Robustness of a distributed computing system is defined as its ability to maintain performance in the presence of uncertain parameters. Uncertainty is a key problem in heterogeneous (and even homogeneous) distributed computing systems that perturbs system robustness. Notably, the performance of these systems is perturbed by uncertainty in both task execution time and task arrival. Accordingly, our goal is to make the system robust against these uncertainties. Considering task execution time as a random variable, we use probabilistic analysis to develop an autonomous proactive task dropping mechanism that attains our robustness goal. Specifically, we provide a mathematical model that identifies when a task dropping decision is optimal, so that system robustness is maximized. We then leverage this model to develop a task dropping heuristic that achieves system robustness with feasible time complexity. Although the proposed model is generic and can be applied to any distributed system, we concentrate on heterogeneous computing (HC) systems, which have a higher degree of exposure to uncertainty than homogeneous systems. Experimental results demonstrate that the autonomous proactive dropping mechanism can improve system robustness by up to 20%.
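To make the dropping decision concrete, here is a minimal Python sketch of the general idea, not the paper's exact model: execution time is treated as a normal random variable, and a task is dropped when its probability of finishing before its deadline falls below a threshold. The normal distribution, the 0.25 threshold, and all parameter names are illustrative assumptions.

```python
import math

def completion_probability(queue_wait, mean_exec, std_exec, deadline):
    """P(queue_wait + exec_time <= deadline) for a Normal(mean, std) execution time."""
    slack = deadline - queue_wait - mean_exec
    return 0.5 * (1.0 + math.erf(slack / (std_exec * math.sqrt(2.0))))

def should_drop(queue_wait, mean_exec, std_exec, deadline, threshold=0.25):
    """Proactively drop a task whose chance of meeting its deadline is too low,
    freeing the machine for tasks that can still succeed (threshold is assumed)."""
    return completion_probability(queue_wait, mean_exec, std_exec, deadline) < threshold

# Example: a 5 s queue wait, ~N(4, 2) execution time, and a 7 s deadline
print(should_drop(queue_wait=5.0, mean_exec=4.0, std_exec=2.0, deadline=7.0))  # True: drop
```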
{"title":"Autonomous Task Dropping Mechanism to Achieve Robustness in Heterogeneous Computing Systems","authors":"Ali Mokhtari, Chavit Denninnart, M. Salehi","doi":"10.1109/IPDPSW50202.2020.00013","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00013","url":null,"abstract":"Robustness of a distributed computing system is defined as the ability to maintain its performance in the presence of uncertain parameters. Uncertainty is a key problem in heterogeneous (and even homogeneous) distributed computing systems that perturbs system robustness. Notably, the performance of these systems is perturbed by uncertainty in both task execution time and arrival. Accordingly, our goal is to make the system robust against these uncertainties. Considering task execution time as a random variable, we use probabilistic analysis to develop an autonomous proactive task dropping mechanism to attain our robustness goal. Specifically, we provide a mathematical model that identifies the optimality of a task dropping decision, so that the system robustness is maximized. Then, we leverage the mathematical model to develop a task dropping heuristic that achieves the system robustness within a feasible time complexity. Although the proposed model is generic and can be applied to any distributed system, we concentrate on heterogeneous computing (HC) systems that have a higher degree of exposure to uncertainty than homogeneous systems. Experimental results demonstrate that the autonomous proactive dropping mechanism can improve the system robustness by up to 20%.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127134518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00075
Smart Streaming: A High-Throughput Fault-tolerant Online Processing System
Jia Guo, G. Agrawal
In recent years, there has been considerable interest in developing frameworks for processing streaming data. Like the commercial systems for data-intensive processing that preceded them, these systems have largely not used methods popular within the HPC community (for example, MPI for communication). In this paper, we demonstrate a system for stream processing that offers a high-level API to users (similar to MapReduce), is fault-tolerant, and is more efficient and scalable than current solutions. In particular, a cost-efficient MPI/OpenMP-based fault-tolerance scheme is incorporated so that the system can survive node failures with only a modest degradation of performance. We evaluate both the functionality and efficiency of Smart Streaming using four common applications in machine learning and data analytics. A comparison against state-of-the-art streaming frameworks shows our system boosts the throughput of test cases by up to 10X and achieves desirable parallelism when scaled out. Additionally, the performance loss upon failures is only proportional to the share of failed resources.
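As a hedged illustration of the fault-tolerance claim (a toy model, not the Smart Streaming API), the sketch below redistributes a failed worker's stream partitions across the survivors, so aggregate capacity drops only by the failed worker's share; all names are hypothetical.

```python
# Toy model: partitions are reassigned round-robin to surviving workers, so
# throughput degrades roughly in proportion to the share of failed resources.

def rebalance(partitions, workers, failed):
    """Reassign every partition across the workers that are still alive."""
    alive = [w for w in workers if w not in failed]
    return {p: alive[i % len(alive)] for i, p in enumerate(partitions)}

partitions = [f"p{i}" for i in range(8)]
workers = ["w0", "w1", "w2", "w3"]
assignment = rebalance(partitions, workers, failed={"w3"})
print(assignment)  # 8 partitions spread over 3 survivors (~25% capacity lost)
```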
{"title":"Smart Streaming: A High-Throughput Fault-tolerant Online Processing System","authors":"Jia Guo, G. Agrawal","doi":"10.1109/ipdpsw50202.2020.00075","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00075","url":null,"abstract":"In recent years, there has been considerable interest in developing frameworks for processing streaming data. Like the precursor commercial systems for data-intensive processing, these systems have largely not used methods popular within the HPC community (for example, MPI for communication). In this paper, we demonstrate a system for stream processing that offers a high-level API to the users (similar to MapReduce), is fault-tolerant, and is also more efficient and scalable than current solutions. Particularly, a cost-efficient MPI/OpenMP based fault-tolerant scheme is incorporated so that the system can survive node failures with only a modest degradation of performance. We evaluate both the functionality and efficiency of Smart Streaming using four common applications in machine learning and data analytics. A comparison against state-of-the-art streaming frameworks shows our system boosts the throughput of test cases by up to 10X and achieve desirable parallelism when scaled out. Additionally, the performance loss upon failures is only proportional to the share of failed resources.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125316456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00085
Unified data movement for offloading Charm++ applications
M. Diener, L. Kalé
Data movement between host and accelerators is one of the most challenging aspects of developing applications for heterogeneous systems. Most existing runtime systems for GPGPU programming require developers to perform data movement manually in the source code, while having to support different hardware and software environments. In this paper, we present a novel way to perform data movement for distributed applications based on the Charm++ programming system. We extend Charm++’s support for migration across memory address spaces to handle accelerator devices by making use of the description of data contained in Charm++’s parallel objects. This allows the Charm++ runtime to handle data movement automatically to a large extent, while supporting different hardware platforms transparently. This increases both developer productivity and the portability of Charm++ applications. We demonstrate our proposal with a Charm++ application that runs offloaded CUDA code on three different hardware platforms with a single data movement specification.
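The following is a language-agnostic sketch of the underlying idea, not the Charm++ API: a parallel object describes its buffers once, and the runtime reuses that single description for every host-device transfer it decides to perform. All class and method names here are hypothetical.

```python
class Particles:
    """A 'parallel object' whose data layout is self-described."""
    def __init__(self, n):
        self.x = [0.0] * n
        self.v = [1.0] * n

    def describe(self):
        # The single data-movement specification: field name -> buffer
        return {"x": self.x, "v": self.v}

class Runtime:
    """Moves described buffers to/from a device before and after a kernel."""
    def offload(self, obj, kernel):
        device = {name: self.to_device(buf) for name, buf in obj.describe().items()}
        kernel(device)                        # run the (offloaded) computation
        for name, buf in device.items():
            getattr(obj, name)[:] = self.to_host(buf)

    def to_device(self, buf):  # placeholder for a real host-to-device copy
        return list(buf)

    def to_host(self, buf):    # placeholder for a real device-to-host copy
        return list(buf)

def advance(bufs):  # stands in for a CUDA kernel
    bufs["x"][:] = [x + v for x, v in zip(bufs["x"], bufs["v"])]

p = Particles(4)
Runtime().offload(p, advance)
print(p.x)  # [1.0, 1.0, 1.0, 1.0]
```

Because the object, not the application code, owns the description, the same program can run against a CUDA, host-only, or other backend by swapping the runtime's copy routines.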
{"title":"Unified data movement for offloading Charm++ applications","authors":"M. Diener, L. Kalé","doi":"10.1109/IPDPSW50202.2020.00085","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00085","url":null,"abstract":"Data movement between host and accelerators is one of the most challenging aspects of developing applications for heterogeneous systems. Most existing runtime systems for GPGPU programming require developers to perform data movement manually in the source code, while having to support different hardware and software environments. In this paper, we present a novel way to perform data movement for distributed applications based on the Charm ++ programming system. We extend Charm ++’s support for migration across memory address spaces to handle accelerator devices by making use of the description of data contained in Charm ++’s parallel objects. This allows the Charm ++ runtime to handle data movement automatically to a large extent, while supporting different hardware platforms transparently. This increases both developer productivity and the portability of Charm ++ applications. We demonstrate our proposal with a Charm ++ application that runs offloaded CUDA code on three different hardware platforms with a single data movement specification.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116822294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00157
Workshop on Resource Arbitration for Dynamic Runtimes (RADR)
P. Beckman, E. Jeannot, Swann Perarnau
The question of efficient dynamic allocation of compute-node resources, such as cores, by independent libraries or runtime systems can be a nightmare. Scientists writing application components have no way to efficiently specify and compose resource-hungry components. As application software stacks become deeper and multiple runtime layers compete for resources from the operating system, it has become clear that intelligent cooperation is needed. Resources such as compute cores, in-package memory, and even electrical power must be orchestrated dynamically across application components, with the ability to query each other and respond appropriately. A more integrated solution would reduce intra-application resource competition and improve performance. Furthermore, application runtime systems could request and allocate specific hardware assets and adjust runtime tuning parameters up and down the software stack. The goal of this workshop is to gather and share the latest scholarly research from the community working on these issues, at all levels of the HPC software stack. These include thread allocation, resource arbitration and management, containers, and so on, from runtime-system designers to compilers. We will also use panel sessions and keynote talks to discuss these issues, share visions, and present solutions.
{"title":"Workshop on Resource Arbitration for Dynamic Runtimes (RADR)","authors":"P. Beckman, E. Jeannot, Swann Perarnau","doi":"10.1109/ipdpsw50202.2020.00157","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00157","url":null,"abstract":"The question of efficient dynamic allocation of compute-node resources, such as cores, by independent libraries or runtime systems can be an nightmare. Scientists writing application components have no way to efficiently specify and compose resource-hungry components. As application software stacks become deeper and the interaction of multiple runtime layers compete for resources from the operating system, it has become clear that intelligent cooperation is needed. Resources such as compute cores, in-package memory, and even electrical power must be orchestrated dynamically across application components, with the ability to query each other and respond appropriately. A more integrated solution would reduce intra-application resource competition and improve performance. Furthermore, application runtime systems could request and allocate specific hardware assets and adjust runtime tuning parameters up and down the software stack.The goal of this workshop is to gather and share the latest scholarly research from the community working on these issues, at all levels of the HPC software stack. This include thread allocation, resource arbitration and management, containers, and so on, from runtime-system designers to compilers. We will also use panel sessions and keynote talks to discuss these issues, share visions, and present solutions.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130307713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00044
GrAPL 2020 Keynote Speaker: The GraphIt Universal Graph Framework: Achieving High Performance across Algorithms, Graph Types, and Architectures
Saman P. Amarasinghe
In recent years, large graphs with billions of vertices and trillions of edges have emerged in many domains, such as social network analytics, machine learning, physical simulations, and biology. However, optimizing the performance of graph applications is notoriously difficult due to irregular memory access patterns and load imbalance across cores. The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, and the features of the underlying hardware. No single set of optimizations or single hardware platform works well across all applications.
{"title":"GrAPL 2020 Keynote Speaker The GraphIt Universal Graph Framework: Achieving HighPerformance across Algorithms, Graph Types, and Architectures","authors":"Saman P. Amarasinghe","doi":"10.1109/IPDPSW50202.2020.00044","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00044","url":null,"abstract":"In recent years, large graphs with billions of vertices and trillions of edges have emerged in many domains, such as social network analytics, machine learning, physical simulations, and biology. However, optimizing the performance of graph applications is notoriously difficult due to irregular memory access patterns and load imbalance across cores. The performance of graph programs depends highly on the algorithm, the size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or single hardware platform works well across all applications.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133877968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00060
PDCunplugged: A Free Repository of Unplugged Parallel Distributed Computing Activities
Suzanne J. Matthews
Integrating parallel and distributed computing (PDC) topics into core computing courses is of increasing interest to educators. However, there is a question of how best to introduce PDC to undergraduates. Several educators have proposed the use of “unplugged activities”, such as role-playing dramatizations and analogies, to introduce PDC concepts. Yet, unplugged activities for PDC are widely scattered and often difficult to find, making it challenging for educators to create and incorporate unplugged interventions in their classrooms. The PDCunplugged project seeks to rectify these issues by providing a free repository where educators can find and share unplugged activities related to PDC. The existing curation contains nearly forty unique unplugged activities collected from thirty years of PDC literature and from across the Internet, and maps each activity to relevant CS2013 PDC knowledge units and TCPP PDC topic areas. Learn more about the project at pdcunplugged.org.
{"title":"PDCunplugged: A Free Repository of Unplugged Parallel Distributed Computing Activities","authors":"Suzanne J. Matthews","doi":"10.1109/IPDPSW50202.2020.00060","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00060","url":null,"abstract":"Integrating parallel and distributed computing (PDC) topics in core computing courses is a topic of increasing interest for educators. However, there is a question of how best to introduce PDC to undergraduates. Several educators have proposed the use of “unplugged activities”, such as role-playing dramatizations and analogies, to introduce PDC concepts. Yet, unplugged activities for PDC are widely-scattered and often difficult to find, making it challenging for educators to create and incorporate unplugged interventions in their classrooms. The PDCunplugged project seeks to rectify these issues by providing a free repository where educators can find and share unplugged activities related to PDC. The existing curation contains nearly forty unique unplugged activities collected from thirty years of the PDC literature and from all over the Internet, and maps each activity to relevant CS2013 PDC knowledge units and TCPP PDC topic areas. Learn more about the project at pdcunplugged.org.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114006888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00140
Importance of Selecting Data Layouts in the Tsunami Simulation Code
Takumi Kishitani, K. Komatsu, Masayuki Sato, A. Musa, Hiroaki Kobayashi
Exploiting memory performance is one of the keys to accelerating memory-intensive applications. One way to improve memory performance is to make memory accesses efficient. Since the memory access pattern changes depending on the data layout, choosing an appropriate data layout is necessary for effective memory access. This paper focuses on tsunami simulation as a high performance computing application that requires high memory performance. To examine the performance variation due to data layouts, several data layouts are applied to the tsunami simulation. The evaluation results clarify that the performance of the tsunami simulation is sensitive to the input data, the computing system, and the data layout. The execution time of the tsunami simulation with an array of structures is much longer than with a discrete array or a structure of arrays. The discrete array and the structure of arrays are not uniformly faster; their relative performance changes with the computing system and the input data. Based on these observations, this paper demonstrates the importance of data layout selection for exploiting memory performance.
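A minimal NumPy sketch of the layout trade-off discussed above; the field names are hypothetical stand-ins for simulation state. An array of structures (AoS) interleaves fields, so sweeping one field strides through memory, while a structure of arrays (SoA) keeps each field contiguous.

```python
import numpy as np

n = 1_000_000

# Array of structures (AoS): fields interleaved per grid point, so touching
# one field strides across the interleaved records.
aos = np.zeros(n, dtype=[("height", "f8"), ("flux_x", "f8"), ("flux_y", "f8")])

# Structure of arrays (SoA): each field in its own contiguous array,
# so a per-field sweep streams through memory.
soa_height = np.zeros(n)
soa_flux_x = np.zeros(n)

# The same update either way; the SoA form reads/writes contiguous memory.
aos["height"] += 0.1 * aos["flux_x"]  # strided access pattern
soa_height += 0.1 * soa_flux_x        # contiguous, vectorizes cleanly
```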
{"title":"Importance of Selecting Data Layouts in the Tsunami Simulation Code","authors":"Takumi Kishitani, K. Komatsu, Masayuki Sato, A. Musa, Hiroaki Kobayashi","doi":"10.1109/IPDPSW50202.2020.00140","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00140","url":null,"abstract":"Exploiting the memory performance is one of the keys to accelerate the memory-intensive applications. A way for improving the memory performance is to make memory accesses efficient. Since the memory access pattern changes depending on data layouts, it is necessary for effective memory access to choose the appropriate data layout. This paper focuses on the tsunami simulation as one of the high performance computing applications that require the high memory performance. To examine the performance variance due to the data layouts, several data layouts are applied to the tsunami simulation. From the evaluation results, this paper clarifies that the performance of the tsunami simulation is sensitive to the input data, the computing systems, and the data layouts. The execution time of the tsunami simulation with an array of structures is much longer than those with a discrete array and a structure of arrays. The performances of the discrete array and the structure of arrays are not high in specific cases but changed according to the computing systems and the input data. Based on these observations, this paper indicates the importance of the data layout selection to exploit the memory performance.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122683193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00052
Kronecker Graph Generation with Ground Truth for 4-Cycles and Dense Structure in Bipartite Graphs
Trevor Steil, Scott McMillan, G. Sanders, R. Pearce, Benjamin W. Priest
We demonstrate that nonstochastic Kronecker graph generators produce massive-scale bipartite graphs with ground-truth global and local properties, and discuss their use for validation of graph analytics. Given two small connected scale-free graphs with adjacency matrices $A$ and $B$, their Kronecker product graph [1] has adjacency matrix $C = A \otimes B$. We first demonstrate that having one factor $A$ non-bipartite (alternatively, adding all self loops to a bipartite $A$) with the other factor $B$ bipartite ensures $\mathcal{G}_C$ is bipartite and connected. Formulas for ground truth of many graph properties (including degree, diameter, and eccentricity) carry over directly from the general case presented in previous work [2], [3]. However, the analysis of higher-order structure and dense structure is different in bipartite graphs, as no odd-length cycles exist (including triangles) and the densest possible structures are bicliques. We derive formulas that give ground truth for 4-cycles (a.k.a. squares or butterflies) at every vertex and edge in $\mathcal{G}_C$. Additionally, we demonstrate that bipartite communities (dense vertex subsets) in the factors $A$, $B$ yield dense bipartite communities in the Kronecker product $C$. We also discuss interesting properties of Kronecker product graphs revealed by the formulas and their impact on using them as benchmarks with ground truth for various complex analytics. For example, for connected $A$ and $B$ of nontrivial size, $\mathcal{G}_C$ has 4-cycles at vertices/edges associated with vertices/edges in $A$ and $B$ that have none, making it difficult to generate graphs with ground-truth bipartite generalizations of truss decomposition (e.g., the k-wing decomposition of [4]).
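As a small, hedged companion to the construction (not the paper's generator or its per-vertex formulas), the sketch below forms $C = A \otimes B$ with a non-bipartite $A$ and bipartite $B$, verifies that the product is bipartite and connected, and counts global 4-cycles via the standard closed-4-walk identity for simple graphs; the tiny factor graphs are chosen only for illustration.

```python
import numpy as np
import networkx as nx

# Non-bipartite factor A (a triangle) and bipartite factor B (a 3-vertex path)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
B = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])

C = np.kron(A, B)  # adjacency matrix of the Kronecker product graph

G = nx.from_numpy_array(C)
print(nx.is_bipartite(G), nx.is_connected(G))  # True True

# Closed-4-walk identity for simple graphs:
#   tr(C^4) = 8 * (#4-cycles) + 2 * sum(deg^2) - 2 * |E|
deg = C.sum(axis=1)
m = C.sum() / 2
closed_walks = np.trace(np.linalg.matrix_power(C, 4))
n_squares = (closed_walks - 2 * (deg @ deg) + 2 * m) / 8
print(int(n_squares))  # 3 for this pair of factors
```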
{"title":"Kronecker Graph Generation with Ground Truth for 4-Cycles and Dense Structure in Bipartite Graphs","authors":"Trevor Steil, Scott McMillan, G. Sanders, R. Pearce, Benjamin W. Priest","doi":"10.1109/IPDPSW50202.2020.00052","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00052","url":null,"abstract":"We demonstrate nonstochastic Kronecker graph generators produce massive-scale bipartite graphs with ground truth global and local properties and discuss their use for validation of graph analytics. Given two small connected scalefree graphs with adjacency matrices $A$ and $B$, their Kronecker product graph [1] has adjacency matrix $C=Aotimes B$. We first demonstrate that having one factor $A$ non-bipartite (alternatively, adding all self loops to a bipartite $A$) with other factor $B$ bipartite ensures $mathcal {G}c$ is bipartite and connected. Formulas for ground truth of many graph properties (including degree, diameter, and eccentricity) carry over directly from the general case presented in previous work [2], [3]. However, the analysis of higher-order structure and dense structure is different in bipartite graphs, as no odd-length cycles exist (including triangles) and the densest possible structures are bicliques. We derive formulas to give ground truth for 4-cycles (a.k. a. squares or butterflies) at every vertex and edge in $mathcal {G}c$. Additionally, we demonstrate that bipartite communities (dense vertex subsets) in the factors $A, B$ yield dense bipartite communities in the Kronecker product $C.$ We additionally discuss interesting properties of Kronecker product graphs revealed by the formulas an their impact on using them as benchmarks with ground truth for various complex analytics. For example, for connected $A$ and $B$ of nontrivial size, $mathcal {G}c$ has 4-cycles at vertices/edges associated with vertices/edges in $A$ and $B$ that have none, making it difficult to generate graphs with ground truth bipartite generalizations of truss decomposition (e.g., the k-wing decomposition of [4]).","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121274152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00127
Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime
Yu Pei, Qinglei Cao, G. Bosilca, P. Luszczek, V. Eijkhout, J. Dongarra
Stencil computations and general sparse matrix-vector products (SpMV) are key components of many algorithms, such as geometric multigrid and Krylov solvers. Their low arithmetic intensity, however, means that memory bandwidth and network latency are the performance-limiting factors. The current architectural trend favors computation over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme that minimizes network latency in repeated sparse matrix-vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing the communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation-communication overlap that is inherent in a dataflow task-based runtime system such as PaRSEC, to demonstrate their combined benefits. We implemented the 2D five-point stencil (Jacobi iteration) in PETSc and over PaRSEC in two flavors, full communication (base-PaRSEC) and CA-PaRSEC, which operate directly on a 2D compute grid. Our results on two clusters, NaCL and Stampede2, indicate that we can achieve a 2X speedup over the standard SpMV solution implemented in PETSc; in certain cases, when kernel execution does not dominate the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over the base-PaRSEC implementation on NaCL and Stampede2, respectively.
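To make the communication-avoiding idea concrete, here is a hedged single-node NumPy sketch, not the paper's PaRSEC implementation: each local block is padded with an s-deep halo, and s Jacobi sweeps run per halo exchange, trading replicated halo work for s-times-fewer messages.

```python
import numpy as np

def jacobi_sweeps_ca(block_with_halo, s):
    """Run s five-point Jacobi sweeps on a block padded with an s-deep halo.
    Each sweep consumes one halo layer, so no exchange is needed until all
    s sweeps finish: the CA trade of redundant halo work for fewer messages."""
    u = block_with_halo.astype(float)
    for step in range(s):
        lo, hi = step + 1, u.shape[0] - step - 1
        # NumPy evaluates the whole right-hand side before assigning,
        # so this is a true Jacobi (not Gauss-Seidel) update.
        u[lo:hi, lo:hi] = 0.25 * (u[lo-1:hi-1, lo:hi] + u[lo+1:hi+1, lo:hi] +
                                  u[lo:hi, lo-1:hi-1] + u[lo:hi, lo+1:hi+1])
    return u[s:-s, s:-s]  # the interior block, advanced by s iterations

# 8x8 interior block padded with a 2-deep halo: 2 sweeps per halo exchange
interior = jacobi_sweeps_ca(np.random.rand(12, 12), s=2)
print(interior.shape)  # (8, 8)
```

In a distributed run, the halo values would come from neighbor ranks once per s sweeps instead of once per sweep, which is where the latency saving comes from.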
{"title":"Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime","authors":"Yu Pei, Qinglei Cao, G. Bosilca, P. Luszczek, V. Eijkhout, J. Dongarra","doi":"10.1109/IPDPSW50202.2020.00127","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00127","url":null,"abstract":"Stencil computation or general sparse matrix-vector product (SpMV) are key components in many algorithms like geometric multigrid or Krylov solvers. But their low arithmetic intensity means that memory bandwidth and network latency will be the performance limiting factors. The current architectural trend favors computations over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme to minimize network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors, full communications (base-PaRSEC) and CA-PaRSEC which operate directly on a 2D compute grid. Our results running on two clusters, NaCL and Stampede2 indicate that we can achieve 2X speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution is not dominating the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over base-PaRSEC implementation on NaCL and Stampede2 respectively.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115515028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00054
EduPar-20 Keynote Speaker
Martin Langhammer
The fields of computer and information science and engineering (CISE) are central to nearly all of society’s needs, opportunities, and challenges. The US National Science Foundation (NSF) was created 70 years ago with a broad mission to promote the progress of science and to catalyze societal and economic benefits. NSF, largely through its CISE directorate, which has an annual budget of more than $1B, accounts for over 85% of federally funded, academic, fundamental computer science research in the US. My talk will give an overview of NSF/CISE research, education, and research infrastructure programs, and relate them to the technical and societal trends and topics that will impact their future trajectory. My talk will highlight opportunity areas for education and workforce development across the computing and information sciences, with a particular emphasis on parallelism and advanced computing and information topics.
{"title":"EduPar-20 Keynote Speaker","authors":"Martin Langhammer","doi":"10.1109/ipdpsw50202.2020.00054","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00054","url":null,"abstract":"The fields of computer and information science and engineering (CISE) are central to nearly all of society’s needs, opportunities, and challenges. The US National Science Foundation (NSF) was created 70 years ago with a broad mission to promote the progress of science and to catalyze societal and economic benefits. NSF, largely through its CISE directorate which has an annual budget of more than $1B, accounts for over 85% of federally-funded, academic, fundamental computer science research in the US. My talk will give an overview of NSF/CISE research, education, and research infrastructure programs, and relate them to the technical and societal trends and topics that will impact their future trajectory. My talk will highlight opportunity areas for education and workforce development across the computing and information sciences, with a particular emphasis on parallelism and advanced computing and information topics.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115413213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}