Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00013
Autonomous Task Dropping Mechanism to Achieve Robustness in Heterogeneous Computing Systems
Ali Mokhtari, Chavit Denninnart, M. Salehi
Robustness of a distributed computing system is defined as its ability to maintain performance in the presence of uncertain parameters. Uncertainty is a key problem in heterogeneous (and even homogeneous) distributed computing systems that perturbs system robustness. Notably, the performance of these systems is perturbed by uncertainty in both task execution time and task arrival. Accordingly, our goal is to make the system robust against these uncertainties. Considering task execution time as a random variable, we use probabilistic analysis to develop an autonomous proactive task dropping mechanism that attains our robustness goal. Specifically, we provide a mathematical model that identifies when a task dropping decision is optimal, so that system robustness is maximized. We then leverage this model to develop a task dropping heuristic that achieves system robustness with feasible time complexity. Although the proposed model is generic and can be applied to any distributed system, we concentrate on heterogeneous computing (HC) systems, which have a higher degree of exposure to uncertainty than homogeneous systems. Experimental results demonstrate that the autonomous proactive dropping mechanism can improve system robustness by up to 20%.
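To make the dropping decision concrete, here is a minimal Python sketch of the general idea, not the paper's exact model: execution time is treated as a normal random variable, and a task is dropped when its probability of finishing before its deadline falls below a threshold. The normal distribution, the 0.25 threshold, and all parameter names are illustrative assumptions.

```python
import math

def completion_probability(queue_wait, mean_exec, std_exec, deadline):
    """P(queue_wait + exec_time <= deadline) for a Normal(mean, std) execution time."""
    slack = deadline - queue_wait - mean_exec
    return 0.5 * (1.0 + math.erf(slack / (std_exec * math.sqrt(2.0))))

def should_drop(queue_wait, mean_exec, std_exec, deadline, threshold=0.25):
    """Proactively drop a task whose chance of meeting its deadline is too low,
    freeing the machine for tasks that can still succeed (threshold is assumed)."""
    return completion_probability(queue_wait, mean_exec, std_exec, deadline) < threshold

# Example: a 5 s queue wait, ~N(4, 2) execution time, and a 7 s deadline
print(should_drop(queue_wait=5.0, mean_exec=4.0, std_exec=2.0, deadline=7.0))  # True: drop
```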
{"title":"Autonomous Task Dropping Mechanism to Achieve Robustness in Heterogeneous Computing Systems","authors":"Ali Mokhtari, Chavit Denninnart, M. Salehi","doi":"10.1109/IPDPSW50202.2020.00013","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00013","url":null,"abstract":"Robustness of a distributed computing system is defined as the ability to maintain its performance in the presence of uncertain parameters. Uncertainty is a key problem in heterogeneous (and even homogeneous) distributed computing systems that perturbs system robustness. Notably, the performance of these systems is perturbed by uncertainty in both task execution time and arrival. Accordingly, our goal is to make the system robust against these uncertainties. Considering task execution time as a random variable, we use probabilistic analysis to develop an autonomous proactive task dropping mechanism to attain our robustness goal. Specifically, we provide a mathematical model that identifies the optimality of a task dropping decision, so that the system robustness is maximized. Then, we leverage the mathematical model to develop a task dropping heuristic that achieves the system robustness within a feasible time complexity. Although the proposed model is generic and can be applied to any distributed system, we concentrate on heterogeneous computing (HC) systems that have a higher degree of exposure to uncertainty than homogeneous systems. Experimental results demonstrate that the autonomous proactive dropping mechanism can improve the system robustness by up to 20%.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127134518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00075
Smart Streaming: A High-Throughput Fault-tolerant Online Processing System
Jia Guo, G. Agrawal
In recent years, there has been considerable interest in developing frameworks for processing streaming data. Like the commercial systems for data-intensive processing that preceded them, these systems have largely not used methods popular within the HPC community (for example, MPI for communication). In this paper, we demonstrate a system for stream processing that offers a high-level API to users (similar to MapReduce), is fault-tolerant, and is more efficient and scalable than current solutions. In particular, a cost-efficient MPI/OpenMP-based fault-tolerance scheme is incorporated so that the system can survive node failures with only a modest degradation of performance. We evaluate both the functionality and efficiency of Smart Streaming using four common applications in machine learning and data analytics. A comparison against state-of-the-art streaming frameworks shows our system boosts the throughput of test cases by up to 10X and achieves desirable parallelism when scaled out. Additionally, the performance loss upon failures is only proportional to the share of failed resources.
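As a hedged illustration of the fault-tolerance claim (a toy model, not the Smart Streaming API), the sketch below redistributes a failed worker's stream partitions across the survivors, so aggregate capacity drops only by the failed worker's share; all names are hypothetical.

```python
# Toy model: partitions are reassigned round-robin to surviving workers, so
# throughput degrades roughly in proportion to the share of failed resources.

def rebalance(partitions, workers, failed):
    """Reassign every partition across the workers that are still alive."""
    alive = [w for w in workers if w not in failed]
    return {p: alive[i % len(alive)] for i, p in enumerate(partitions)}

partitions = [f"p{i}" for i in range(8)]
workers = ["w0", "w1", "w2", "w3"]
assignment = rebalance(partitions, workers, failed={"w3"})
print(assignment)  # 8 partitions spread over 3 survivors (~25% capacity lost)
```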
{"title":"Smart Streaming: A High-Throughput Fault-tolerant Online Processing System","authors":"Jia Guo, G. Agrawal","doi":"10.1109/ipdpsw50202.2020.00075","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00075","url":null,"abstract":"In recent years, there has been considerable interest in developing frameworks for processing streaming data. Like the precursor commercial systems for data-intensive processing, these systems have largely not used methods popular within the HPC community (for example, MPI for communication). In this paper, we demonstrate a system for stream processing that offers a high-level API to the users (similar to MapReduce), is fault-tolerant, and is also more efficient and scalable than current solutions. Particularly, a cost-efficient MPI/OpenMP based fault-tolerant scheme is incorporated so that the system can survive node failures with only a modest degradation of performance. We evaluate both the functionality and efficiency of Smart Streaming using four common applications in machine learning and data analytics. A comparison against state-of-the-art streaming frameworks shows our system boosts the throughput of test cases by up to 10X and achieve desirable parallelism when scaled out. Additionally, the performance loss upon failures is only proportional to the share of failed resources.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125316456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00085
Unified data movement for offloading Charm++ applications
M. Diener, L. Kalé
Data movement between host and accelerators is one of the most challenging aspects of developing applications for heterogeneous systems. Most existing runtime systems for GPGPU programming require developers to perform data movement manually in the source code, while having to support different hardware and software environments. In this paper, we present a novel way to perform data movement for distributed applications based on the Charm++ programming system. We extend Charm++’s support for migration across memory address spaces to handle accelerator devices by making use of the description of data contained in Charm++’s parallel objects. This allows the Charm++ runtime to handle data movement automatically to a large extent, while supporting different hardware platforms transparently. This increases both developer productivity and the portability of Charm++ applications. We demonstrate our proposal with a Charm++ application that runs offloaded CUDA code on three different hardware platforms with a single data movement specification.
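The following is a language-agnostic sketch of the underlying idea, not the Charm++ API: a parallel object describes its buffers once, and the runtime reuses that single description for every host-device transfer it decides to perform. All class and method names here are hypothetical.

```python
class Particles:
    """A 'parallel object' whose data layout is self-described."""
    def __init__(self, n):
        self.x = [0.0] * n
        self.v = [1.0] * n

    def describe(self):
        # The single data-movement specification: field name -> buffer
        return {"x": self.x, "v": self.v}

class Runtime:
    """Moves described buffers to/from a device before and after a kernel."""
    def offload(self, obj, kernel):
        device = {name: self.to_device(buf) for name, buf in obj.describe().items()}
        kernel(device)                        # run the (offloaded) computation
        for name, buf in device.items():
            getattr(obj, name)[:] = self.to_host(buf)

    def to_device(self, buf):  # placeholder for a real host-to-device copy
        return list(buf)

    def to_host(self, buf):    # placeholder for a real device-to-host copy
        return list(buf)

def advance(bufs):  # stands in for a CUDA kernel
    bufs["x"][:] = [x + v for x, v in zip(bufs["x"], bufs["v"])]

p = Particles(4)
Runtime().offload(p, advance)
print(p.x)  # [1.0, 1.0, 1.0, 1.0]
```

Because the object, not the application code, owns the description, the same program can run against a CUDA, host-only, or other backend by swapping the runtime's copy routines.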
{"title":"Unified data movement for offloading Charm++ applications","authors":"M. Diener, L. Kalé","doi":"10.1109/IPDPSW50202.2020.00085","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00085","url":null,"abstract":"Data movement between host and accelerators is one of the most challenging aspects of developing applications for heterogeneous systems. Most existing runtime systems for GPGPU programming require developers to perform data movement manually in the source code, while having to support different hardware and software environments. In this paper, we present a novel way to perform data movement for distributed applications based on the Charm ++ programming system. We extend Charm ++’s support for migration across memory address spaces to handle accelerator devices by making use of the description of data contained in Charm ++’s parallel objects. This allows the Charm ++ runtime to handle data movement automatically to a large extent, while supporting different hardware platforms transparently. This increases both developer productivity and the portability of Charm ++ applications. We demonstrate our proposal with a Charm ++ application that runs offloaded CUDA code on three different hardware platforms with a single data movement specification.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116822294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00157
Workshop on Resource Arbitration for Dynamic Runtimes (RADR)
P. Beckman, E. Jeannot, Swann Perarnau
The question of efficient dynamic allocation of compute-node resources, such as cores, by independent libraries or runtime systems can be a nightmare. Scientists writing application components have no way to efficiently specify and compose resource-hungry components. As application software stacks become deeper and multiple runtime layers compete for resources from the operating system, it has become clear that intelligent cooperation is needed. Resources such as compute cores, in-package memory, and even electrical power must be orchestrated dynamically across application components, with the ability to query each other and respond appropriately. A more integrated solution would reduce intra-application resource competition and improve performance. Furthermore, application runtime systems could request and allocate specific hardware assets and adjust runtime tuning parameters up and down the software stack. The goal of this workshop is to gather and share the latest scholarly research from the community working on these issues, at all levels of the HPC software stack. These include thread allocation, resource arbitration and management, containers, and so on, from runtime-system designers to compilers. We will also use panel sessions and keynote talks to discuss these issues, share visions, and present solutions.
{"title":"Workshop on Resource Arbitration for Dynamic Runtimes (RADR)","authors":"P. Beckman, E. Jeannot, Swann Perarnau","doi":"10.1109/ipdpsw50202.2020.00157","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00157","url":null,"abstract":"The question of efficient dynamic allocation of compute-node resources, such as cores, by independent libraries or runtime systems can be an nightmare. Scientists writing application components have no way to efficiently specify and compose resource-hungry components. As application software stacks become deeper and the interaction of multiple runtime layers compete for resources from the operating system, it has become clear that intelligent cooperation is needed. Resources such as compute cores, in-package memory, and even electrical power must be orchestrated dynamically across application components, with the ability to query each other and respond appropriately. A more integrated solution would reduce intra-application resource competition and improve performance. Furthermore, application runtime systems could request and allocate specific hardware assets and adjust runtime tuning parameters up and down the software stack.The goal of this workshop is to gather and share the latest scholarly research from the community working on these issues, at all levels of the HPC software stack. This include thread allocation, resource arbitration and management, containers, and so on, from runtime-system designers to compilers. We will also use panel sessions and keynote talks to discuss these issues, share visions, and present solutions.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130307713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00044
GrAPL 2020 Keynote Speaker: The GraphIt Universal Graph Framework: Achieving High Performance across Algorithms, Graph Types, and Architectures
Saman P. Amarasinghe
In recent years, large graphs with billions of vertices and trillions of edges have emerged in many domains, such as social network analytics, machine learning, physical simulations, and biology. However, optimizing the performance of graph applications is notoriously difficult due to irregular memory access patterns and load imbalance across cores. The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, and the features of the underlying hardware. No single set of optimizations or single hardware platform works well across all applications.
{"title":"GrAPL 2020 Keynote Speaker The GraphIt Universal Graph Framework: Achieving HighPerformance across Algorithms, Graph Types, and Architectures","authors":"Saman P. Amarasinghe","doi":"10.1109/IPDPSW50202.2020.00044","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00044","url":null,"abstract":"In recent years, large graphs with billions of vertices and trillions of edges have emerged in many domains, such as social network analytics, machine learning, physical simulations, and biology. However, optimizing the performance of graph applications is notoriously difficult due to irregular memory access patterns and load imbalance across cores. The performance of graph programs depends highly on the algorithm, the size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or single hardware platform works well across all applications.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133877968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00060
PDCunplugged: A Free Repository of Unplugged Parallel Distributed Computing Activities
Suzanne J. Matthews
Integrating parallel and distributed computing (PDC) topics into core computing courses is of increasing interest to educators. However, there is a question of how best to introduce PDC to undergraduates. Several educators have proposed the use of “unplugged activities”, such as role-playing dramatizations and analogies, to introduce PDC concepts. Yet, unplugged activities for PDC are widely scattered and often difficult to find, making it challenging for educators to create and incorporate unplugged interventions in their classrooms. The PDCunplugged project seeks to rectify these issues by providing a free repository where educators can find and share unplugged activities related to PDC. The existing curation contains nearly forty unique unplugged activities collected from thirty years of PDC literature and from across the Internet, and maps each activity to relevant CS2013 PDC knowledge units and TCPP PDC topic areas. Learn more about the project at pdcunplugged.org.
{"title":"PDCunplugged: A Free Repository of Unplugged Parallel Distributed Computing Activities","authors":"Suzanne J. Matthews","doi":"10.1109/IPDPSW50202.2020.00060","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00060","url":null,"abstract":"Integrating parallel and distributed computing (PDC) topics in core computing courses is a topic of increasing interest for educators. However, there is a question of how best to introduce PDC to undergraduates. Several educators have proposed the use of “unplugged activities”, such as role-playing dramatizations and analogies, to introduce PDC concepts. Yet, unplugged activities for PDC are widely-scattered and often difficult to find, making it challenging for educators to create and incorporate unplugged interventions in their classrooms. The PDCunplugged project seeks to rectify these issues by providing a free repository where educators can find and share unplugged activities related to PDC. The existing curation contains nearly forty unique unplugged activities collected from thirty years of the PDC literature and from all over the Internet, and maps each activity to relevant CS2013 PDC knowledge units and TCPP PDC topic areas. Learn more about the project at pdcunplugged.org.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114006888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00140
Importance of Selecting Data Layouts in the Tsunami Simulation Code
Takumi Kishitani, K. Komatsu, Masayuki Sato, A. Musa, Hiroaki Kobayashi
Exploiting memory performance is one of the keys to accelerating memory-intensive applications. One way to improve memory performance is to make memory accesses efficient. Since the memory access pattern changes depending on the data layout, choosing an appropriate data layout is necessary for effective memory access. This paper focuses on tsunami simulation as a high performance computing application that requires high memory performance. To examine the performance variation due to data layouts, several data layouts are applied to the tsunami simulation. The evaluation results clarify that the performance of the tsunami simulation is sensitive to the input data, the computing system, and the data layout. The execution time of the tsunami simulation with an array of structures is much longer than with a discrete array or a structure of arrays. The discrete array and the structure of arrays are not uniformly faster; their relative performance changes with the computing system and the input data. Based on these observations, this paper demonstrates the importance of data layout selection for exploiting memory performance.
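A minimal NumPy sketch of the layout trade-off discussed above; the field names are hypothetical stand-ins for simulation state. An array of structures (AoS) interleaves fields, so sweeping one field strides through memory, while a structure of arrays (SoA) keeps each field contiguous.

```python
import numpy as np

n = 1_000_000

# Array of structures (AoS): fields interleaved per grid point, so touching
# one field strides across the interleaved records.
aos = np.zeros(n, dtype=[("height", "f8"), ("flux_x", "f8"), ("flux_y", "f8")])

# Structure of arrays (SoA): each field in its own contiguous array,
# so a per-field sweep streams through memory.
soa_height = np.zeros(n)
soa_flux_x = np.zeros(n)

# The same update either way; the SoA form reads/writes contiguous memory.
aos["height"] += 0.1 * aos["flux_x"]  # strided access pattern
soa_height += 0.1 * soa_flux_x        # contiguous, vectorizes cleanly
```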
{"title":"Importance of Selecting Data Layouts in the Tsunami Simulation Code","authors":"Takumi Kishitani, K. Komatsu, Masayuki Sato, A. Musa, Hiroaki Kobayashi","doi":"10.1109/IPDPSW50202.2020.00140","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00140","url":null,"abstract":"Exploiting the memory performance is one of the keys to accelerate the memory-intensive applications. A way for improving the memory performance is to make memory accesses efficient. Since the memory access pattern changes depending on data layouts, it is necessary for effective memory access to choose the appropriate data layout. This paper focuses on the tsunami simulation as one of the high performance computing applications that require the high memory performance. To examine the performance variance due to the data layouts, several data layouts are applied to the tsunami simulation. From the evaluation results, this paper clarifies that the performance of the tsunami simulation is sensitive to the input data, the computing systems, and the data layouts. The execution time of the tsunami simulation with an array of structures is much longer than those with a discrete array and a structure of arrays. The performances of the discrete array and the structure of arrays are not high in specific cases but changed according to the computing systems and the input data. Based on these observations, this paper indicates the importance of the data layout selection to exploit the memory performance.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122683193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00052
Kronecker Graph Generation with Ground Truth for 4-Cycles and Dense Structure in Bipartite Graphs
Trevor Steil, Scott McMillan, G. Sanders, R. Pearce, Benjamin W. Priest
We demonstrate that nonstochastic Kronecker graph generators produce massive-scale bipartite graphs with ground-truth global and local properties, and discuss their use for validation of graph analytics. Given two small connected scale-free graphs with adjacency matrices $A$ and $B$, their Kronecker product graph [1] has adjacency matrix $C = A \otimes B$. We first demonstrate that having one factor $A$ non-bipartite (alternatively, adding all self loops to a bipartite $A$) with the other factor $B$ bipartite ensures $\mathcal{G}_C$ is bipartite and connected. Formulas for ground truth of many graph properties (including degree, diameter, and eccentricity) carry over directly from the general case presented in previous work [2], [3]. However, the analysis of higher-order structure and dense structure is different in bipartite graphs, as no odd-length cycles exist (including triangles) and the densest possible structures are bicliques. We derive formulas that give ground truth for 4-cycles (a.k.a. squares or butterflies) at every vertex and edge in $\mathcal{G}_C$. Additionally, we demonstrate that bipartite communities (dense vertex subsets) in the factors $A$, $B$ yield dense bipartite communities in the Kronecker product $C$. We also discuss interesting properties of Kronecker product graphs revealed by the formulas and their impact on using them as benchmarks with ground truth for various complex analytics. For example, for connected $A$ and $B$ of nontrivial size, $\mathcal{G}_C$ has 4-cycles at vertices/edges associated with vertices/edges in $A$ and $B$ that have none, making it difficult to generate graphs with ground-truth bipartite generalizations of truss decomposition (e.g., the k-wing decomposition of [4]).
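As a small, hedged companion to the construction (not the paper's generator or its per-vertex formulas), the sketch below forms $C = A \otimes B$ with a non-bipartite $A$ and bipartite $B$, verifies that the product is bipartite and connected, and counts global 4-cycles via the standard closed-4-walk identity for simple graphs; the tiny factor graphs are chosen only for illustration.

```python
import numpy as np
import networkx as nx

# Non-bipartite factor A (a triangle) and bipartite factor B (a 3-vertex path)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
B = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])

C = np.kron(A, B)  # adjacency matrix of the Kronecker product graph

G = nx.from_numpy_array(C)
print(nx.is_bipartite(G), nx.is_connected(G))  # True True

# Closed-4-walk identity for simple graphs:
#   tr(C^4) = 8 * (#4-cycles) + 2 * sum(deg^2) - 2 * |E|
deg = C.sum(axis=1)
m = C.sum() / 2
closed_walks = np.trace(np.linalg.matrix_power(C, 4))
n_squares = (closed_walks - 2 * (deg @ deg) + 2 * m) / 8
print(int(n_squares))  # 3 for this pair of factors
```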
{"title":"Kronecker Graph Generation with Ground Truth for 4-Cycles and Dense Structure in Bipartite Graphs","authors":"Trevor Steil, Scott McMillan, G. Sanders, R. Pearce, Benjamin W. Priest","doi":"10.1109/IPDPSW50202.2020.00052","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00052","url":null,"abstract":"We demonstrate nonstochastic Kronecker graph generators produce massive-scale bipartite graphs with ground truth global and local properties and discuss their use for validation of graph analytics. Given two small connected scalefree graphs with adjacency matrices $A$ and $B$, their Kronecker product graph [1] has adjacency matrix $C=Aotimes B$. We first demonstrate that having one factor $A$ non-bipartite (alternatively, adding all self loops to a bipartite $A$) with other factor $B$ bipartite ensures $mathcal {G}c$ is bipartite and connected. Formulas for ground truth of many graph properties (including degree, diameter, and eccentricity) carry over directly from the general case presented in previous work [2], [3]. However, the analysis of higher-order structure and dense structure is different in bipartite graphs, as no odd-length cycles exist (including triangles) and the densest possible structures are bicliques. We derive formulas to give ground truth for 4-cycles (a.k. a. squares or butterflies) at every vertex and edge in $mathcal {G}c$. Additionally, we demonstrate that bipartite communities (dense vertex subsets) in the factors $A, B$ yield dense bipartite communities in the Kronecker product $C.$ We additionally discuss interesting properties of Kronecker product graphs revealed by the formulas an their impact on using them as benchmarks with ground truth for various complex analytics. For example, for connected $A$ and $B$ of nontrivial size, $mathcal {G}c$ has 4-cycles at vertices/edges associated with vertices/edges in $A$ and $B$ that have none, making it difficult to generate graphs with ground truth bipartite generalizations of truss decomposition (e.g., the k-wing decomposition of [4]).","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121274152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00127
Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime
Yu Pei, Qinglei Cao, G. Bosilca, P. Luszczek, V. Eijkhout, J. Dongarra
Stencil computations and general sparse matrix-vector products (SpMV) are key components of many algorithms, such as geometric multigrid and Krylov solvers. Their low arithmetic intensity, however, means that memory bandwidth and network latency are the performance-limiting factors. The current architectural trend favors computation over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme that minimizes network latency in repeated sparse matrix-vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing the communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation-communication overlap that is inherent in a dataflow task-based runtime system such as PaRSEC, to demonstrate their combined benefits. We implemented the 2D five-point stencil (Jacobi iteration) in PETSc and over PaRSEC in two flavors, full communication (base-PaRSEC) and CA-PaRSEC, which operate directly on a 2D compute grid. Our results on two clusters, NaCL and Stampede2, indicate that we can achieve a 2X speedup over the standard SpMV solution implemented in PETSc; in certain cases, when kernel execution does not dominate the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over the base-PaRSEC implementation on NaCL and Stampede2, respectively.
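To make the communication-avoiding idea concrete, here is a hedged single-node NumPy sketch, not the paper's PaRSEC implementation: each local block is padded with an s-deep halo, and s Jacobi sweeps run per halo exchange, trading replicated halo work for s-times-fewer messages.

```python
import numpy as np

def jacobi_sweeps_ca(block_with_halo, s):
    """Run s five-point Jacobi sweeps on a block padded with an s-deep halo.
    Each sweep consumes one halo layer, so no exchange is needed until all
    s sweeps finish: the CA trade of redundant halo work for fewer messages."""
    u = block_with_halo.astype(float)
    for step in range(s):
        lo, hi = step + 1, u.shape[0] - step - 1
        # NumPy evaluates the whole right-hand side before assigning,
        # so this is a true Jacobi (not Gauss-Seidel) update.
        u[lo:hi, lo:hi] = 0.25 * (u[lo-1:hi-1, lo:hi] + u[lo+1:hi+1, lo:hi] +
                                  u[lo:hi, lo-1:hi-1] + u[lo:hi, lo+1:hi+1])
    return u[s:-s, s:-s]  # the interior block, advanced by s iterations

# 8x8 interior block padded with a 2-deep halo: 2 sweeps per halo exchange
interior = jacobi_sweeps_ca(np.random.rand(12, 12), s=2)
print(interior.shape)  # (8, 8)
```

In a distributed run, the halo values would come from neighbor ranks once per s sweeps instead of once per sweep, which is where the latency saving comes from.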
{"title":"Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime","authors":"Yu Pei, Qinglei Cao, G. Bosilca, P. Luszczek, V. Eijkhout, J. Dongarra","doi":"10.1109/IPDPSW50202.2020.00127","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00127","url":null,"abstract":"Stencil computation or general sparse matrix-vector product (SpMV) are key components in many algorithms like geometric multigrid or Krylov solvers. But their low arithmetic intensity means that memory bandwidth and network latency will be the performance limiting factors. The current architectural trend favors computations over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme to minimize network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors, full communications (base-PaRSEC) and CA-PaRSEC which operate directly on a 2D compute grid. Our results running on two clusters, NaCL and Stampede2 indicate that we can achieve 2X speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution is not dominating the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over base-PaRSEC implementation on NaCL and Stampede2 respectively.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115515028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00054
EduPar-20 Keynote Speaker
Martin Langhammer
The fields of computer and information science and engineering (CISE) are central to nearly all of society’s needs, opportunities, and challenges. The US National Science Foundation (NSF) was created 70 years ago with a broad mission to promote the progress of science and to catalyze societal and economic benefits. NSF, largely through its CISE directorate, which has an annual budget of more than $1B, accounts for over 85% of federally funded, academic, fundamental computer science research in the US. My talk will give an overview of NSF/CISE research, education, and research infrastructure programs, and relate them to the technical and societal trends and topics that will impact their future trajectory. My talk will highlight opportunity areas for education and workforce development across the computing and information sciences, with a particular emphasis on parallelism and advanced computing and information topics.
{"title":"EduPar-20 Keynote Speaker","authors":"Martin Langhammer","doi":"10.1109/ipdpsw50202.2020.00054","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00054","url":null,"abstract":"The fields of computer and information science and engineering (CISE) are central to nearly all of society’s needs, opportunities, and challenges. The US National Science Foundation (NSF) was created 70 years ago with a broad mission to promote the progress of science and to catalyze societal and economic benefits. NSF, largely through its CISE directorate which has an annual budget of more than $1B, accounts for over 85% of federally-funded, academic, fundamental computer science research in the US. My talk will give an overview of NSF/CISE research, education, and research infrastructure programs, and relate them to the technical and societal trends and topics that will impact their future trajectory. My talk will highlight opportunity areas for education and workforce development across the computing and information sciences, with a particular emphasis on parallelism and advanced computing and information topics.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115413213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}