Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00077
Chen Zeng, Yifan Wang, Fan Liang, Xiaohui Peng
With the advancement of the IoT market and technology, sensor-equipped devices belonging to different interest groups have more opportunities to cooperate to fulfill an assignment. Ensuring agreement on correct sensed data in such a low-trust environment is a new challenge in today's market. To bridge the gap, we propose Pinocchio, a blockchain-based algorithm that tolerates faults in Wireless Sensor Networks (WSNs) and defends sensed data against malicious attacks such as data tampering and masquerading. Compared with other distributed approaches to sensor fault tolerance, Pinocchio greatly reduces the message complexity of the entire network to $O(N)$ and that of a single node to $O(1)$. Considering the possible waste of resources caused by vicious competition for hash power in a blockchain-based approach, we design the Geppetto algorithm to supervise and control hash power in a distributed manner; its effectiveness is demonstrated by experiments.
Title: Pinocchio: A Blockchain-Based Algorithm for Sensor Fault Tolerance in Low Trust Environment
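The complexity claim in the abstract can be illustrated with a hypothetical message count (our sketch, not the paper's protocol): pairwise all-to-all voting among $N$ sensors costs $O(N^2)$ messages, while a scheme in which each node submits one signed reading to a block proposer costs $O(N)$ in total and $O(1)$ per node.

```python
# Hypothetical illustration of message complexity; the function names and
# the "submit to a block proposer" model are ours, not Pinocchio's design.

def all_to_all_messages(n: int) -> int:
    # every node sends its reading to every other node: O(N^2) overall
    return n * (n - 1)

def chain_broadcast_messages(n: int) -> int:
    # each node sends one signed reading to the current proposer: O(N) overall
    return n

for n in (10, 100):
    print(n, all_to_all_messages(n), chain_broadcast_messages(n))
```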
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00147
Amit Ruhela, Shulei Xu, K. V. Manian, H. Subramoni, D. Panda
Communication interfaces of High Performance Computing (HPC) systems, Cloud middleware, and Deep Learning (DL) frameworks have been continually evolving to meet the ever-increasing communication demands placed on them by HPC, Cloud, and DL applications. Modern high-performance interconnects such as InfiniBand EDR and HDR are capable of delivering 100 Gbps and 200 Gbps speeds, respectively. However, no previous study has demonstrated how much benefit an end-user in the HPC, Cloud, and DL computing domains can expect from newer generations of these interconnects over the older ones. In this paper, we evaluate the InfiniBand EDR and HDR high-performance interconnects over the PCIe Gen3 interface with HPC, Cloud, and DL workloads. Our comprehensive analysis, done at different levels, provides a global scope of the impact these modern interconnects have on the performance of HPC, Cloud, and DL applications. The results of our experiments show that the latest InfiniBand HDR interconnect gives the best performance for all three computing domains.
Title: Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR
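A back-of-envelope alpha-beta model (ours, not the paper's measurements; the 1 µs latency is an assumed constant) shows why doubling link bandwidth from EDR's 100 Gb/s to HDR's 200 Gb/s helps large, bandwidth-bound transfers far more than small, latency-bound ones:

```python
# Toy transfer-time model: t = alpha + size/beta.
# alpha = fixed latency in microseconds (assumed), beta = link bandwidth.

def transfer_us(size_bytes, gbps, latency_us=1.0):
    # gbps * 1e3 converts Gb/s into bits per microsecond
    return latency_us + size_bytes * 8 / (gbps * 1e3)

for size in (64, 1 << 20):          # a tiny message vs a 1 MiB message
    edr, hdr = transfer_us(size, 100), transfer_us(size, 200)
    print(size, round(edr, 2), round(hdr, 2), round(edr / hdr, 2))
```

Under this model the 1 MiB transfer speeds up by nearly 2x, while the 64-byte transfer barely changes.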
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00021
Stephen Tridgell, D. Boland, P. Leong, R. Kastner, Alireza Khodamoradi, Siddhartha
The computational complexity of deep learning has led to research efforts to reduce the computation required. The use of low precision is particularly effective on FPGAs, as they are not restricted to byte-addressable operations. However, very low precision activations and weights can have a significant impact on accuracy. This work demonstrates, by exploiting throughput matching, that higher precision on certain layers can be used to recover this accuracy. We apply this to automatic modulation classification of radio signals, leveraging the RF capabilities offered by the Xilinx ZCU111 RFSoC platform. The implemented networks achieve high-speed real-time performance with a classification latency of $\approx 8\,\mu$s and an operational throughput of 488k classifications per second. On the open-source RadioML dataset, we demonstrate how our technique recovers 4.3% in accuracy with the same hardware usage.
Title: Real-time Automatic Modulation Classification using RFSoC
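The accuracy cost of very low precision can be seen with a uniform symmetric quantizer (an illustrative stand-in; the paper's exact quantization scheme may differ): the maximum representation error shrinks as per-layer bit width grows, which is why selectively raising precision on sensitive layers can recover accuracy.

```python
import numpy as np

# Illustrative uniform symmetric quantizer over [-1, 1]; the bit widths and
# the quantizer itself are our assumptions, not the paper's implementation.

def quantize(x, bits):
    levels = 2 ** (bits - 1) - 1          # signed levels per side
    return np.round(np.clip(x, -1, 1) * levels) / levels

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
for bits in (2, 4, 8):
    err = np.abs(quantize(x, bits) - x).max()
    print(bits, round(err, 4))            # worst-case error drops with bits
```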
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00083
N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, Tomohiro Ueno, K. Sano, T. Boku
In recent years, many High Performance Computing (HPC) researchers have been attracted to utilizing Field Programmable Gate Arrays (FPGAs) for HPC applications. Thanks to FPGAs' I/O capabilities, we can use them for communication as well as computation. The difficulty of FPGA development has kept many HPC scientists from utilizing FPGAs in their applications; however, High Level Synthesis (HLS) allows them to do so at a reasonable cost. In this study, we propose a Communication Integrated Reconfigurable CompUting System (CIRCUS) that enables the high-speed interconnection of FPGAs to be utilized from OpenCL. CIRCUS fuses computation and communication into a single pipeline, hiding the communication latency by completely overlapping the two. In this paper, we present the details of the implementation and evaluation results using two benchmarks: pingpong and allreduce.
Title: Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA
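The benefit of fusing computation and communication into one pipeline can be seen with a toy timing model (our numbers, purely illustrative): without overlap each stage pays compute plus communicate, while a deep pipeline pays only the larger of the two per stage after the initial fill.

```python
# Toy pipeline model; stage costs in microseconds are invented for
# illustration and do not come from the CIRCUS paper.

def serial_time(comp_us, comm_us, stages):
    # no overlap: every stage pays compute + communicate
    return stages * (comp_us + comm_us)

def pipelined_time(comp_us, comm_us, stages):
    # fused pipeline: fill once, then one stage per max(compute, communicate)
    return comp_us + comm_us + (stages - 1) * max(comp_us, comm_us)

print(serial_time(5, 4, 100), pipelined_time(5, 4, 100))
```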
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00013
Ali Mokhtari, Chavit Denninnart, M. Salehi
Robustness of a distributed computing system is defined as the ability to maintain its performance in the presence of uncertain parameters. Uncertainty is a key problem in heterogeneous (and even homogeneous) distributed computing systems that perturbs system robustness. Notably, the performance of these systems is perturbed by uncertainty in both task execution time and arrival. Accordingly, our goal is to make the system robust against these uncertainties. Considering task execution time as a random variable, we use probabilistic analysis to develop an autonomous proactive task dropping mechanism to attain our robustness goal. Specifically, we provide a mathematical model that identifies the optimality of a task dropping decision, so that the system robustness is maximized. Then, we leverage the mathematical model to develop a task dropping heuristic that achieves the system robustness within a feasible time complexity. Although the proposed model is generic and can be applied to any distributed system, we concentrate on heterogeneous computing (HC) systems that have a higher degree of exposure to uncertainty than homogeneous systems. Experimental results demonstrate that the autonomous proactive dropping mechanism can improve the system robustness by up to 20%.
Title: Autonomous Task Dropping Mechanism to Achieve Robustness in Heterogeneous Computing Systems
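The dropping decision described in the abstract can be sketched probabilistically. A minimal sketch, assuming a normal execution-time distribution and a fixed dropping threshold (both our assumptions; the paper derives its own optimality condition): drop a task when its chance of finishing before the deadline falls too low, freeing capacity for tasks that can still succeed.

```python
import math

# Sketch only: the normal model, threshold value, and function names are
# ours, not the authors' mechanism.

def prob_meets_deadline(mu, sigma, remaining):
    # P(T <= remaining) for T ~ Normal(mu, sigma), via the error function
    return 0.5 * (1 + math.erf((remaining - mu) / (sigma * math.sqrt(2))))

def should_drop(mu, sigma, remaining, threshold=0.25):
    # proactively drop when the success probability is below the threshold
    return prob_meets_deadline(mu, sigma, remaining) < threshold

print(should_drop(10, 2, 14))  # ample slack: keep the task
print(should_drop(10, 2, 6))   # deadline almost surely missed: drop it
```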
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00060
Suzanne J. Matthews
Integrating parallel and distributed computing (PDC) topics in core computing courses is a topic of increasing interest for educators. However, there is a question of how best to introduce PDC to undergraduates. Several educators have proposed the use of “unplugged activities”, such as role-playing dramatizations and analogies, to introduce PDC concepts. Yet, unplugged activities for PDC are widely-scattered and often difficult to find, making it challenging for educators to create and incorporate unplugged interventions in their classrooms. The PDCunplugged project seeks to rectify these issues by providing a free repository where educators can find and share unplugged activities related to PDC. The existing curation contains nearly forty unique unplugged activities collected from thirty years of the PDC literature and from all over the Internet, and maps each activity to relevant CS2013 PDC knowledge units and TCPP PDC topic areas. Learn more about the project at pdcunplugged.org.
Title: PDCunplugged: A Free Repository of Unplugged Parallel Distributed Computing Activities
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00140
Takumi Kishitani, K. Komatsu, Masayuki Sato, A. Musa, Hiroaki Kobayashi
Exploiting memory performance is one of the keys to accelerating memory-intensive applications. One way to improve memory performance is to make memory accesses efficient. Since the memory access pattern changes depending on the data layout, effective memory access requires choosing the appropriate layout. This paper focuses on tsunami simulation as a high performance computing application that requires high memory performance. To examine the performance variance due to data layouts, several layouts are applied to the tsunami simulation. The evaluation results clarify that the performance of the tsunami simulation is sensitive to the input data, the computing system, and the data layout. The execution time with an array of structures is much longer than with a discrete array or a structure of arrays. Neither the discrete array nor the structure of arrays is uniformly faster; their relative performance changes with the computing system and the input data. Based on these observations, this paper shows the importance of data layout selection for exploiting memory performance.
Title: Importance of Selecting Data Layouts in the Tsunami Simulation Code
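The layout distinction the paper studies can be made concrete in NumPy terms (an illustrative sketch; the field names are invented, not the simulation code's variables): in an array of structures (AoS) the fields of one grid cell are adjacent, so reading a single field strides through memory, while a structure of arrays (SoA) keeps each field contiguous, which favors vectorized, memory-bound kernels.

```python
import numpy as np

# AoS: one structured record per cell (fields "h", "u", "v" are hypothetical)
n = 4
aos = np.zeros(n, dtype=[("h", "f8"), ("u", "f8"), ("v", "f8")])

# SoA: one dense array per field
soa = {name: np.zeros(n) for name in ("h", "u", "v")}

# Reading field "h" from the AoS skips over the other fields of each
# 24-byte record (stride 24); the SoA field is a dense 8-byte-stride array.
print(aos["h"].strides)   # stride equals the full record size
print(soa["h"].strides)   # stride equals one float64
```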
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00052
Trevor Steil, Scott McMillan, G. Sanders, R. Pearce, Benjamin W. Priest
We demonstrate that nonstochastic Kronecker graph generators produce massive-scale bipartite graphs with ground-truth global and local properties, and we discuss their use for validation of graph analytics. Given two small connected scale-free graphs with adjacency matrices $A$ and $B$, their Kronecker product graph [1] has adjacency matrix $C = A \otimes B$. We first demonstrate that making one factor $A$ non-bipartite (alternatively, adding all self-loops to a bipartite $A$) while the other factor $B$ is bipartite ensures that $\mathcal{G}_C$ is bipartite and connected. Formulas for the ground truth of many graph properties (including degree, diameter, and eccentricity) carry over directly from the general case presented in previous work [2], [3]. However, the analysis of higher-order structure and dense structure is different in bipartite graphs, as no odd-length cycles exist (including triangles) and the densest possible structures are bicliques. We derive formulas that give the ground truth for 4-cycles (a.k.a. squares or butterflies) at every vertex and edge in $\mathcal{G}_C$. Additionally, we demonstrate that bipartite communities (dense vertex subsets) in the factors $A$, $B$ yield dense bipartite communities in the Kronecker product $C$. We also discuss interesting properties of Kronecker product graphs revealed by the formulas and their impact on using them as benchmarks with ground truth for various complex analytics. For example, for connected $A$ and $B$ of nontrivial size, $\mathcal{G}_C$ has 4-cycles at vertices/edges associated with vertices/edges in $A$ and $B$ that have none, making it difficult to generate graphs with ground-truth bipartite generalizations of truss decomposition (e.g., the k-wing decomposition of [4]).
Title: Kronecker Graph Generation with Ground Truth for 4-Cycles and Dense Structure in Bipartite Graphs
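A tiny instance of the construction can be checked directly (our toy factors, not the paper's generators): take $A$ a triangle (non-bipartite) and $B$ a two-edge star (bipartite), form $C = A \otimes B$ with `np.kron`, verify bipartiteness by 2-coloring, and count butterflies with the standard common-neighborhood formula.

```python
import numpy as np
from itertools import combinations

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # triangle: non-bipartite
B = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])  # two-edge star: bipartite
C = np.kron(A, B)                                # Kronecker product graph

def is_bipartite(M):
    # greedy 2-coloring; a conflict on any edge means an odd cycle exists
    n = len(M)
    color = [-1] * n
    for s in range(n):
        if color[s] != -1:
            continue
        color[s] = 0
        stack = [s]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(M[u]):
                if color[v] == -1:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True

def butterflies(M):
    # each 4-cycle is counted once for each of its two opposite vertex
    # pairs, so sum C(shared neighbors, 2) over pairs and halve
    P = M @ M  # P[u, v] = number of common neighbors of u and v
    total = sum(P[u, v] * (P[u, v] - 1) // 2
                for u, v in combinations(range(len(M)), 2))
    return total // 2

print(is_bipartite(C), butterflies(C))
```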
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00127
Yu Pei, Qinglei Cao, G. Bosilca, P. Luszczek, V. Eijkhout, J. Dongarra
Stencil computations and general sparse matrix-vector products (SpMV) are key components of many algorithms, such as geometric multigrid and Krylov solvers. However, their low arithmetic intensity means that memory bandwidth and network latency are the performance-limiting factors. The current architectural trend favors computation over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme that minimizes network latency in repeated sparse matrix-vector multiplication, replicating remote work in order to delay communications on the critical path. Focusing on minimizing the communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation-communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC, and demonstrate their combined benefits. We implemented the 2D five-point stencil (Jacobi iteration) in PETSc and over PaRSEC in two flavors, full communication (base-PaRSEC) and CA-PaRSEC, both operating directly on a 2D compute grid.
Our results on two clusters, NaCL and Stampede2, indicate that we can achieve a 2X speedup over the standard SpMV solution implemented in PETSc; in certain cases, when kernel execution does not dominate the execution time, CA-PaRSEC achieves up to 57% and 33% speedup over base-PaRSEC on NaCL and Stampede2, respectively.
Title: Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime
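The kernel being distributed above is the classic 2D five-point Jacobi sweep; a minimal serial sketch (grid size and boundary values are ours, for illustration) shows the computation that a CA scheme would repeat locally between halo exchanges:

```python
import numpy as np

# Serial 2D five-point Jacobi sweep. In a CA distributed version, each rank
# would exchange s layers of halo cells, then run s sweeps locally before
# communicating again, trading redundant work for fewer messages.

def jacobi_sweep(u):
    v = u.copy()  # boundaries are carried over unchanged
    # each interior point becomes the average of its four neighbors
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((8, 8))
u[0, :] = 1.0              # hot boundary along the top edge
for _ in range(100):
    u = jacobi_sweep(u)
print(round(u[1, 4], 3))   # interior temperature near the hot edge
```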
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00054
Martin Langhammer
The fields of computer and information science and engineering (CISE) are central to nearly all of society’s needs, opportunities, and challenges. The US National Science Foundation (NSF) was created 70 years ago with a broad mission to promote the progress of science and to catalyze societal and economic benefits. NSF, largely through its CISE directorate which has an annual budget of more than $1B, accounts for over 85% of federally-funded, academic, fundamental computer science research in the US. My talk will give an overview of NSF/CISE research, education, and research infrastructure programs, and relate them to the technical and societal trends and topics that will impact their future trajectory. My talk will highlight opportunity areas for education and workforce development across the computing and information sciences, with a particular emphasis on parallelism and advanced computing and information topics.
Title: EduPar-20 Keynote Speaker