A System for High Performance Mining on GDELT Data
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00182
Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, J. Langguth
We design a system for efficient in-memory analysis of data from the GDELT database of news events. The specialization of the system allows us to avoid the inefficiencies of existing alternatives and to make full use of modern parallel high-performance computing hardware. We then present a series of experiments showcasing the system’s ability to analyze correlations in the entire GDELT 2.0 database, which contains more than a billion news items. The results reveal large-scale trends in today’s online news.
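To make the notion of “analyzing correlations” in event data concrete, the following is a minimal, hypothetical sketch (not the paper’s system): it correlates two synthetic daily count series of the kind one could extract from GDELT event categories.

```python
# Conceptual sketch (not the paper's system): correlating daily counts of two
# hypothetical GDELT event categories held entirely in memory.
import numpy as np

rng = np.random.default_rng(0)
days = 365
protests = rng.poisson(lam=120, size=days).astype(float)   # hypothetical daily counts
appeals = 0.6 * protests + rng.normal(0, 15, size=days)    # synthetic, partly correlated series

# Pearson correlation between the two daily time series.
corr = np.corrcoef(protests, appeals)[0, 1]
print(f"Pearson correlation over {days} days: {corr:.3f}")
```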
{"title":"A System for High Performance Mining on GDELT Data","authors":"Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, J. Langguth","doi":"10.1109/IPDPSW50202.2020.00182","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00182","url":null,"abstract":"We design a system for efficient in-memory analysis of data from the GDELT database of news events. The specialization of the system allows us to avoid the inefficiencies of existing alternatives, and make full use of modern parallel high-performance computing hardware. We then present a series of experiments showcasing the system’s ability to analyze correlations in the entire GDELT 2.0 database containing more than a billion news items. The results reveal large scale trends in the world of today’s online news.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124214643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An incremental GraphBLAS solution for the 2018 TTC Social Media case study
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00045
Márton Elekes, Gábor Szárnyas
Graphs are increasingly important for modelling and analysing connected data sets. Traditionally, graph analytical tools targeted global fixed-point computations, while graph databases focused on simpler transactional read operations such as retrieving the neighbours of a node. However, recent applications of graph processing (such as financial fraud detection and serving personalized recommendations) often necessitate a mix of the two workload profiles. A potential approach to tackle these complex workloads is to formulate graph algorithms in the language of linear algebra. To this end, the recent GraphBLAS standard defines a linear algebraic graph computational model and an API for implementing such algorithms. To investigate its usability and efficiency, we have implemented a GraphBLAS solution for the “Social Media” case study of the 2018 Transformation Tool Contest. This paper presents our solution along with an incrementalized variant to improve its runtime for repeated evaluations. Preliminary results show that the GraphBLAS-based solution is competitive but implementing it requires significant development efforts.
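As a rough illustration of the linear-algebra formulation that GraphBLAS standardizes, the sketch below uses plain SciPy (not the GraphBLAS API or the authors’ solution): retrieving the neighbours of a node becomes a sparse matrix-vector product, with a threshold standing in for the Boolean semiring.

```python
# Minimal sketch of the linear-algebra view of a graph query: one-hop neighbour
# retrieval as a sparse matrix-vector product (threshold mimics the Boolean semiring).
import numpy as np
from scipy.sparse import csr_matrix

# Adjacency matrix of a small directed graph with edges 0->1, 0->2, 2->3.
rows, cols = [0, 0, 2], [1, 2, 3]
A = csr_matrix((np.ones(3), (rows, cols)), shape=(4, 4))

v = np.zeros(4)
v[0] = 1.0                        # "frontier" vector containing only node 0

neighbours = (A.T @ v) > 0        # nodes reachable from the frontier in one hop
print(np.nonzero(neighbours)[0])  # -> [1 2]
```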
{"title":"An incremental GraphBLAS solution for the 2018 TTC Social Media case study","authors":"Márton Elekes, Gábor Szárnyas","doi":"10.1109/IPDPSW50202.2020.00045","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00045","url":null,"abstract":"Graphs are increasingly important for modelling and analysing connected data sets. Traditionally, graph analytical tools targeted global fixed-point computations, while graph databases focused on simpler transactional read operations such as retrieving the neighbours of a node. However, recent applications of graph processing (such as financial fraud detection and serving personalized recommendations) often necessitate a mix of the two workload profiles. A potential approach to tackle these complex workloads is to formulate graph algorithms in the language of linear algebra. To this end, the recent GraphBLAS standard defines a linear algebraic graph computational model and an API for implementing such algorithms. To investigate its usability and efficiency, we have implemented a GraphBLAS solution for the “Social Media” case study of the 2018 Transformation Tool Contest. This paper presents our solution along with an incrementalized variant to improve its runtime for repeated evaluations. Preliminary results show that the GraphBLAS-based solution is competitive but implementing it requires significant development efforts.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124438043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Tropical Semiring Multiple Matrix-Product Library on GPUs: (not just) a step towards RNA-RNA Interaction Computations
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00037
Brandon Gildemaster, P. Ghalsasi, S. Rajopadhye
RNA-RNA interaction (RRI) is important in processes such as gene regulation, and certain classes of RRI are known to play roles in various diseases including cancer and Alzheimer’s. Other classes are not as well studied but could have biological importance, so there is a need for high-throughput tools that enable the study of these molecules. Current computational tools for RRI are slow, with execution times of days, weeks or even months for large experiments, because the algorithms have time and space complexities of $\mathrm{O}(N^{3}M^{3})$ and $\mathrm{O}(N^{2}M^{2})$, respectively, for two sequences of lengths $N$ and $M$. No GPU parallelization of such algorithms exists. We show how the most computationally expensive portion of RRI base pair maximization algorithms, an $\mathrm{O}((NM)^{3})$ computation, can be expressed as $\mathrm{O}(N^{3})$ instances of matrix products in the max-plus (tropical) semiring. We therefore propose an optimized library for the core computation of BPMax, an RRI algorithm based on weighted base pair counting. Our library multiplies multiple pairs of matrices in the max-plus semiring. We explore multiple tradeoffs: a square matrix product library attains close to the machine peak, but performs 6-fold unnecessary computation and has a $2\times$ higher data footprint, while the one with the minimum work and memory footprint suffers from thread divergence and unbalanced load. We also specialize for upper banded (trapezoidal shaped) matrices, which are relevant to a windowed version of the algorithm. STOP PRESS: just before we submitted the camera-ready version of the paper, we incorporated our library into a GPU implementation of the complete BPMax algorithm. We will report performance numbers at the workshop.
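For readers unfamiliar with tropical algebra, the following minimal NumPy sketch shows the max-plus matrix product that the library accelerates on GPUs; it is a conceptual illustration only, not the authors’ CUDA implementation, which batches many such products and exploits banded structure.

```python
# Max-plus (tropical) semiring matrix product: semiring "addition" is max,
# semiring "multiplication" is +, and the semiring zero is -infinity.
import numpy as np

def maxplus_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # C[i, j] = max_k (A[i, k] + B[k, j])
    return np.max(A[:, :, None] + B[None, :, :], axis=1)

NEG_INF = -np.inf                      # semiring zero
A = np.array([[0.0, 2.0], [NEG_INF, 1.0]])
B = np.array([[1.0, NEG_INF], [3.0, 0.0]])
print(maxplus_matmul(A, B))            # [[5. 2.], [4. 1.]]
```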
{"title":"A Tropical Semiring Multiple Matrix-Product Library on GPUs: (not just) a step towards RNA-RNA Interaction Computations","authors":"Brandon Gildemaster, P. Ghalsasi, S. Rajopadhye","doi":"10.1109/IPDPSW50202.2020.00037","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00037","url":null,"abstract":"RNA-RNA interaction (RRI) is important in processes such as gene regulation, and certain classes of RRI are known to play roles in various diseases including cancer and Alzheimer’s. Other classes are not as well studied but could have biological importance, thus there is a need for highthroughput tools which enable the study of these molecules. Current computational tools for RRI are slow: execution times in days, weeks or even months for large experiments, because the algorithms have time and space complexity, respectively $mathrm {O}( N^{3}M^{3})$ and $mathrm {O}( N^{2}M^{2})$, for two sequences length $N$ and $M$. No GPU parallelization of such algorithms exists. We show how the most computationally expensive portion of RRI base pair maximization algorithms, an $mathrm {O}( NM ) ^{3}$ computation, can be expressed as $mathrm {O}( N^{3})$ instances of such matrix products. We therefore propose an optimized library for the core computation of BPMax, an RRI algorithm based on weighted base pair counting. Our library multiplies multiple pairs of matrices in the max-plus semiring. We explore multiple tradeoffs: a square matrix product library attains close to the machine peak, but does 6-fold unnecessary computations and has $mathrm {a}2 times $ higher data footprint, while the one with the minimum work and memory footprint has thread divergence and unbalanced load. We also specialize for upper banded (trapezoidal shaped) matrices, which are relevant to a windowed version of the algorithm. STOP PRESS: just before we submitted the camera-ready version of the paper, we incorporated our library into a GPU implementation of the complete BPMax algorithm. We will report performance numbers at the workshop.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117061899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computing Hypergraph Homology in Chapel
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00112
J. Firoz, Louis Jenkins, C. Joslyn, Brenda Praggastis, Emilie Purvine, Mark Raugas
In this paper, we discuss our experience in implementing homology computation, in particular Betti number calculations, in the Chapel Hypergraph Library (CHGL). Given a dataset represented as a hypergraph, the Betti number for a particular dimension k indicates how many k-dimensional ‘voids’ are present in the dataset. Computing the Betti numbers involves various array-centric and linear algebra operations. We demonstrate that implementing these operations in Chapel is both concise and intuitive. In addition, we show that Chapel provides language constructs for implementing parallel and distributed execution of the linear algebra kernels with minimal effort. Syntactically, Chapel provides the succinctness of Python, while delivering performance comparable to C++-based packages and better than Julia-based packages for calculating the Betti numbers.
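The linear-algebra core of such a Betti number calculation can be sketched in a few lines; the NumPy example below (illustrative only, not CHGL’s Chapel implementation) uses the identity beta_k = dim C_k - rank(d_k) - rank(d_{k+1}) over the rationals for a hollow triangle.

```python
# Illustrative sketch of the linear algebra behind Betti numbers:
# beta_k = dim C_k - rank(d_k) - rank(d_{k+1}), with ranks computed over Q.
import numpy as np

def betti(dims, boundaries):
    """dims[k] = number of k-cells; boundaries[k] = matrix of d_k: C_k -> C_{k-1}."""
    ranks = {k: (np.linalg.matrix_rank(M) if M.size else 0) for k, M in boundaries.items()}
    return [dims[k] - ranks.get(k, 0) - ranks.get(k + 1, 0) for k in range(len(dims))]

# Hollow triangle: 3 vertices, 3 oriented edges, no 2-cells -> one connected
# component (beta_0 = 1) and one 1-dimensional "void" (beta_1 = 1).
d1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]], dtype=float)
print(betti(dims=[3, 3], boundaries={1: d1}))   # -> [1, 1]
```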
{"title":"Computing Hypergraph Homology in Chapel","authors":"J. Firoz, Louis Jenkins, C. Joslyn, Brenda Praggastis, Emilie Purvine, Mark Raugas","doi":"10.1109/IPDPSW50202.2020.00112","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00112","url":null,"abstract":"In this paper, we discuss our experience in implementing homology computation, in particular the Betti number calculations in Chapel hypergraph Library (CHGL). Given a dataset represented as a hypergraph, a Betti number for a particular dimension k indicates how many k-dimensional ‘voids’ are present in the dataset. Computing the Betti numbers involve various array-centric and linear algebra operations. We demonstrate that implementing these operations in Chapel is both concise and intuitive. In addition, we show that Chapel provides language constructs for implementing parallel and distributed execution of the linear algebra kernels with minimal effort. Syntactically, Chapel provides succinctness of Python, while delivering comparable and better performance than C++-based and Julia-based packages for calculating the Betti numbers respectively.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116432404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Framework for the Evaluation of Parallel and Distributed Computing Educational Resources
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00057
David W. Brown, Vitaly Ford, S. Ghafoor
This paper proposes a classification scheme for categorizing PDC (parallel and distributed computing) educational resources. We also propose an evaluation framework for assessing PDC resources. Under the proposed framework, each resource type has a set of criteria, each with an associated score. When evaluated under the framework, a PDC resource obtains a score that is the sum of the scores of the criteria it satisfies. The evaluation of whether a resource meets a criterion is subjective. We also present our evaluation, using the proposed framework, of PDC educational resources available on the web that are appropriate for CS1, CS2 (Computer Science 1 and 2) and DS/A (Data Structures and Algorithms).
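The scoring rule itself is simple to state; the sketch below illustrates it with hypothetical criteria and weights (not the paper’s actual criteria): a resource’s score is the sum of the scores of the criteria it satisfies.

```python
# Minimal sketch of the framework's scoring rule, using hypothetical criteria.
criteria_scores = {                 # hypothetical criteria and weights
    "covers_parallel_speedup": 2,
    "includes_hands_on_lab": 3,
    "provides_assessment": 1,
}

def score(resource_satisfies: set[str]) -> int:
    # A resource's score is the sum of the scores of the criteria it satisfies.
    return sum(s for name, s in criteria_scores.items() if name in resource_satisfies)

print(score({"covers_parallel_speedup", "provides_assessment"}))   # -> 3
```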
{"title":"A Framework for the Evaluation of Parallel and Distributed Computing Educational Resources","authors":"David W. Brown, Vitaly Ford, S. Ghafoor","doi":"10.1109/IPDPSW50202.2020.00057","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00057","url":null,"abstract":"This paper proposes a classification scheme for categorization of PDC educational resources. We have also proposed an evaluation framework for assessing the PDC resources. Under the proposed framework, each resource type has a set of criteria and an associated score. A PDC resource will obtain a score if evaluated under our proposed framework that is the sum of the scores of the criteria that the resource satisfies. The evaluation of whether a resource met a criterion is subjective. We have also presented our evaluation of PDC educational resources appropriate for CS1, CS2 (Computer Science 1 and 2), and DS/A (Data Structures and Algorithms) available on the web using our proposed framework.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125015253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Workshop 10: APDCM Advances in Parallel and Distributed Computational Models
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00095
J. Bordim, K. Nakano, Susumu Matsumae, M. Shibata
The past thirty years have seen a flurry of activity in the area of parallel and distributed computing. In recent years, novel parallel and distributed computational models have been proposed in the literature, reflecting advances in new computational devices and environments such as optical interconnects, programmable logic arrays, networks of workstations, radio communications, mobile computing, DNA computing, quantum computing, and sensor networks. It is very encouraging to note that the advent of these new models has led to significant advances in the resolution of various difficult problems of practical interest.
{"title":"Workshop 10: APDCM Advances in Parallel and Distributed Computational Models","authors":"J. Bordim, K. Nakano, Susumu Matsumae, M. Shibata","doi":"10.1109/ipdpsw50202.2020.00095","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00095","url":null,"abstract":"The past thirty years have seen a flurry of activity in the area of parallel and distributed computing. In recent years, novel parallel and distributed computational models have been proposed in the literature, reflecting advances in new computational devices and environments such as optical interconnects, programmable logic arrays, networks of workstations, radio communications, mobile computing, DNA computing, quantum computing, sensor networks etc. It is very encouraging to note that the advent of these new models has led to significant advances in the resolution of various difficult problems of practical interest.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"47 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126909402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Workshop 9: PDCO Parallel / Distributed Combinatorics and Optimization
Pub Date: 2020-05-01 | DOI: 10.1109/ipdpsw50202.2020.00089
Grégoire Danoy, D. E. Baz, V. Boyer, B. Dorronsoro, L. Yang, Keqin Li
The IEEE Workshop on Parallel / Distributed Combinatorics and Optimization aims at providing a forum for scientific researchers and engineers on recent advances in the field of parallel or distributed computing for difficult combinatorial optimization problems, such as 0–1 multidimensional knapsack problems, cutting stock problems, scheduling problems, large-scale linear programming problems, nonlinear optimization problems and global optimization problems. Emphasis is placed on new techniques for the solution of these difficult problems, such as cooperative methods for integer programming problems. Techniques based on metaheuristics and nature-inspired paradigms are considered, as are aspects related to Combinatorial Scientific Computing (CSC). In particular, we solicit submissions of original manuscripts on sparse matrix computations, graph algorithms, and original parallel or distributed algorithms. The use of new approaches in parallel and distributed computing, such as GPUs, MICs, FPGAs and volunteer computing, is also considered, as are applications to cloud computing, planning, logistics, manufacturing, finance, telecommunications and computational biology.
{"title":"Workshop 9: PDCO Parallel / Distributed Combinatorics and Optimization","authors":"Grégoire Danoy, D. E. Baz, V. Boyer, B. Dorronsoro, L. Yang, Keqin Li","doi":"10.1109/ipdpsw50202.2020.00089","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00089","url":null,"abstract":"The IEEE Workshop on Parallel / Distributed Combinatorics and Optimization aims at providing a forum for scientific researchers and engineers on recent advances in the field of parallel or distributed computing for difficult combinatorial optimization problems, like 0–1 multidimensional knapsack problems, cutting stock problems, scheduling problems, large scale linear programming problems, nonlinear optimization problems and global optimization problems. Emphasis is placed on new techniques for the solution of these difficult problems like cooperative methods for integer programming problems. Techniques based on metaheuristics and nature-inspired paradigms are considered. Aspects related to Combinatorial Scientific Computing (CSC) are considered. In particular, we solicit submissions of original manuscripts on sparse matrix computations, graph algorithm and original parallel or distributed algorithms. The use of new approaches in parallel and distributed computing like GPU, MIC, FPGA, volunteer computing are considered. Application to cloud computing, planning, logistics, manufacturing, finance, telecommunications and computational biology are considered.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130516453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tangle Ledger for Decentralized Learning
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00144
R. Schmid, Bjarne Pfitzner, Jossekin Beilharz, B. Arnrich, A. Polze
Federated learning has the potential to make machine learning applicable to highly privacy-sensitive domains and distributed datasets. In some scenarios, however, a central server for aggregating the partial learning results is not available. In fully decentralized learning, a network of peer-to-peer nodes collaborates to form a consensus on a global model without a trusted aggregating party. Often, the network consists of Internet of Things (IoT) and Edge computing nodes. Previous approaches for decentralized learning map the gradient batching and averaging algorithm from traditional federated learning to blockchain architectures. In an open network of participating nodes, the threat of adversarial nodes introducing poisoned models into the network increases compared to a federated learning scenario which is controlled by a single authority. Hence, the decentralized architecture must additionally include a machine learning-aware fault tolerance mechanism to address the increased attack surface. We propose a tangle architecture for decentralized learning, where the validity of model updates is checked as part of the basic consensus. We provide an experimental evaluation of the proposed architecture, showing that it performs well in both model convergence and model poisoning protection.
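The following conceptual sketch (with hypothetical names, not the paper’s implementation) illustrates the core idea: a new transaction averages the models of two approved parents, and peers approve it only if it passes a validity check, which rejects obviously poisoned updates.

```python
# Conceptual sketch of tangle-style decentralized learning: combine two approved
# parent models, and approve a candidate only if it passes a validity check.
# All names and the toy loss are hypothetical illustrations.
import numpy as np

def average_parents(parent_a: np.ndarray, parent_b: np.ndarray) -> np.ndarray:
    """Combine two approved parent models by simple parameter averaging."""
    return (parent_a + parent_b) / 2.0

def is_valid(candidate: np.ndarray, reference: np.ndarray,
             eval_loss, tolerance: float = 0.05) -> bool:
    """Approve a candidate model only if it is not much worse than the reference."""
    return eval_loss(candidate) <= eval_loss(reference) + tolerance

# Toy example: "models" are weight vectors, loss is distance to a hidden optimum.
optimum = np.array([1.0, -2.0, 0.5])
loss = lambda w: float(np.linalg.norm(w - optimum))

parent_a = optimum + np.array([0.1, 0.0, -0.1])
parent_b = optimum + np.array([-0.1, 0.1, 0.0])
poisoned = optimum + 5.0                      # adversarial update far from the optimum

candidate = average_parents(parent_a, parent_b)
print(is_valid(candidate, parent_a, loss))    # True: averaged update is accepted
print(is_valid(poisoned, parent_a, loss))     # False: poisoned update is rejected
```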
{"title":"Tangle Ledger for Decentralized Learning","authors":"R. Schmid, Bjarne Pfitzner, Jossekin Beilharz, B. Arnrich, A. Polze","doi":"10.1109/IPDPSW50202.2020.00144","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00144","url":null,"abstract":"Federated learning has the potential to make machine learning applicable to highly privacy-sensitive domains and distributed datasets. In some scenarios, however, a central server for aggregating the partial learning results is not available. In fully decentralized learning, a network of peer-to-peer nodes collaborates to form a consensus on a global model without a trusted aggregating party. Often, the network consists of Internet of Things (IoT) and Edge computing nodes.Previous approaches for decentralized learning map the gradient batching and averaging algorithm from traditional federated learning to blockchain architectures. In an open network of participating nodes, the threat of adversarial nodes introducing poisoned models into the network increases compared to a federated learning scenario which is controlled by a single authority. Hence, the decentralized architecture must additionally include a machine learning-aware fault tolerance mechanism to address the increased attack surface.We propose a tangle architecture for decentralized learning, where the validity of model updates is checked as part of the basic consensus. We provide an experimental evaluation of the proposed architecture, showing that it performs well in both model convergence and model poisoning protection.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"232 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121310173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Case for Explicit Reuse Semantics for RDMA Communication
Pub Date: 2020-05-01 | DOI: 10.1109/IPDPSW50202.2020.00148
Scott Levy, Patrick M. Widener, C. Ulmer, T. Kordenbrock
Remote Direct Memory Access (RDMA) is an increasingly important technology in high-performance computing (HPC). RDMA provides low-latency, high-bandwidth data transfer between compute nodes. Additionally, it does not require explicit synchronization with the destination processor. Eliminating unnecessary synchronization can significantly improve the communication performance of large-scale scientific codes. A long-standing challenge presented by RDMA communication is mitigating the cost of registering memory with the network interface controller (NIC). Reusing memory once it is registered has been shown to significantly reduce the cost of RDMA communication. However, existing approaches for reusing memory rely on implicit memory semantics. In this paper, we introduce an approach that makes memory reuse semantics explicit by exposing a separate allocator for registered memory. The data and analysis in this paper yield the following contributions: (i) managing registered memory explicitly enables efficient reuse of registered memory; (ii) registering large memory regions to amortize the registration cost over multiple user requests can significantly reduce the cost of acquiring new registered memory; and (iii) reducing the cost of acquiring registered memory can significantly improve the performance of RDMA communication. Reusing registered memory is key to high-performance RDMA communication. By making reuse semantics explicit, our approach has the potential to improve RDMA performance by making it significantly easier for programmers to efficiently reuse registered memory.
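The idea of explicit reuse semantics can be illustrated with a simple pool-allocator sketch; the Python code below is conceptual only (the paper’s library is not described here), and `register_memory` is a placeholder for a real RDMA registration call such as ibv_reg_mr in a C verbs program.

```python
# Conceptual sketch of an explicit registered-memory allocator: register one large
# region up front and hand out sub-buffers, so the expensive registration cost is
# amortized across many requests. `register_memory` is a hypothetical stand-in.
class RegisteredPool:
    """Hand out sub-buffers of a single registered region (registration paid once)."""
    def __init__(self, size: int, register_memory):
        self.buffer = bytearray(size)               # one large region...
        self.handle = register_memory(self.buffer)  # ...registered exactly once
        self.next_offset = 0
        self.free_chunks = []                       # recycled (offset, length) pairs

    def alloc(self, length: int):
        for i, (off, ln) in enumerate(self.free_chunks):
            if ln >= length:                        # reuse a freed chunk if possible
                self.free_chunks.pop(i)
                return (off, length)
        off = self.next_offset
        if off + length > len(self.buffer):
            raise MemoryError("registered pool exhausted")
        self.next_offset += length
        return (off, length)

    def free(self, chunk):
        self.free_chunks.append(chunk)              # memory stays registered for reuse

    def view(self, chunk):
        off, length = chunk
        return memoryview(self.buffer)[off:off + length]

# Usage with a stand-in registration function (a real system would call the NIC).
pool = RegisteredPool(1 << 20, register_memory=lambda buf: "fake-mr-handle")
c1 = pool.alloc(4096)
pool.view(c1)[:5] = b"hello"
pool.free(c1)
c2 = pool.alloc(4096)          # reuses the same registered chunk, no new registration
assert c1 == c2
```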
{"title":"The Case for Explicit Reuse Semantics for RDMA Communication","authors":"Scott Levy, Patrick M. Widener, C. Ulmer, T. Kordenbrock","doi":"10.1109/IPDPSW50202.2020.00148","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00148","url":null,"abstract":"Remote Direct Memory Access (RDMA) is an increasingly important technology in high-performance computing (HPC). RDMA provides low-latency, high-bandwidth data transfer between compute nodes. Additionally, it does not require explicit synchronization with the destination processor. Eliminating unnecessary synchronization can significantly improve the communication performance of large-scale scientific codes. A long-standing challenge presented by RDMA communication is mitigating the cost of registering memory with the network interface controller (NIC). Reusing memory once it is registered has been shown to significantly reduce the cost of RDMA communication. However, existing approaches for reusing memory rely on implicit memory semantics. In this paper, we introduce an approach that makes memory reuse semantics explicit by exposing a separate allocator for registered memory. The data and analysis in this paper yield the following contributions: (i) managing registered memory explicitly enables efficient reuse of registered memory; (ii) registering large memory regions to amortize the registration cost over multiple user requests can significantly reduce cost of acquiring new registered memory; and (iii) reducing the cost of acquiring registered memory can significantly improve the performance of RDMA communication. Reusing registered memory is key to high-performance RDMA communication. By making reuse semantics explicit, our approach has the potential to improve RDMA performance by making it significantly easier for programmers to efficiently reuse registered memory.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116656177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}