Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00109
Timothy A. K. Zakian, L. Capelli, Zhenjiang Hu
As the graphs in our world grow ever larger, the need for programmable, easy-to-use, and highly scalable graph processing has grown with them. One popular model, the vertex-centric computational model, addresses this need by distributing computation across the vertices of the input graph. Because the program is distributed to the vertices, the programmer "thinks like a vertex" when writing a graph computation: there is little or no shared memory, and almost all communication between on-vertex computations must be sent over the network. Given this inherent communication overhead, reducing the number of messages sent during a computation is central to any effort to optimize vertex-centric programs. While previous work has focused on reducing communication overhead by directly changing communication patterns, either by altering how the graph is partitioned and distributed or by altering the graph topology itself, this paper presents a different optimization strategy: a family of complementary compile-time program transformations that minimize communication overhead by changing both the messaging and computational structure of programs. In particular, we present and formalize a method by which a compiler can automatically incrementalize a vertex-centric program through a series of compile-time transformations, modifying the on-vertex computation and inter-vertex messaging so that messages represent patches to be applied to the receiving vertex's local state. We empirically evaluate these transformations on a set of common vertex-centric algorithms and graphs, achieving an average reduction of 2.7x in total computation time and 2.9x in the number of messages sent across all programs in the benchmark suite.
Furthermore, since these are compile-time program transformations alone, other prior optimization strategies for vertex-centric programs can work with the resulting vertex-centric program just as they would a non-incrementalized program.
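The patch-based messaging idea described above can be illustrated with a minimal Python sketch. All names here are hypothetical, and this is only a sketch of the general idea under simplifying assumptions (numeric vertex state, a single sender/receiver pair), not the paper's actual transformation output:

```python
# Sketch of patch-based (incremental) vertex messaging: instead of resending
# a vertex's full state each superstep, send only the change (a "patch")
# and let the receiver apply it to its cached copy of the sender's state.

class Vertex:
    def __init__(self, vid, value):
        self.vid = vid
        self.value = value
        self.last_sent = None          # state as of the last message we sent

    def outgoing_patch(self):
        """Return a delta against what neighbors already know, or None."""
        if self.last_sent is None:          # first superstep: send full value
            patch = ("full", self.value)
        elif self.value != self.last_sent:  # later: send only the difference
            patch = ("delta", self.value - self.last_sent)
        else:
            return None                     # unchanged: no message at all
        self.last_sent = self.value
        return patch

def apply_patch(cached, patch):
    kind, payload = patch
    return payload if kind == "full" else cached + payload

# Two supersteps on one sender/receiver pair.
v = Vertex(0, 10.0)
cache = apply_patch(None, v.outgoing_patch())   # superstep 1: full state
v.value = 10.5
cache = apply_patch(cache, v.outgoing_patch())  # superstep 2: delta only
```

Note how an unchanged vertex produces no message at all, which is one way such a transformation could reduce message counts.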
Title: "Incrementalization of Vertex-Centric Programs"
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00087
S. M. Ghazimirsaeed, S. Mirsadeghi, A. Afsahi
Neighborhood collectives were introduced in the MPI 3.0 standard to let users define their own communication patterns through MPI's process topology interface. In this paper, we propose a collaborative communication mechanism based on the common neighborhoods that may exist among groups of k processes. These common neighborhoods are used to decrease the number of communication stages through message combining. We show how the design of our desired communication pattern can be modeled as a maximum weighted matching problem on distributed hypergraphs, and propose a distributed algorithm to solve it. Moreover, we consider two design alternatives: topology-agnostic and topology-aware. The former ignores the physical topology of the system and the mapping of processes, whereas the latter takes both into account to further optimize the communication pattern. Our experimental results show improvements of up to 8x for various process topologies and up to 5.2x for an SpMM kernel.
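The message-combining idea above can be sketched in plain Python. This is a toy simulation under invented assumptions (two processes with an identical neighborhood, a designated leader), not MPI code and not the paper's matching-based algorithm:

```python
# Sketch of message combining over a common neighborhood: if processes p and
# q share the same neighbors, one of them (a "leader") can gather the group's
# payloads and send a single combined message to each shared neighbor,
# instead of each sender messaging every neighbor separately.

def naive_exchange(neighborhoods, payloads):
    """Every process sends its payload to each of its neighbors separately."""
    messages = []
    for src, nbrs in neighborhoods.items():
        for dst in nbrs:
            messages.append((src, dst, [payloads[src]]))
    return messages

def combined_exchange(neighborhoods, payloads, groups):
    """Each group of senders with an identical neighborhood elects a leader
    that gathers the group's payloads and sends one combined message."""
    messages, grouped = [], set()
    for group in groups:
        leader = group[0]
        combined = [payloads[p] for p in group]
        for dst in neighborhoods[leader]:
            messages.append((leader, dst, combined))
        # intra-group gather: one message from each non-leader to the leader
        for p in group[1:]:
            messages.append((p, leader, [payloads[p]]))
        grouped.update(group)
    for src, nbrs in neighborhoods.items():
        if src not in grouped:
            for dst in nbrs:
                messages.append((src, dst, [payloads[src]]))
    return messages

# Processes 0 and 1 share the same three neighbors {2, 3, 4}.
nbh = {0: [2, 3, 4], 1: [2, 3, 4]}
pay = {0: "a", 1: "b"}
n_naive = len(naive_exchange(nbh, pay))               # 6 separate messages
n_comb = len(combined_exchange(nbh, pay, [[0, 1]]))   # 3 combined + 1 gather
```

Even in this two-process toy, combining cuts six messages down to four; the savings grow with the neighborhood size and the group size k.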
Title: "An Efficient Collaborative Communication Mechanism for MPI Neighborhood Collectives"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00074
Wenyi Zhao, Quan Chen, Hao Lin, Jianfeng Zhang, Jingwen Leng, Chao Li, Wenli Zheng, Li Li, M. Guo
Predicting the performance degradation a GPU application suffers when co-located with other applications on a spatial multitasking GPU, without prior application knowledge, is essential in public clouds. Prior work mainly targets CPU co-location and is inaccurate and/or inefficient at predicting the performance of co-located applications on spatial multitasking GPUs. Our investigation shows that hardware event statistics caused by co-located applications, which can be collected with negligible overhead, strongly correlate with their slowdowns. Based on this observation, we present Themis, an online slowdown predictor that can precisely and efficiently predict application slowdown without prior application knowledge. We first train a precise slowdown model offline using hardware event statistics collected from representative co-locations. When new applications co-run, Themis collects event statistics and predicts their slowdowns simultaneously. Our evaluation shows that Themis has negligible runtime overhead and predicts application-level slowdown with an error smaller than 9.5%. Based on Themis, we also implement an SM allocation engine to rein in application slowdown at co-location. Case studies show that the engine successfully enforces fair sharing and QoS.
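The offline-training / online-prediction split can be sketched with a deliberately tiny model. The single hardware counter, the linear form, and all numbers below are invented for illustration; Themis's actual model is certainly richer than one-feature least squares:

```python
# Sketch of offline model fitting + online slowdown prediction:
# fit slowdown = a * event_rate + b from representative co-locations,
# then apply the model to counters collected from a new co-location.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Offline: representative co-locations (event rate -> measured slowdown).
event_rate = [0.1, 0.2, 0.4, 0.8]
slowdown   = [1.05, 1.10, 1.20, 1.40]
a, b = fit_linear(event_rate, slowdown)

# Online: a new application's counters arrive; predict its slowdown.
predicted = a * 0.5 + b
```

The point of the sketch is the workflow, not the model: training happens once offline, and the online path is just a cheap model evaluation, which is why such a predictor can have negligible runtime overhead.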
Title: "Themis: Predicting and Reining in Application-Level Slowdown on Spatial Multitasking GPUs"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00082
Philip Dexter, K. Chiu, Bedri Sendir
Consistency models for distributed data stores offer insights into, and paths to reasoning about, what a user of such a system can expect. However, consistency models are often defined or implemented at a coarse granularity, making it difficult to achieve precisely the consistency required. Further, many applications are already written to handle anomalies in distributed systems, yet they have little opportunity to express or take advantage of that leniency. We propose reflective consistency: an active solution that adapts an underlying data store to changing loads and resource availability to meet a given consistency level. We implement reflective consistency in Cassandra, an existing distributed data store supporting per-read and per-write consistency. Our implementation allows users to express their anomaly leniency directly, and the system reacts to the presence of anomalies, changing Cassandra's consistency level only when needed. Users of Reflective Cassandra can expect minimal overhead (from 1% to 14% depending on configuration) and a 50% decrease in the number of costly strong reads.
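The reactive behavior described above, escalating consistency only while observed anomalies exceed a user-declared leniency budget, can be sketched as a small state machine. The class, thresholds, and level names are illustrative (they echo Cassandra's ONE/QUORUM read consistency levels but are not its API), and this is not the paper's implementation:

```python
# Sketch of "reflective" consistency: count anomalies against a user-declared
# budget per window; escalate reads to a strong level only while the budget
# is exceeded, and relax again when a new window starts cleanly.

class ReflectiveReads:
    def __init__(self, anomaly_budget):
        self.budget = anomaly_budget   # anomalies tolerated per window
        self.anomalies = 0
        self.level = "ONE"             # cheap weak reads by default

    def observe_anomaly(self):
        self.anomalies += 1
        if self.anomalies > self.budget:
            self.level = "QUORUM"      # escalate to costly strong reads

    def end_window(self):
        self.anomalies = 0
        self.level = "ONE"             # relax once anomalies subside

r = ReflectiveReads(anomaly_budget=2)
for _ in range(3):
    r.observe_anomaly()
escalated = r.level        # strong reads after the budget is exceeded
r.end_window()
relaxed = r.level          # back to cheap reads
```

Because strong reads are only used while anomalies are actually being observed, most reads stay cheap, which is the source of the reported reduction in costly strong reads.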
Title: "An Error-Reflective Consistency Model for Distributed Data Stores"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00102
Matthias Hauck, M. Paradies, H. Fröning
Atomic operations are an important concept for indivisible updates in parallel computing. On most architectures they also provide ordering guarantees, which in practice can hurt performance. For associative and commutative updates, we present software buffering techniques that overcome this ordering problem by combining multiple updates in a temporary buffer and by prefetching addresses before updating them. As a result, our buffering techniques reduce contention and avoid unnecessary ordering constraints, increasing the amount of memory parallelism. We evaluate our techniques in different scenarios, including histogram and graph computations, and reason about their applicability to standard and multi-socket systems.
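The combining idea exploits exactly the associativity and commutativity the abstract names: deltas to the same address can be summed locally in any order before a single read-modify-write is issued. A minimal sketch (simulated "memory" and an invented flush threshold; the paper's techniques also involve prefetching, which is omitted here):

```python
# Sketch of software buffering for associative/commutative updates: instead
# of one atomic read-modify-write per update, accumulate updates to the same
# address in a small buffer and apply each address's combined total once.

def buffered_updates(updates, flush_threshold=4):
    """Combine (address, delta) updates in a buffer; count combined writes."""
    buffer, atomic_writes, memory = {}, 0, {}

    def flush():
        nonlocal atomic_writes
        for addr, total in buffer.items():
            memory[addr] = memory.get(addr, 0) + total  # one combined RMW
            atomic_writes += 1
        buffer.clear()

    for addr, delta in updates:
        buffer[addr] = buffer.get(addr, 0) + delta      # combine locally
        if len(buffer) >= flush_threshold:              # bound buffer size
            flush()
    flush()
    return memory, atomic_writes

# A histogram-style workload: many increments to few bins.
ups = [(0, 1), (1, 1), (0, 1), (0, 1), (1, 1), (0, 1)]
mem, writes = buffered_updates(ups)
```

Here six increments collapse into two combined writes; a naive implementation would issue six atomics, each with its ordering cost.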
Title: "Software-Based Buffering of Associative Operations on Random Memory Addresses"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00076
Kyung Hoon Kim, Priyank Devpura, Abhishek Nayyar, Andrew Doolittle, K. H. Yum, Eun Jung Kim
Graphics Processing Units (GPUs) have been widely adopted for diverse general-purpose applications due to their massive degree of parallelism. The demand for large-scale GPUs that process large volumes of data with high throughput has been rising rapidly. However, designing a bandwidth-efficient network for large-scale GPUs is challenging. Compression techniques are a practical remedy that effectively increases network bandwidth by reducing the size of the data transferred. We propose a simple new compression mechanism, Dual Pattern Compression (DPC), that compresses only two patterns with very low latency. The simplicity of compression and decompression is achieved through data remapping and data-type-aware preprocessing that exploits bit-level data redundancy; the data type is detected at runtime. We demonstrate that our compression scheme effectively mitigates network congestion in a large-scale GPU. It improves IPC by 33% on average (up to 126%) across various benchmarks, with average space savings of 61% in integer, 46% (up to 72%) in floating-point, and 23% (up to 57%) in character-type benchmarks.
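The "compress only two patterns" idea can be illustrated with a toy codec that encodes just all-zero words and repeats of the previous word, storing everything else as a literal. The encoding format below is invented for illustration and is not DPC's wire format or its data-remapping step:

```python
# Toy two-pattern compressor: pattern 1 = all-zero word, pattern 2 = word
# equal to the previous word; anything else is stored uncompressed. Keeping
# the pattern set this small is what keeps (de)compression latency low.

def dpc_compress(words):
    out, prev = [], None
    for w in words:
        if w == 0:
            out.append(("Z",))          # pattern 1: all-zero word
        elif w == prev:
            out.append(("R",))          # pattern 2: repeat of previous word
        else:
            out.append(("L", w))        # literal: stored uncompressed
        prev = w
    return out

def dpc_decompress(codes):
    words, prev = [], None
    for c in codes:
        if c[0] == "Z":
            w = 0
        elif c[0] == "R":
            w = prev
        else:
            w = c[1]
        words.append(w)
        prev = w
    return words

data = [0, 0, 7, 7, 7, 3, 0]
codes = dpc_compress(data)
```

On this sample only two of seven words need a literal; real GPU traffic is full of zero and repeated words, which is why two patterns capture so much of the redundancy.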
Title: "Dual Pattern Compression Using Data-Preprocessing for Large-Scale GPU Architectures"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00028
Bruno R. C. Magalhães, T. Sterling, F. Schürmann, M. Hines
Exposing parallelism in scientific applications has become a core requirement for running efficiently on modern distributed multicore SIMD compute architectures. The granularity of parallelism that can be attained is a key determinant of the achievable acceleration and time to solution. Motivated by a scientific use case that requires simulating long spans of time, the study of plasticity and learning in detailed models of brain tissue, we present a strategy that exposes and exploits multicore and SIMD micro-parallelism by unrolling flow dependencies and concurrent outputs in a large system of coupled ordinary differential equations (ODEs). We present an implementation of a parallel simulator running on the HPX runtime system for the ParalleX execution model, which provides dynamic task scheduling and asynchronous execution. The implementation was tested on different architectures using a previously published brain tissue model. Benchmarks of single neurons on a single compute node show a speed-up of roughly 4-7x over the state-of-the-art Single Instruction Multiple Data (SIMD) implementation and 13-40x over its Single Instruction Single Data (SISD) counterpart. Large-scale benchmarks suggest almost ideal strong scaling and a speed-up of 2-8x on a distributed architecture of 128 Cray X6 compute nodes.
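One way to read "exposing micro-parallelism from flow dependencies" is as levelizing the within-timestep dependency graph: equations whose inputs are already available can be evaluated concurrently. A generic sketch (the four-variable graph is invented, not a neuron model, and it assumes the within-step graph is acyclic, e.g. because reads of not-yet-updated variables use previous-step values):

```python
# Sketch of grouping an ODE system's updates into dependency "levels":
# every update in a level reads only values produced in earlier levels,
# so a whole level can run concurrently (SIMD lanes, tasks, ...).

def dependency_levels(deps):
    """deps[v] = set of variables v's update reads. Returns parallel levels."""
    levels = []
    remaining = set(deps)
    while remaining:
        ready = [v for v in remaining if not (deps[v] & remaining)]
        if not ready:
            raise ValueError("cyclic flow dependencies")
        levels.append(sorted(ready))
        remaining -= set(ready)
    return levels

# du reads nothing this step; dv and dw read u; dx reads v and w.
flow = {"u": set(), "v": {"u"}, "w": {"u"}, "x": {"v", "w"}}
lvls = dependency_levels(flow)
```

Here `v` and `w` land in the same level and can be evaluated in parallel, which is the kind of concurrency the unrolling strategy makes visible to the runtime.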
Title: "Exploiting Flow Graph of System of ODEs to Accelerate the Simulation of Biologically-Detailed Neural Networks"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00014
U. Agarwal, V. Ramachandran
We present new results for the distributed computation of all pairs shortest paths (APSP) in the CONGEST model on an n-node graph with moderate non-negative integer weights. Our methods can handle zero-weight edges, which are known to present difficulties for distributed APSP algorithms. The current best deterministic distributed algorithm in the CONGEST model that handles zero-weight edges is the Õ(n^(3/2))-round algorithm of Agarwal et al. [ARKP18], which works for arbitrary edge weights. Our new deterministic algorithms run in Õ(W^(1/4) ⋅ n^(5/4)) rounds on graphs with non-negative integer edge weights at most W, and in Õ(n ⋅ Δ^(1/3)) rounds for shortest-path distances at most Δ. These algorithms are built on top of a new pipelined algorithm we present for this problem that runs in at most 2n√Δ + 2n rounds. Additionally, we show that our techniques simplify some of the procedures in the earlier APSP algorithms for non-negative edge weights in [HNS17, ARKP18]. We also present new results for computing h-hop shortest paths from k given sources, and an Õ(n/ε^2)-round deterministic (1+ε)-approximation algorithm for graphs with non-negative poly(n) integer weights, improving results in [Nanongkai14, LP15] that hold only for positive integer weights.
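To make the round-based CONGEST setting concrete, here is a naive synchronous single-source simulation: in each round every node relays its current distance estimate over its edges, and we count rounds until the estimates stabilize. This only illustrates the cost model (and why zero-weight edges are legal inputs); the paper's pipelined APSP algorithm is far more refined than this broadcast:

```python
# Naive round-counting simulation of shortest-path propagation in a
# synchronous message-passing (CONGEST-style) model.

def congest_sssp(adj, src):
    """adj[u] = list of (v, w). Returns (distances, rounds used)."""
    INF = float("inf")
    dist = {u: INF for u in adj}
    dist[src] = 0
    rounds = 0
    while True:
        # One round: every node sends dist[u] over each incident edge.
        updates = {}
        for u in adj:
            if dist[u] == INF:
                continue
            for v, w in adj[u]:
                cand = dist[u] + w
                if cand < dist[v] and cand < updates.get(v, INF):
                    updates[v] = cand
        rounds += 1
        if not updates:          # no estimate changed: stable
            break
        dist.update(updates)
    return dist, rounds

# A 3-node path with a zero-weight edge between nodes 1 and 2.
graph = {0: [(1, 1)], 1: [(0, 1), (2, 0)], 2: [(1, 0)]}
d, r = congest_sssp(graph, 0)
```

Running one such computation per source is the naive route to APSP; pipelining lets many sources' messages share rounds instead of running back to back, which is where the improved round bounds come from.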
Title: "Distributed Weighted All Pairs Shortest Paths Through Pipelining"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00021
Jordi Wolfson-Pou, Edmond Chow
Reducing synchronization in iterative methods for solving large sparse linear systems may become one of the most important goals for such solvers on exascale computers. Research in asynchronous iterative methods has primarily considered basic iterative methods. In this paper, we examine how multigrid methods can be executed asynchronously. We present models of asynchronous additive multigrid methods, and use these models to study the convergence properties of these methods. We also introduce two parallel algorithms for implementing asynchronous additive multigrid, the global-res and local-res algorithms. These two algorithms differ in how the fine grid residual is computed, where local-res requires less computation than global-res but converges more slowly. We compare two types of asynchronous additive multigrid methods: the asynchronous fast adaptive composite grid method with smoothing (AFACx) and additive variants of the classical multiplicative method (Multadd). We implement asynchronous versions of Multadd and AFACx in OpenMP and generate the prolongation and coarse grid matrices using the BoomerAMG package. Our experimental results show that asynchronous multigrid can exhibit grid-size independent convergence and can be faster than classical multigrid in terms of solve wall-clock time. We also show that asynchronous smoothing is the best choice of smoother for our test cases, even when only one smoothing sweep is used.
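The synchronous-versus-asynchronous contrast underlying this line of work can be shown on a single smoother sweep. In the "asynchronous" variant below each component simply reads whatever values are currently in place (Gauss-Seidel-like staleness) instead of waiting for a completed sweep; this illustrates basic asynchronous iterations only, not multigrid itself, and the 2x2 system is invented for the example:

```python
# Jacobi-style sweeps on a tiny diagonally dominant system A x = b.
# Synchronous: each sweep reads a snapshot of x (all components wait).
# Asynchronous: each component reads the freshest values available.

A = [[4.0, 1.0], [1.0, 3.0]]
b = [5.0, 4.0]   # exact solution: x = [1, 1]

def sweep(x, asynchronous):
    src = x if asynchronous else list(x)   # async reads in-place updates
    for i in range(len(x)):
        s = sum(A[i][j] * src[j] for j in range(len(x)) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x

def solve(asynchronous, iters=50):
    x = [0.0, 0.0]
    for _ in range(iters):
        sweep(x, asynchronous)
    return x

x_sync = solve(False)
x_async = solve(True)
```

Both variants converge here; the asynchronous one needs no barrier between component updates, and removing exactly that kind of synchronization, at every grid level, is the goal of the asynchronous multigrid methods studied in the paper.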
Title: "Asynchronous Multigrid Methods"
Pub Date : 2019-05-20DOI: 10.1109/IPDPS.2019.00049
Zhichao Yan, Hong Jiang, Yujuan Tan, S. Skelton, Hao Luo
Lossless data reduction techniques, particularly compression and deduplication, have emerged as effective approaches to tackling the combined challenge of explosively growing data volumes and lagging network bandwidth, improving space and bandwidth efficiency in cloud storage environments. However, our observations reveal that traditional deduplication solutions are rendered essentially useless at detecting and removing redundant data from compressed packages in the cloud, which are poised to grow greatly in presence and popularity. This is because uncompressed, compressed, and differently compressed packages of exactly the same contents tend to have completely different byte-stream patterns, so their redundancy cannot be identified by comparing fingerprints. Compressed packages that mix different data yet still contain substantial duplicate content further exacerbate the problem in the cloud storage environment. To address this fundamental problem, we propose Z-Dedup, a novel deduplication system that detects and removes redundant data in compressed packages by exploiting key invariant information embedded in their metadata, such as per-file checksums and original file lengths. Our evaluations show that Z-Dedup can significantly improve both space and bandwidth efficiency over traditional approaches, eliminating 1.61% to 98.75% of the redundant data in a compressed package on our collected datasets, with even greater savings expected as storage servers accumulate more compressed contents.
{"title":"Z-Dedup: A Case for Deduplicating Compressed Contents in Cloud","authors":"Zhichao Yan, Hong Jiang, Yujuan Tan, S. Skelton, Hao Luo","doi":"10.1109/IPDPS.2019.00049","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00049","url":null,"abstract":"Lossless data reduction techniques, particularly compression and deduplication, have emerged as effective approaches to tackling the combined challenge of explosive growth in data volumes but lagging growth in network bandwidth, to improve space and bandwidth efficiency in the cloud storage environment. However, our observations reveal that traditional deduplication solutions are rendered essentially useless in detecting and removing redundant data from the compressed packages in the cloud, which are poised to greatly increase in their presence and popularity. This is because even uncompressed, compressed and differently compressed packages of the exact same contents tend to have completely different byte stream patterns, whose redundancy cannot be identified by comparing their fingerprints. This, combined with different compressed packets mixed with different data but containing significant duplicate data, will further exacerbate the problem in the cloud storage environment. To address this fundamental problem, we propose Z-Dedup, a novel deduplication system that is able to detect and remove redundant data in compressed packages, by exploiting some key invariant information embedded in the metadata of compressed packages such as file-based checksum and original file length information. 
Our evaluations show that Z-Dedup can significantly improve both space and bandwidth efficiency over traditional approaches by eliminating 1.61% to 98.75% redundant data of a compressed package based on our collected datasets, and even more storage space and bandwidth are expected to be saved after the storage servers have accumulated more compressed contents.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116142087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
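The key observation in the Z-Dedup abstract, that per-file checksums and original file lengths recorded in an archive's metadata are invariant under recompression, can be sketched for zip packages using Python's standard `zipfile` module, whose `ZipInfo` entries expose exactly these fields (`filename`, `CRC`, `file_size`). The function names and the matching policy below are illustrative assumptions, not the paper's actual system:

```python
import io
import zipfile

def content_keys(zip_bytes):
    # Read compression-invariant metadata from a zip package's central
    # directory: per-file name, CRC-32 checksum, and uncompressed size.
    # These fields are the same whether an entry is stored, deflated,
    # or compressed at a different level, so they can identify duplicate
    # contents that byte-level fingerprinting of the compressed streams
    # would miss.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {(info.filename, info.CRC, info.file_size)
                for info in zf.infolist()}

def redundant_fraction(pkg_a, pkg_b):
    # Fraction of pkg_a's entries whose content already appears in
    # pkg_b, judged by metadata keys rather than by comparing the
    # compression-dependent byte streams. (Hypothetical helper, not
    # Z-Dedup's actual matching logic.)
    a, b = content_keys(pkg_a), content_keys(pkg_b)
    return len(a & b) / len(a) if a else 0.0
```

Two archives of the same files built with `ZIP_STORED` and `ZIP_DEFLATED` have entirely different byte streams, so fingerprint-based deduplication finds no overlap between them, yet their (name, CRC-32, size) keys match completely.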