"Highly Scalable Algorithms for the Sparse Grid Combination Technique" by P. Strazdins, Md. Mohsin Ali, B. Harding. DOI: https://doi.org/10.1109/IPDPSW.2015.76

Many petascale and exascale scientific simulations involve the time evolution of systems modelled as Partial Differential Equations (PDEs). The sparse grid combination technique (SGCT) is a cost-effective method for solving time-evolving PDEs, especially for higher-dimensional problems. It consists of evolving the PDE over a set of grids of differing resolution in each dimension, and then combining the results to approximate the solution of the PDE on a grid of high resolution in all dimensions. It can also be extended to support algorithm-based fault tolerance, which is also important for computations at this scale. In this paper, we present two new parallel algorithms for the SGCT that support full distributed-memory parallelization over the dimensions of the component grids, as well as over the grids themselves. The direct algorithm is so called because it directly implements an SGCT combination formula. The second algorithm converts each component grid into its hierarchical surpluses, and then uses the direct algorithm on each of the hierarchical surpluses. The conversion to/from the hierarchical surpluses is also an important algorithm in its own right. An analysis of both indicates that the direct algorithm minimizes the number of messages, whereas the hierarchical surplus algorithm minimizes memory consumption and offers a reduction in bandwidth by a factor of $1 - 2^{-d}$, where $d$ is the dimensionality of the SGCT. However, this is offset by its incomplete parallelism and a factor-of-two load imbalance in practical scenarios. Our analysis also indicates that both are suitable in a bandwidth-limited regime. Experimental results, including the strong and weak scalability of the algorithms, indicate that, for scenarios of practical interest, both are sufficiently scalable to support large-scale SGCT, but the direct algorithm generally performs better, to within a factor of 2. Hierarchical surplus formation is much less communication-intensive, but shows less scalability with increasing core counts.
"An Architecture for Configuring an Efficient Scan Path for a Subset of Elements" by Arash Ashrafi, R. Vaidyanathan. DOI: https://doi.org/10.1109/IPDPSW.2015.124

Many FPGAs support partial reconfiguration, where the states of a subset of configurable elements are (potentially) altered. However, the configuration bits often enter the chip through a small number of pins. Thus, the time needed to partially reconfigure an FPGA depends, to a large extent, on the number of configuration bits to be input into the chip. This is a key consideration, particularly where partial reconfiguration is performed during the computation. Therefore, it is important that the size of a frame (an atomic configuration unit) be small and that the configuration be focused on the bits that truly need to be altered. Suppose C denotes the set of elements that need to be configured during a partial reconfiguration phase, where C is a (small) subset of k frames from a much larger set S of n frames. In this paper we present a method to configure the k elements of C by setting up a configuration path that strings its way through only those frames that require reconfiguration; the configuration bit stream can then be shifted in through this path. Our method also automatically selects (in hardware) a suitable clock speed that can be used to input these configuration bits. If the elements of C show spatial locality, then the configuration time can be made largely independent of n.
{"title":"Message from the HCW Program Committee Chair","authors":"D. Trystram","doi":"10.1109/IPDPSW.2015.155","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.155","url":null,"abstract":"Message from the HCW Program Chair","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114708039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A Distributed Greedy Heuristic for Computing Voronoi Tessellations with Applications Towards Peer-to-Peer Networks" by Brendan Benshoof, Andrew Rosen, A. Bourgeois, R. Harrison. DOI: https://doi.org/10.1109/IPDPSW.2015.120

Computing Voronoi tessellations in an arbitrary number of dimensions is a computationally difficult task. This problem is exacerbated in distributed environments, such as peer-to-peer networks and wireless networks, where Voronoi tessellations have useful applications. We present our Distributed Greedy Voronoi Heuristic, which approximates Voronoi tessellations in distributed environments. Our heuristic is fast, scalable, works in any geometric space with a distance and midpoint function, and has interesting applications in embedding metrics such as latency in the links of a distributed network.
"Graphulo: Linear Algebra Graph Kernels for NoSQL Databases" by V. Gadepally, Jake Bolewski, D. Hook, D. Hutchison, B. A. Miller, J. Kepner. DOI: https://doi.org/10.1109/IPDPSW.2015.19

Big data and the Internet of Things era continue to challenge computational systems. Several technology solutions such as NoSQL databases have been developed to deal with this challenge. In order to generate meaningful results from large datasets, analysts often use a graph representation which provides an intuitive way to work with the data. Graph vertices can represent users and events, and edges can represent the relationship between vertices. Graph algorithms are used to extract meaningful information from these very large graphs. At MIT, the Graphulo initiative is an effort to perform graph algorithms directly in NoSQL databases such as Apache Accumulo or SciDB, which have an inherently sparse data storage scheme. Sparse matrix operations have a history of efficient implementations, and the Graph Basic Linear Algebra Subprogram (GraphBLAS) community has developed a set of key kernels that can be used to develop efficient linear algebra operations. However, in order to use the GraphBLAS kernels, it is important that common graph algorithms be recast using the linear algebra building blocks. In this article, we look at common classes of graph algorithms and recast them into linear algebra operations using the GraphBLAS building blocks.
"GPU-based Parallel R-tree Construction and Querying" by S. Prasad, Michael McDermott, Xi He, S. Puri. DOI: https://doi.org/10.1109/IPDPSW.2015.127

An R-tree is a data structure for organizing and querying multi-dimensional, non-uniform, and overlapping data. Efficient parallelization of the R-tree is an important problem due to societal applications such as geographic information systems (GIS), spatial database management systems, and VLSI layout, which employ R-trees for spatial analysis tasks such as map overlay. As graphics processing units (GPUs) have emerged as powerful computing platforms, these R-tree-related applications demand efficient R-tree construction and search algorithms on GPUs. This problem is hard due both to (i) the non-linear tree topology of the data structure itself and (ii) the unconventional single-instruction multiple-thread (SIMT) architecture of modern GPUs, which requires careful engineering of a host of issues. Consequently, the current best parallelizations of R-trees on GPUs achieve a limited speedup of only about 20-fold. We present a space-efficient data structure design and a non-trivial bottom-up construction algorithm for the R-tree on GPUs. This has yielded the first demonstrated 226-fold speedup in parallel construction of an R-tree on a GPU compared to one-core execution on a CPU. We also present innovative R-tree search algorithms that are designed to overcome the GPU's architectural and resource limitations. The best of these algorithms gives a speedup of 91-fold to 180-fold on an R-tree with 16384 base objects for query sizes ranging from 2k to 16k.
"Cost-Driven Scheduling for Deadline-Constrained Workflow on Multi-clouds" by Bing Lin, Wenzhong Guo, Guolong Chen, N. Xiong, Rongrong Li. DOI: https://doi.org/10.1109/IPDPSW.2015.56

The tremendous parallel computing ability of Cloud computing as a new service-provisioning paradigm encourages investigators to research its drawbacks and advantages for processing large-scale scientific applications such as workflows. The current Cloud market is composed of numerous diverse Cloud providers, and workflow scheduling is one of the biggest challenges on Multi-Clouds. However, existing works fail either to satisfy the Quality of Service (QoS) requirements of end users or to account for some fundamental principles of Cloud computing, such as the pay-as-you-go pricing model and heterogeneous computing resources. In this paper, we adapt the Partial Critical Paths algorithm (PCPA) to the multi-cloud environment and propose a scheduling strategy for scientific workflows, called Multi-Cloud Partial Critical Paths (MCPCP), which aims to minimize the execution cost of a workflow while satisfying the defined deadline constraint. Our approach takes into account the essential characteristics of Multi-Clouds, such as charging per time interval, various instance types from different Cloud providers, and homogeneous intra-cloud vs. heterogeneous inter-cloud bandwidth. Several well-known workflows are used to evaluate our strategy, and the experimental results show that the proposed approach performs well on Multi-Clouds.
"Perfect Hashing Structures for Parallel Similarity Searches" by T. T. Tran, Mathieu Giraud, Jean-Stéphane Varré. DOI: https://doi.org/10.1109/IPDPSW.2015.105

Seed-based heuristics have proved to be efficient for studying similarity between genetic databases with billions of base pairs. This paper focuses on algorithms and data structures for the filtering phase in seed-based heuristics, with an emphasis on efficient parallel GPU/many-core implementation. We propose a two-stage index structure based on neighborhood indexing and perfect hashing techniques. This structure performs a filtering phase over the neighborhood regions around the seeds in constant time and avoids, as much as possible, random memory accesses and branch divergences. Moreover, it fits particularly well on parallel SIMD processors, because it requires intensive but homogeneous computational operations. Using this data structure, we developed a fast and sensitive OpenCL prototype read mapper.
"Scheduling Computational Workflows on Failure-Prone Platforms" by G. Aupy, A. Benoit, H. Casanova, Y. Robert. DOI: https://doi.org/10.1109/IPDPSW.2015.33

We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery are used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and whether or not to checkpoint a task after it completes. We give a polynomial-time algorithm for fork graphs and show that the problem is NP-complete for join graphs. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow given a specified set of tasks to checkpoint. Using this algorithm as a basis, we propose efficient heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
"Semi-two-dimensional Partitioning for Parallel Sparse Matrix-Vector Multiplication" by Enver Kayaaslan, B. Uçar, C. Aykanat. DOI: https://doi.org/10.1109/IPDPSW.2015.20

We propose a novel sparse matrix partitioning scheme, called semi-two-dimensional (s2D), for efficient parallelization of sparse matrix-vector multiply (SpMV) operations on distributed-memory systems. In s2D, matrix nonzeros are distributed among processors more flexibly than in one-dimensional (row-wise or column-wise) partitioning schemes. Yet, there is a constraint which renders s2D less flexible than two-dimensional (nonzero-based) partitioning schemes. The constraint is enforced to confine all communication operations to a single phase of a parallel SpMV operation, as in a 1D partition. Viewed positively, s2D can thus be seen as being close to 2D partitions in terms of flexibility, and close to 1D partitions in terms of computation/communication organization. We describe two methods that take partitions of the input and output vectors of SpMV and produce s2D partitions while reducing the total communication volume. The first method obtains the optimal total communication volume, while the second heuristically reduces this quantity and takes the computational load balance into account. We demonstrate that the proposed partitioning method improves the performance of parallel SpMV operations both in theory and in practice with respect to 1D and 2D partitionings.