Guide-copy: Fast and silent migration of virtual machine for datacenters. Jihun Kim, Dongju Chae, Jangwoong Kim, Jong Kim. SC '13. DOI: 10.1145/2503210.2503251

Cloud infrastructure providers deploy Dynamic Resource Management (DRM) to minimize the cost of datacenter operation while maintaining Service Level Agreements (SLAs). Such DRM schemes depend on the ability to migrate virtual machine (VM) images. However, existing migration techniques are poorly suited to highly utilized clouds because of their latency- and bandwidth-critical memory transfer mechanisms. In this paper, we propose guide-copy migration, a novel VM migration scheme that provides fast and silent migration and works well even in highly utilized clouds. Guide-copy migration transfers only the memory pages that will be accessed at the destination node in the near future, by running a guide version of the VM at the source node and the migrated VM at the destination node simultaneously during migration. Guide-copy's highly accurate, low-bandwidth memory transfer mechanism enables fast and silent VM migration that maintains the SLA of all VMs in the cloud.
Practical nonvolatile multilevel-cell phase change memory. D. Yoon, Jichuan Chang, R. Schreiber, N. Jouppi. SC '13. DOI: 10.1145/2503210.2503221

Multilevel-cell (MLC) phase change memory (PCM) may provide both high-capacity main memory and faster-than-Flash persistent storage. However, a slow increase in cell resistance over time, known as resistance drift, can cause transient errors in MLC-PCM. Drift errors accumulate with time, and prior work suggests refreshing cells before they lose data. The need for refresh makes MLC-PCM volatile, taking away a key advantage. Based on the observation that most drift errors occur in one particular state of four-level-cell PCM, we propose moving from four levels to three, eliminating the most vulnerable state. This simple change lowers cell drift error rates by many orders of magnitude: three-level-cell PCM can retain data without power for more than ten years. With optimized encoding/decoding and a wearout-tolerance mechanism, we can narrow the capacity gap between three-level and four-level cells. Together, these techniques enable low-cost, high-performance, genuinely nonvolatile MLC-PCM.
Solving the compressible Navier-Stokes equations on up to 1.97 million cores and 4.1 trillion grid points. I. Bermejo-Moreno, J. Bodart, J. Larsson, Blaise M. Barney, J. Nichols, Steve Jones. SC '13. DOI: 10.1145/2503210.2503265
We present weak and strong scaling studies as well as performance analyses of the Hybrid code, a finite-difference solver of the compressible Navier-Stokes equations on structured grids used for the direct numerical simulation of isotropic turbulence and its interaction with shock waves. Parallelization is achieved through MPI, emphasizing the use of nonblocking communication overlapped with concurrent computation. The simulations, scaling, and performance studies were performed on the Sequoia, Vulcan, and Vesta Blue Gene/Q systems, the first two of which account for a combined total of 1,966,080 cores. The maximum number of grid points simulated was 4.12 trillion, with a memory usage of approximately 1.6 PB. We also discuss the use of hyperthreading, which significantly improves the parallel performance of the code on this architecture.
{"title":"Solving the compressible Navier-Stokes equations on up to 1.97 million cores and 4.1 trillion grid points","authors":"I. Bermejo-Moreno, J. Bodart, J. Larsson, Blaise M. Barney, J. Nichols, Steve Jones","doi":"10.1145/2503210.2503265","DOIUrl":"https://doi.org/10.1145/2503210.2503265","url":null,"abstract":"We present weak and strong scaling studies as well as performance analyses of the Hybrid code, a finite-difference solver of the compressible Navier-Stokes equations on structured grids used for the direct numerical simulation of isotropic turbulence and its interaction with shock waves. Parallelization is achieved through MPI, emphasizing the use of nonblocking communication with concurrent computation. The simulations, scaling and performance studies were done on the Sequoia, Vulcan and Vesta Blue Gene/Q systems, the first two accounting for a total of 1,966,080 cores when used in combination. The maximum number of grid points simulated was 4.12 trillion, with a memory usage of approximately 1.6 PB. We discuss the use of hyperthreading, which significantly improves the parallel performance of the code on this architecture.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122175587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deterministic scale-free pipeline parallelism with hyperqueues. H. Vandierendonck, Kallia Chronaki, Dimitrios S. Nikolopoulos. SC '13. DOI: 10.1145/2503210.2503233
Ubiquitous parallel computing aims to make parallel programming accessible to a wide variety of programming areas using deterministic and scale-free programming models built on a task abstraction. However, it remains hard to reconcile these attributes with pipeline parallelism, where the number of pipeline stages is typically hard-coded in the program and defines the degree of parallelism. This paper introduces hyperqueues, a programming abstraction that enables the construction of deterministic and scale-free pipeline parallel programs. Hyperqueues extend the concept of Cilk++ hyperobjects to provide thread-local views on a shared data structure. While hyperobjects are organized around private local views, hyperqueues require shared concurrent views on the underlying data structure. We define the semantics of hyperqueues and describe their implementation in a work-stealing scheduler. We demonstrate scalable performance on pipeline-parallel PARSEC benchmarks and find that hyperqueues provide comparable or up to 30% better performance than POSIX threads and Intel's Threading Building Blocks. The latter are highly tuned to the number of available processing cores, while programs using hyperqueues are scale-free.
{"title":"Deterministic scale-free pipeline parallelism with hyperqueues","authors":"H. Vandierendonck, Kallia Chronaki, Dimitrios S. Nikolopoulos","doi":"10.1145/2503210.2503233","DOIUrl":"https://doi.org/10.1145/2503210.2503233","url":null,"abstract":"Ubiquitous parallel computing aims to make parallel programming accessible to a wide variety of programming areas using deterministic and scale-free programming models built on a task abstraction. However, it remains hard to reconcile these attributes with pipeline parallelism, where the number of pipeline stages is typically hard-coded in the program and defines the degree of parallelism. This paper introduces hyperqueues, a programming abstraction that enables the construction of deterministic and scale-free pipeline parallel programs. Hyperqueues extend the concept of Cilk++ hyperobjects to provide thread-local views on a shared data structure. While hyperobjects are organized around private local views, hyperqueues require shared concurrent views on the underlying data structure. We define the semantics of hyperqueues and describe their implementation in a work-stealing scheduler. We demonstrate scalable performance on pipeline-parallel PARSEC benchmarks and find that hyperqueues provide comparable or up to 30% better performance than POSIX threads and Intel's Threading Building Blocks. The latter are highly tuned to the number of available processing cores, while programs using hyperqueues are scale-free.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126592462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assessing the effects of data compression in simulations using physically motivated metrics. D. Laney, S. Langer, Christopher Weber, Peter Lindstrom, Al Wegener. SC '13. DOI: 10.1145/2503210.2503283
This paper examines whether lossy compression can be used effectively in physics simulations as a possible strategy to combat the expected data-movement bottleneck in future high performance computing architectures. We show that, for the codes and simulations we tested, compression levels of 3-5X can be applied without causing significant changes to important physical quantities. Rather than applying signal processing error metrics, we utilize physics-based metrics appropriate for each code to assess the impact of compression. We evaluate three different simulation codes: a Lagrangian shock-hydrodynamics code, an Eulerian higher-order hydrodynamics turbulence modeling code, and an Eulerian coupled laser-plasma interaction code. We compress relevant quantities after each time-step to approximate the effects of tightly coupled compression and study the compression rates to estimate memory and disk-bandwidth reduction. We find that the error characteristics of compression algorithms must be carefully considered in the context of the underlying physics being modeled.
{"title":"Assessing the effects of data compression in simulations using physically motivated metrics","authors":"D. Laney, S. Langer, Christopher Weber, Peter Lindstrom, Al Wegener","doi":"10.1145/2503210.2503283","DOIUrl":"https://doi.org/10.1145/2503210.2503283","url":null,"abstract":"This paper examines whether lossy compression can be used effectively in physics simulations as a possible strategy to combat the expected data-movement bottleneck in future high performance computing architectures. We show that, for the codes and simulations we tested, compression levels of 3-5X can be applied without causing significant changes to important physical quantities. Rather than applying signal processing error metrics, we utilize physics-based metrics appropriate for each code to assess the impact of compression. We evaluate three different simulation codes: a Lagrangian shock-hydrodynamics code, an Eulerian higher-order hydrodynamics turbulence modeling code, and an Eulerian coupled laser-plasma interaction code. We compress relevant quantities after each time-step to approximate the effects of tightly coupled compression and study the compression rates to estimate memory and disk-bandwidth reduction. We find that the error characteristics of compression algorithms must be carefully considered in the context of the underlying physics being modeled.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117315663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting application performance using supervised learning on communication features. Nikhil Jain, A. Bhatele, Michael P. Robson, T. Gamblin, L. Kalé. SC '13. DOI: 10.1145/2503210.2503263
Task mapping on torus networks has traditionally focused on either reducing the maximum dilation or average number of hops per byte for messages in an application. These metrics make simplified assumptions about the cause of network congestion, and do not provide accurate correlation with execution time. Hence, these metrics cannot be used to reasonably predict or compare application performance for different mappings. In this paper, we attempt to model the performance of an application using communication data, such as the communication graph and network hardware counters. We use supervised learning algorithms, such as randomized decision trees, to correlate performance with prior and new metrics. We propose new hybrid metrics that provide high correlation with application performance, and may be useful for accurate performance prediction. For three different communication patterns and a production application, we demonstrate a very strong correlation between the proposed metrics and the execution time of these codes.
{"title":"Predicting application performance using supervised learning on communication features","authors":"Nikhil Jain, A. Bhatele, Michael P. Robson, T. Gamblin, L. Kalé","doi":"10.1145/2503210.2503263","DOIUrl":"https://doi.org/10.1145/2503210.2503263","url":null,"abstract":"Task mapping on torus networks has traditionally focused on either reducing the maximum dilation or average number of hops per byte for messages in an application. These metrics make simplified assumptions about the cause of network congestion, and do not provide accurate correlation with execution time. Hence, these metrics cannot be used to reasonably predict or compare application performance for different mappings. In this paper, we attempt to model the performance of an application using communication data, such as the communication graph and network hardware counters. We use supervised learning algorithms, such as randomized decision trees, to correlate performance with prior and new metrics. We propose new hybrid metrics that provide high correlation with application performance, and may be useful for accurate performance prediction. For three different communication patterns and a production application, we demonstrate a very strong correlation between the proposed metrics and the execution time of these codes.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115566517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel design and performance of nested filtering factorization preconditioner. Long Qu, L. Grigori, F. Nataf. SC '13. DOI: 10.1145/2503210.2503287

We present the parallel design and performance of the nested filtering factorization preconditioner (NFF), which can be used for solving linear systems arising from the discretization of a system of PDEs on unstructured grids. NFF has limited memory requirements and is based on a two-level recursive decomposition that exploits a nested block arrow structure of the input matrix, obtained beforehand by using graph partitioning techniques. It also preserves several directions of interest of the input matrix to alleviate the effect of low-frequency modes on the convergence of iterative methods. For a boundary value problem with highly heterogeneous coefficients, discretized on three-dimensional grids with 64 million unknowns and 447 million nonzero entries, we show experimentally that NFF scales up to 2048 cores of Genci's Bull system (Curie) and is up to 2.6 times faster than the Restricted Additive Schwarz domain decomposition preconditioner implemented in PETSc.
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach. Dong Li, Zizhong Chen, Panruo Wu, J. Vetter. SC '13. DOI: 10.1145/2503210.2503226

Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by both ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT using an integrated view spanning both software and hardware, with the goal of improving the performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting code (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% of system energy (and up to 40% of dynamic memory energy) while improving performance by up to 18% over traditional approaches that combine ABFT with ECC.
Exploring the future of out-of-core computing with compute-local non-volatile memory. Myoungsoo Jung, E. Wilson, Wonil Choi, J. Shalf, H. Aktulga, Chao Yang, Erik Saule, Ümit V. Çatalyürek, M. Kandemir. SC '13. DOI: 10.1145/2503210.2503261
Drawing parallels to the rise of general-purpose graphics processing units (GPGPUs) as accelerators for specific high-performance computing (HPC) workloads, non-volatile memory (NVM) is increasingly used as an accelerator for I/O-intensive scientific applications. However, existing work has explored the use of NVM within dedicated I/O nodes, which are distant from the compute nodes that actually need such acceleration. As NVM bandwidth begins to outpace point-to-point network capacity, we argue for the need to break from the archetype of completely separated storage. In this work we therefore investigate co-locating NVM and compute by varying I/O interfaces, file systems, types of NVM, and both current and future SSD architectures, uncovering numerous bottlenecks implicit at these various levels of the I/O stack. We present novel hardware and software solutions, including the new Unified File System (UFS), to enable fuller utilization of the new compute-local NVM storage. Our experimental evaluation, which employs a real-world out-of-core (OoC) HPC application, demonstrates throughput increases in excess of an order of magnitude over current approaches.
{"title":"Exploring the future of out-of-core computing with compute-local non-volatile memory","authors":"Myoungsoo Jung, E. Wilson, Wonil Choi, J. Shalf, H. Aktulga, Chao Yang, Erik Saule, Ümit V. Çatalyürek, M. Kandemir","doi":"10.1145/2503210.2503261","DOIUrl":"https://doi.org/10.1145/2503210.2503261","url":null,"abstract":"Drawing parallels to the rise of general purpose graphical processing units (GPGPUs) as accelerators for specific high-performance computing (HPC) workloads, there is a rise in the use of non-volatile memory (NVM) as accelerators for I/O-intensive scientific applications. However, existing works have explored use of NVM within dedicated I/O nodes, which are distant from the compute nodes that actually need such acceleration. As NVM bandwidth begins to out-pace point-to-point network capacity, we argue for the need to break from the archetype of completely separated storage. Therefore, in this work we investigate co-location of NVM and compute by varying I/O interfaces, file systems, types of NVM, and both current and future SSD architectures, uncovering numerous bottlenecks implicit in these various levels in the I/O stack. We present novel hardware and software solutions, including the new Unified File System (UFS), to enable fuller utilization of the new compute-local NVM storage. Our experimental evaluation, which employs a real-world Out-of-Core (OoC) HPC application, demonstrates throughput increases in excess of an order of magnitude over current approaches.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122017952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable parallel graph partitioning. Shad Kirmani, P. Raghavan. SC '13. DOI: 10.1145/2503210.2503280

We consider partitioning a graph in parallel using a large number of processors. Parallel multilevel partitioners, such as Pt-Scotch and ParMetis, produce good-quality partitions, but their performance scales poorly. Coordinate bisection schemes such as those in Zoltan, which can be applied only to graphs with coordinates, scale well, but partition quality is often compromised. We seek to address this gap by developing a scalable parallel scheme that imparts coordinates to a graph through a lattice-based multilevel embedding. Partitions are computed with a parallel formulation of a geometric scheme that has been shown to provide provably good cuts on certain classes of graphs. We analyze the parallel complexity of our scheme and report speedups and cut sizes on large graphs. Our results indicate that our method is substantially faster than ParMetis and Pt-Scotch for hundreds to thousands of processors, while producing high-quality cuts.