V. Weber, C. Bekas, T. Laino, A. Curioni, A. Bertsch, S. Futral
In this work, we present a novel parallelization scheme for a highly efficient evaluation of the Hartree-Fock exact exchange (HFX) in ab initio molecular dynamics simulations, specifically tailored for condensed phase simulations. Our developments allow one to achieve the necessary accuracy for the evaluation of the HFX in a highly controllable manner. We show here that our solutions can take great advantage of the latest trends in HPC platforms, such as extreme threading, short vector instructions and highly dimensional interconnection networks. Indeed, all these trends are evident in the IBM Blue Gene/Q supercomputer. We demonstrate an unprecedented scalability up to 6,291,456 threads (96 BG/Q racks) with a near perfect parallel efficiency, which represents a more than 20-fold improvement as compared to the current state of the art. In terms of reduction of time to solution, we achieved an improvement that can surpass a 10-fold decrease in runtime with respect to directly comparable approaches. We exploit this development to enhance the accuracy of DFT based molecular dynamics by using the PBE0 hybrid functional. This approach allowed us to investigate the chemical behavior of organic solvents in one of the most challenging research topics in energy storage, lithium/air batteries, and to propose alternative solvents with enhanced stability to ensure an appropriate reversible electrochemical reaction. This step is key for the development of a viable lithium/air storage technology, which would have been a daunting computational task using standard methods. Recent research has shown that the electrolyte plays a key role in non-aqueous lithium/air batteries in producing the appropriate reversible electrochemical reduction. In particular, the chemical degradation of propylene carbonate, the typical electrolyte used, by lithium peroxide has been demonstrated by molecular dynamics simulations of highly realistic models. Reaching the necessary high accuracy in these simulations is a daunting computational task using standard methods.
{"title":"Shedding Light on Lithium/Air Batteries Using Millions of Threads on the BG/Q Supercomputer","authors":"V. Weber, C. Bekas, T. Laino, A. Curioni, A. Bertsch, S. Futral","doi":"10.1109/IPDPS.2014.81","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.81","url":null,"abstract":"In this work, we present a novel parallelization scheme for a highly efficient evaluation of the Hartree-Fock exact exchange (HFX) in ab initio molecular dynamics simulations, specifically tailored for condensed phase simulations. Our developments allow one to achieve the necessary accuracy for the evaluation of the HFX in a highly controllable manner. We show here that our solutions can take great advantage of the latest trends in HPC platforms, such as extreme threading, short vector instructions and highly dimensional interconnection networks. Indeed, all these trends are evident in the IBM Blue Gene/Q supercomputer. We demonstrate an unprecedented scalability up to 6,291,456 threads (96 BG/Q racks) with a near perfect parallel efficiency, which represents a more than 20-fold improvement as compared to the current state of the art. In terms of reduction of time to solution, we achieved an improvement that can surpass a 10-fold decrease in runtime with respect to directly comparable approaches. We exploit this development to enhance the accuracy of DFT based molecular dynamics by using the PBE0 hybrid functional. This approach allowed us to investigate the chemical behavior of organic solvents in one of the most challenging research topics in energy storage, lithium/air batteries, and to propose alternative solvents with enhanced stability to ensure an appropriate reversible electrochemical reaction. This step is key for the development of a viable lithium/air storage technology, which would have been a daunting computational task using standard methods. Recent research has shown that the electrolyte plays a key role in non-aqueous lithium/air batteries in producing the appropriate reversible electrochemical reduction. In particular, the chemical degradation of propylene carbonate, the typical electrolyte used, by lithium peroxide has been demonstrated by molecular dynamics simulations of highly realistic models. Reaching the necessary high accuracy in these simulations is a daunting computational task using standard methods.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130792369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MapReduce is a widely accepted framework for addressing big data challenges. Recently, it has also gained broad attention from scientists at the U.S. leadership computing facilities as a promising solution to process gigantic simulation results. However, conventional high-end computing systems are constructed based on the compute-centric paradigm while big data analytics applications prefer a data-centric paradigm such as MapReduce. This work characterizes the performance impact of key differences between compute- and data-centric paradigms and then provides optimizations to enable a dual-purpose HPC system that can efficiently support conventional HPC applications and new data analytics applications. Using a state-of-the-art MapReduce implementation Spark and the Hyperion system at Lawrence Livermore National Laboratory, we have examined the impact of storage architectures, data locality and task scheduling to the memory-resident MapReduce jobs. Based on our characterization and findings of the performance behaviors, we have introduced two optimization techniques, namely Enhanced Load Balancer and Congestion-Aware Task Dispatching, to improve the performance of Spark applications.
{"title":"Characterization and Optimization of Memory-Resident MapReduce on HPC Systems","authors":"Yandong Wang, R. Goldstone, Weikuan Yu, Teng Wang","doi":"10.1109/IPDPS.2014.87","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.87","url":null,"abstract":"MapReduce is a widely accepted framework for addressing big data challenges. Recently, it has also gained broad attention from scientists at the U.S. leadership computing facilities as a promising solution to process gigantic simulation results. However, conventional high-end computing systems are constructed based on the compute-centric paradigm while big data analytics applications prefer a data-centric paradigm such as MapReduce. This work characterizes the performance impact of key differences between compute- and data-centric paradigms and then provides optimizations to enable a dual-purpose HPC system that can efficiently support conventional HPC applications and new data analytics applications. Using a state-of-the-art MapReduce implementation Spark and the Hyperion system at Lawrence Livermore National Laboratory, we have examined the impact of storage architectures, data locality and task scheduling to the memory-resident MapReduce jobs. Based on our characterization and findings of the performance behaviors, we have introduced two optimization techniques, namely Enhanced Load Balancer and Congestion-Aware Task Dispatching, to improve the performance of Spark applications.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131553743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeremy T. Fineman, Calvin C. Newport, M. Sherr, Tonghe Wang
Finding a maximal independent set (MIS) is a classic problem in graph theory that has been widely studied in the context of distributed algorithms. Standard distributed solutions to the MIS problem focus on time complexity. In this paper, we also consider fairness. For a given MIS algorithm A and graph G, we define the inequality factor for A on G to be the largest ratio between the probabilities of the nodes joining an MIS in the graph. We say an algorithm is fair with respect to a family of graphs if it achieves a constant inequality factor for all graphs in the family. In this paper, we seek efficient and fair algorithms for common graph families. We begin by describing an algorithm that is fair and runs in O(log* n)-time in rooted trees of size n. Moving to unrooted trees, we describe a fair algorithm that runs in O(log n) time. Generalizing further to bipartite graphs, we describe a third fair algorithm that requires O(log2 n) rounds. We also show a fair algorithm for planar graphs that runs in O(log2 n) rounds, and describe an algorithm that can be run in any graph, yielding good bounds on inequality in regions that can be efficiently colored with a small number of colors. We conclude our theoretical analysis with a lower bound that identifies a graph where all MIS algorithms achieve an inequality bound in Ω(n)-eliminating the possibility of an MIS algorithm that is fair in all graphs. Finally, to motivate the need for provable fairness guarantees, we simulate both our tree algorithm and Luby's MIS algorithm [13] in a variety of different tree topologies-some synthetic and some derived from real world data. Whereas our algorithm always yield an inequality factor ≤3.25 in these simulations, Luby's algorithms yields factors as large as 168.
{"title":"Fair Maximal Independent Sets","authors":"Jeremy T. Fineman, Calvin C. Newport, M. Sherr, Tonghe Wang","doi":"10.1109/IPDPS.2014.79","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.79","url":null,"abstract":"Finding a maximal independent set (MIS) is a classic problem in graph theory that has been widely studied in the context of distributed algorithms. Standard distributed solutions to the MIS problem focus on time complexity. In this paper, we also consider fairness. For a given MIS algorithm A and graph G, we define the inequality factor for A on G to be the largest ratio between the probabilities of the nodes joining an MIS in the graph. We say an algorithm is fair with respect to a family of graphs if it achieves a constant inequality factor for all graphs in the family. In this paper, we seek efficient and fair algorithms for common graph families. We begin by describing an algorithm that is fair and runs in O(log* n)-time in rooted trees of size n. Moving to unrooted trees, we describe a fair algorithm that runs in O(log n) time. Generalizing further to bipartite graphs, we describe a third fair algorithm that requires O(log2 n) rounds. We also show a fair algorithm for planar graphs that runs in O(log2 n) rounds, and describe an algorithm that can be run in any graph, yielding good bounds on inequality in regions that can be efficiently colored with a small number of colors. We conclude our theoretical analysis with a lower bound that identifies a graph where all MIS algorithms achieve an inequality bound in Ω(n)-eliminating the possibility of an MIS algorithm that is fair in all graphs. Finally, to motivate the need for provable fairness guarantees, we simulate both our tree algorithm and Luby's MIS algorithm [13] in a variety of different tree topologies-some synthetic and some derived from real world data. Whereas our algorithm always yield an inequality factor ≤3.25 in these simulations, Luby's algorithms yields factors as large as 168.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121171701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a cluster scheduling technique for compute clusters with Xeon Phi coprocessors. Even though the Xeon Phi runs Linux which allows multiprocessing, cluster schedulers generally do not allow jobs to share coprocessors because sharing can cause oversubscription of coprocessor memory and thread resources. It has been shown that memory or thread oversubscription on a many core like the Phi results in job crashes or drastic performance loss. We first show that such an exclusive device allocation policy causes severe coprocessor underutilization: for typical workloads, on average only 38% of the Xeon Phi cores are busy across the cluster. Then, to improve coprocessor utilization, we propose a scheduling technique that enables safe coprocessor sharing without resource oversubscription. Jobs specify their maximum memory and thread requirements, and our scheduler packs as many jobs as possible on each coprocessor in the cluster, subject to resource limits. We solve this problem using a greedy approach at the cluster level combined with a knapsack-based algorithm for each node. Every coprocessor is modeled as a knapsack and jobs are packed into each knapsack with the goal of maximizing job concurrency, i.e., as many jobs as possible executing on each coprocessor. Given a set of jobs, we show that this strategy of packing for high concurrency is a good proxy for (i) reducing make span, without the need for users to specify job execution times and (ii) reducing coprocessor footprint, or the number of coprocessors required to finish the jobs without increasing make span. We implement the entire system as a seamless add on to Condor, a popular distributed job scheduler, and show make span and footprint reductions of more than 50% across a wide range of workloads.
{"title":"A Coprocessor Sharing-Aware Scheduler for Xeon Phi-Based Compute Clusters","authors":"G. Coviello, S. Cadambi, S. Chakradhar","doi":"10.1109/IPDPS.2014.44","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.44","url":null,"abstract":"We propose a cluster scheduling technique for compute clusters with Xeon Phi coprocessors. Even though the Xeon Phi runs Linux which allows multiprocessing, cluster schedulers generally do not allow jobs to share coprocessors because sharing can cause oversubscription of coprocessor memory and thread resources. It has been shown that memory or thread oversubscription on a many core like the Phi results in job crashes or drastic performance loss. We first show that such an exclusive device allocation policy causes severe coprocessor underutilization: for typical workloads, on average only 38% of the Xeon Phi cores are busy across the cluster. Then, to improve coprocessor utilization, we propose a scheduling technique that enables safe coprocessor sharing without resource oversubscription. Jobs specify their maximum memory and thread requirements, and our scheduler packs as many jobs as possible on each coprocessor in the cluster, subject to resource limits. We solve this problem using a greedy approach at the cluster level combined with a knapsack-based algorithm for each node. Every coprocessor is modeled as a knapsack and jobs are packed into each knapsack with the goal of maximizing job concurrency, i.e., as many jobs as possible executing on each coprocessor. Given a set of jobs, we show that this strategy of packing for high concurrency is a good proxy for (i) reducing make span, without the need for users to specify job execution times and (ii) reducing coprocessor footprint, or the number of coprocessors required to finish the jobs without increasing make span. We implement the entire system as a seamless add on to Condor, a popular distributed job scheduler, and show make span and footprint reductions of more than 50% across a wide range of workloads.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122016194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed applications often require high-performance networks with strict connectivity guarantees. For instance, many cloud applications suffer from today's variations of the intra-cloud bandwidth, which leads to poor and unpredictable application performance. Accordingly, we witness a trend towards virtual networks (VNets) which can provide resource isolation. Interestingly, while the problem of where to embed a VNet is fairly well-understood today, much less is known about when to optimally allocate a VNet. This however is important, as the requirements specified for a VNet do not have to be static, but can vary over time and even include certain temporal flexibilities. This paper initiates the study of the temporal VNet embedding problem (TVNEP). We propose a continuous-time mathematical programming approach to solve the TVNEP, and present and compare different algorithms. Based on these insights, we present the CSM-Model which incorporates both symmetry and state-space reductions to significantly speed up the process of computing exact solutions to the TVNEP. Based on the CSM-Model, we derive a greedy algorithm OGA to compute fast approximate solutions. In an extensive computational evaluation, we show that despite the hardness of the TVNEP, the CSM-Model is sufficiently powerful to solve moderately sized instances to optimality within one hour and under different objective functions (such as maximizing the number of embeddable VNets). We also show that the greedy algorithm exploits flexibilities well and yields good solutions. More generally, our results suggest that already little time flexibilities can improve the overall system performance significantly.
{"title":"It's About Time: On Optimal Virtual Network Embeddings under Temporal Flexibilities","authors":"Matthias Rost, S. Schmid, A. Feldmann","doi":"10.1109/IPDPS.2014.14","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.14","url":null,"abstract":"Distributed applications often require high-performance networks with strict connectivity guarantees. For instance, many cloud applications suffer from today's variations of the intra-cloud bandwidth, which leads to poor and unpredictable application performance. Accordingly, we witness a trend towards virtual networks (VNets) which can provide resource isolation. Interestingly, while the problem of where to embed a VNet is fairly well-understood today, much less is known about when to optimally allocate a VNet. This however is important, as the requirements specified for a VNet do not have to be static, but can vary over time and even include certain temporal flexibilities. This paper initiates the study of the temporal VNet embedding problem (TVNEP). We propose a continuous-time mathematical programming approach to solve the TVNEP, and present and compare different algorithms. Based on these insights, we present the CSM-Model which incorporates both symmetry and state-space reductions to significantly speed up the process of computing exact solutions to the TVNEP. Based on the CSM-Model, we derive a greedy algorithm OGA to compute fast approximate solutions. In an extensive computational evaluation, we show that despite the hardness of the TVNEP, the CSM-Model is sufficiently powerful to solve moderately sized instances to optimality within one hour and under different objective functions (such as maximizing the number of embeddable VNets). We also show that the greedy algorithm exploits flexibilities well and yields good solutions. More generally, our results suggest that already little time flexibilities can improve the overall system performance significantly.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123472117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Haidar, Chongxiao Cao, A. YarKhan, P. Luszczek, S. Tomov, K. Kabir, J. Dongarra
Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.
{"title":"Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment","authors":"A. Haidar, Chongxiao Cao, A. YarKhan, P. Luszczek, S. Tomov, K. Kabir, J. Dongarra","doi":"10.1109/IPDPS.2014.58","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.58","url":null,"abstract":"Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126239849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jae-Seung Yeom, A. Bhatele, K. Bisset, Eric J. Bohm, Abhishek K. Gupta, L. Kalé, M. Marathe, Dimitrios S. Nikolopoulos, M. Schulz, Lukasz Wesolowski
Modeling dynamical systems represents an important application class covering a wide range of disciplines including but not limited to biology, chemistry, finance, national security, and health care. Such applications typically involve large-scale, irregular graph processing, which makes them difficult to scale due to the evolutionary nature of their workload, irregular communication and load imbalance. EpiSimdemics is such an application simulating epidemic diffusion in extremely large and realistic social contact networks. It implements a graph-based system that captures dynamics among co-evolving entities. This paper presents an implementation of EpiSimdemics in Charm++ that enables future research by social, biological and computational scientists at unprecedented data and system scales. We present new methods for application-specific processing of graph data and demonstrate the effectiveness of these methods on a Cray XE6, specifically NCSA's Blue Waters system.
{"title":"Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters","authors":"Jae-Seung Yeom, A. Bhatele, K. Bisset, Eric J. Bohm, Abhishek K. Gupta, L. Kalé, M. Marathe, Dimitrios S. Nikolopoulos, M. Schulz, Lukasz Wesolowski","doi":"10.1109/IPDPS.2014.83","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.83","url":null,"abstract":"Modeling dynamical systems represents an important application class covering a wide range of disciplines including but not limited to biology, chemistry, finance, national security, and health care. Such applications typically involve large-scale, irregular graph processing, which makes them difficult to scale due to the evolutionary nature of their workload, irregular communication and load imbalance. EpiSimdemics is such an application simulating epidemic diffusion in extremely large and realistic social contact networks. It implements a graph-based system that captures dynamics among co-evolving entities. This paper presents an implementation of EpiSimdemics in Charm++ that enables future research by social, biological and computational scientists at unprecedented data and system scales. We present new methods for application-specific processing of graph data and demonstrate the effectiveness of these methods on a Cray XE6, specifically NCSA's Blue Waters system.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127335185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Storage elasticity on IaaS clouds is an important feature for data-intensive workloads: storage requirements can vary greatly during application runtime, making worst-case over-provisioning a poor choice that leads to unnecessarily tied-up storage and extra costs for the user. While the ability to adapt dynamically to storage requirements is thus attractive, how to implement it is not well understood. Current approaches simply rely on users to attach and detach virtual disks to the virtual machine (VM) instances and then manage them manually, thus greatly increasing application complexity while reducing cost efficiency. Unlike such approaches, this paper aims to provide a transparent solution that presents a unified storage space to the VM in the form of a regular POSIX file system that hides the details of attaching and detaching virtual disks by handling those actions transparently based on dynamic application requirements. The main difficulty in this context is to understand the intent of the application and regulate the available storage in order to avoid running out of space while minimizing the performance overhead of doing so. To this end, we propose a storage space prediction scheme that analyzes multiple system parameters and dynamically adapts monitoring based on the intensity of the I/O in order to get as close as possible to the real usage. We show the value of our proposal over static worst-case over-provisioning and simpler elastic schemes that rely on a reactive model to attach and detach virtual disks, using both synthetic benchmarks and real-life data-intensive applications. Our experiments demonstrate that we can reduce storage waste/cost by 30-40% with only 2-5% performance overhead.
{"title":"Bursting the Cloud Data Bubble: Towards Transparent Storage Elasticity in IaaS Clouds","authors":"Bogdan Nicolae, Pierre Riteau, K. Keahey","doi":"10.1109/IPDPS.2014.25","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.25","url":null,"abstract":"Storage elasticity on IaaS clouds is an important feature for data-intensive workloads: storage requirements can vary greatly during application runtime, making worst-case over-provisioning a poor choice that leads to unnecessarily tied-up storage and extra costs for the user. While the ability to adapt dynamically to storage requirements is thus attractive, how to implement it is not well understood. Current approaches simply rely on users to attach and detach virtual disks to the virtual machine (VM) instances and then manage them manually, thus greatly increasing application complexity while reducing cost efficiency. Unlike such approaches, this paper aims to provide a transparent solution that presents a unified storage space to the VM in the form of a regular POSIX file system that hides the details of attaching and detaching virtual disks by handling those actions transparently based on dynamic application requirements. The main difficulty in this context is to understand the intent of the application and regulate the available storage in order to avoid running out of space while minimizing the performance overhead of doing so. To this end, we propose a storage space prediction scheme that analyzes multiple system parameters and dynamically adapts monitoring based on the intensity of the I/O in order to get as close as possible to the real usage. We show the value of our proposal over static worst-case over-provisioning and simpler elastic schemes that rely on a reactive model to attach and detach virtual disks, using both synthetic benchmarks and real-life data-intensive applications. Our experiments demonstrate that we can reduce storage waste/cost by 30-40% with only 2-5% performance overhead.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126169073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents and evaluates a universal algorithm to improve the performance of MPI collective communication operations on hierarchical clusters with many-core nodes. This algorithm exploits shared-memory buffers for efficient intra-node communication while still allowing the use of unmodified, hierarchy-unaware traditional collectives for inter-node communication (including collectives like Alltoallv). This algorithm improves on past works that convert a specific collective algorithm into a hierarchical version and are generally restricted to fan-in, fan-out, and All gather algorithms. Experimental results show impressive performance improvements utilizing a variety of collectives from MPICH as well as the closed-source Cray MPT for the inter-node communication. The experimental evaluation tests the new algorithms with as many as 65536 cores and sees speedups over the baseline averaging 14.2x for Alltoallv, 26x for All gather, and 32.7x for Reduce-Scatter. The paper further improves inter-node communication by utilizing multiple senders from the same shared memory buffer, achieving additional speedups averaging 2.5x. The discussion also evaluates special-purpose extensions to improve intra-node communication by returning shared memory or copy-on-write protected buffers from the collective.
{"title":"Accelerating MPI Collective Communications through Hierarchical Algorithms Without Sacrificing Inter-Node Communication Flexibility","authors":"Benjamin S. Parsons, Vijay S. Pai","doi":"10.1109/IPDPS.2014.32","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.32","url":null,"abstract":"This paper presents and evaluates a universal algorithm to improve the performance of MPI collective communication operations on hierarchical clusters with many-core nodes. This algorithm exploits shared-memory buffers for efficient intra-node communication while still allowing the use of unmodified, hierarchy-unaware traditional collectives for inter-node communication (including collectives like Alltoallv). This algorithm improves on past works that convert a specific collective algorithm into a hierarchical version and are generally restricted to fan-in, fan-out, and All gather algorithms. Experimental results show impressive performance improvements utilizing a variety of collectives from MPICH as well as the closed-source Cray MPT for the inter-node communication. The experimental evaluation tests the new algorithms with as many as 65536 cores and sees speedups over the baseline averaging 14.2x for Alltoallv, 26x for All gather, and 32.7x for Reduce-Scatter. The paper further improves inter-node communication by utilizing multiple senders from the same shared memory buffer, achieving additional speedups averaging 2.5x. The discussion also evaluates special-purpose extensions to improve intra-node communication by returning shared memory or copy-on-write protected buffers from the collective.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121670765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, J. Dongarra
The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NDIVIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES.
{"title":"Improving the Performance of CA-GMRES on Multicores with Multiple GPUs","authors":"I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, J. Dongarra","doi":"10.1109/IPDPS.2014.48","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.48","url":null,"abstract":"The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NDIVIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113975580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}