Two Roads to Parallelism: From Serial Code to Programming with STAPL
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00048
Lawrence Rauchwerger
Parallel computers have come of age and need parallel software to justify their usefulness. There are two major avenues to get programs to run in parallel: parallelizing compilers and parallel languages and/or libraries. In this talk we present our latest results using both approaches and draw some conclusions about their relative effectiveness and potential.
{"title":"Two Roads to Parallelism: From Serial Code to Programming with STAPL","authors":"Lawrence Rauchwerger","doi":"10.1109/IPDPS.2019.00048","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00048","url":null,"abstract":"Parallel computers have come of age and need parallel software to justify their usefulness. There are two major avenues to get programs to run in parallel: parallelizing compilers and parallel languages and/or libraries. In this talk we present our latest results using both approaches and draw some conclusions about their relative effectiveness and potential.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123654150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effects and Benefits of Node Sharing Strategies in HPC Batch Systems
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00016
Alvaro Frank, Tim Süß, A. Brinkmann
Processor manufacturers today scale performance by increasing the number of cores on each CPU. Unfortunately, not all HPC applications can efficiently saturate all cores of a single node, even if they successfully scale to thousands of nodes. For these applications, sharing nodes with other applications can help stress different resources on the nodes and thus use them more efficiently. Previous work has shown that the performance impact of node sharing is highly application dependent, but very little work has studied its effects within batch systems and for complex parallel application mixes. Administrators therefore typically fear the complexity of running a batch system that supports node sharing, and fear that interference between co-allocated jobs leads to worse performance in practice. This paper focuses on sharing nodes by oversubscribing cores through hyper-threading. We introduce new node sharing strategies for batch systems by deriving extensions to the well-known backfill and first-fit algorithms. These strategies have been implemented in the SLURM workload manager, and the evaluation is based on NERSC Trinity scientific mini-applications. The evaluation of our node sharing strategies shows no overhead from co-allocation, along with a 19% increase in computational efficiency and a 25.2% increase in scheduling efficiency compared to standard node allocation.
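As a concrete illustration of the scheduling idea (a minimal sketch under an assumed data model, not the authors' SLURM implementation), a first-fit allocator can be extended with an oversubscription factor so that a job is co-allocated on a node whose physical cores are already taken, as long as the hardware-thread budget is not exceeded; the node layout and `oversub` parameter below are hypothetical.

```python
# Illustrative first-fit allocation with core oversubscription via hardware
# threads (hyper-threading). Hypothetical data model, not SLURM's.

def first_fit_oversub(nodes, job_cores, oversub=2):
    """nodes: list of dicts {'cores': physical cores, 'used': allocated slots}.
    With sharing, a node may hold up to cores * oversub slots.
    Returns the index of the first node that fits the job, or None."""
    for i, node in enumerate(nodes):
        capacity = node['cores'] * oversub  # hardware-thread budget
        if node['used'] + job_cores <= capacity:
            return i
    return None

nodes = [{'cores': 32, 'used': 32}, {'cores': 32, 'used': 10}]
print(first_fit_oversub(nodes, 16))     # 0: co-allocates on the full node
print(first_fit_oversub(nodes, 16, 1))  # 1: without oversubscription
```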
{"title":"Effects and Benefits of Node Sharing Strategies in HPC Batch Systems","authors":"Alvaro Frank, Tim Süß, A. Brinkmann","doi":"10.1109/IPDPS.2019.00016","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00016","url":null,"abstract":"Processor manufacturers today scale performance by increasing the number of cores on each CPU. Unfortunately, not all HPC applications can efficiently saturate all cores of a single node, even if they successfully scale to thousands of nodes. For these applications, sharing nodes with other applications can help to stress different resources on the nodes to more efficiently use them. Previous work has shown that the performance impact of node sharing is very application dependent but very little work has studied its effects within batch systems and for complex parallel application mixes. Administrators therefore typically fear the complexity of running a batch system supporting node sharing and also fear that interference between co-allocated jobs in practice leads to worse performance. This paper focuses on sharing nodes by oversubscribing cores through hyper-threading. We introduce new node sharing strategies for batch systems by deriving extensions to the well-known backfill and first fit algorithms. These strategies have been implemented in the SLURM workload manager and the evaluation is based on NERSC Trinity scientific mini applications. The evaluation of our node sharing strategies shows no overhead when using co-allocation, but an increased computational efficiency of 19% and an increased scheduling efficiency of 25.2% compared to standard node allocation.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128458700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SprintCon: Controllable and Efficient Computational Sprinting for Data Center Servers
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00090
Wenli Zheng, Xiaorui Wang, Yue Ma, Chao Li, Hao Lin, Bin Yao, Jianfeng Zhang, M. Guo
Computational sprinting is an effective mechanism to temporarily boost the performance of data center servers. However, despite its strong effect on performance, how to make the sprinting process controllable and how to maximize sprinting efficiency have not yet been well studied. These become significant problems for a data center when computational sprinting is needed for more than a few minutes, since sprinting then requires the support of energy storage, whose capacity is limited. The control and efficiency of sprinting involve not only how fast to run servers and how to allocate resources to co-running workloads, but also the impact of power overload and how to handle that overload with circuit breakers and energy storage to ensure power safety. Different workloads affect sprinting in different ways, so efficient sprinting requires workload-specific strategies. In this paper, we propose SprintCon to realize controllable and efficient computational sprinting for data center servers. SprintCon mainly consists of a power load allocator and two power controllers. The allocator decides how to divide the power load across the different power sources. The server power controller adapts the CPU cores that process batch workloads to improve efficiency in terms of computation, energy, and cost. The UPS power controller dynamically adjusts the discharge rate of the UPS energy storage to satisfy the time-varying power demand of interactive workloads and ensure power safety. The experimental results show that, compared to state-of-the-art solutions, SprintCon achieves 6-56% better computing performance and requires up to 87% less energy storage.
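To make the allocator's role concrete, here is a minimal sketch, under assumed numbers, of splitting a sprinting power load between the grid feed and UPS energy storage; the breaker limit, demand trace, and UPS capacity are illustrative values, not figures from the paper.

```python
# Illustrative split of a sprinting power load between the grid feed and UPS
# energy storage; all numbers are assumed, not measurements from the paper.

def allocate_power(demand_w, breaker_limit_w, ups_left_wh, dt_h):
    """Serve demand from the grid up to the breaker limit; any surplus
    (the overload) is drawn from the UPS. Returns (grid_w, ups_w, drawn_wh)."""
    grid = min(demand_w, breaker_limit_w)
    ups = max(0.0, demand_w - breaker_limit_w)  # overload covered by UPS
    drawn = ups * dt_h                          # energy taken this interval
    if drawn > ups_left_wh:
        raise RuntimeError("UPS exhausted: the sprint must be throttled")
    return grid, ups, drawn

ups_left = 50.0  # Wh remaining in the UPS
for demand in (400.0, 620.0, 700.0):  # one-minute demand samples, in watts
    grid, ups, drawn = allocate_power(demand, 500.0, ups_left, 1 / 60)
    ups_left -= drawn
    print(f"demand={demand:.0f}W grid={grid:.0f}W ups={ups:.0f}W left={ups_left:.2f}Wh")
```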
{"title":"SprintCon: Controllable and Efficient Computational Sprinting for Data Center Servers","authors":"Wenli Zheng, Xiaorui Wang, Yue Ma, Chao Li, Hao Lin, Bin Yao, Jianfeng Zhang, M. Guo","doi":"10.1109/IPDPS.2019.00090","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00090","url":null,"abstract":"Computational sprinting is an effective mechanism to temporarily boost the performance of data center servers. However, given the great effect on performance improvement, how to make the sprinting process controllable and how to maximize the sprinting efficiency have not been well discussed yet. Those can be significant problems for a data center when computational sprinting is needed for more than a few minutes, since it requires the support of energy storage, whose capacity is limited. The control and efficiency of sprinting not only involve how fast to run servers and how to allocate resources to co-running workloads, but also the impact on power overload, and how to handle the overload with circuit breakers and energy storage to ensure power safety. Different workloads can impact sprinting in different ways, and hence efficient sprinting requires workload-specific strategies. In this paper, we propose SprintCon to realize controllable and efficient computational sprinting for data center servers. SprintCon mainly consists of a power load allocator and two different power controllers. The allocator analyzes how to divide the power load to different power sources. The server power controller adapts the CPU cores that process batch workloads, to improve the efficiency in terms of computing, energy and cost. The UPS power controller dynamically adjusts the discharge rate of UPS energy storage to satisfy the time-varying power demand of interactive workloads, and ensure power safety. The experiment results show that compared to state-of-the-art solutions, SprintCon can achieve 6-56% better computing performance and up to 87% less demand of energy storage.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127823187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coding the Continuum
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00011
Ian T Foster
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
{"title":"Coding the Continuum","authors":"Ian T Foster","doi":"10.1109/IPDPS.2019.00011","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00011","url":null,"abstract":"In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and \"where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134349808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying Latent Reduced Models to Precondition Lossy Compression
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00039
Huizhang Luo, Dan Huang, Qing Liu, Zhenbo Qiao, Hong Jiang, J. Bi, Haitao Yuan, Mengchu Zhou, Jinzhen Wang, Zhenlu Qin
With the high volume and velocity of scientific data produced on high-performance computing systems, it has become increasingly critical to improve compression performance. Leveraging applications' general tolerance of reduced accuracy, lossy compressors can achieve much higher compression ratios with a user-prescribed error bound. However, they still fall far short of applications' reduction requirements. In this paper, we propose and evaluate the idea that data should be preconditioned prior to compression so that they better match the design philosophies of a compressor. In particular, we aim to identify a reduced model that can be used to transform the original data into a more compressible form. We begin with a case study of Heat3d as a proof of concept, in which we demonstrate that a reduced model can indeed reside in the full model output and can be used to improve compression ratios. We further explore more general dimension reduction techniques to extract the reduced model, including principal component analysis, singular value decomposition, and the discrete wavelet transform. After preconditioning, the reduced model is stored together with the delta, which results in higher compression ratios. We evaluate the reduced models on nine scientific datasets, and the results show the effectiveness of our approaches.
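The precondition-then-compress idea can be sketched as follows (an illustrative toy using a truncated SVD as the reduced model on synthetic data; it is not the paper's pipeline, and the rank `k` is an arbitrary choice):

```python
# Toy precondition-then-compress pipeline: a truncated SVD serves as the
# reduced model; only the small delta goes to the lossy compressor.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
# Smooth synthetic field: low-rank structure plus small noise.
field = np.outer(np.sin(4 * x), np.cos(3 * x)) + 1e-3 * rng.standard_normal((200, 200))

U, s, Vt = np.linalg.svd(field, full_matrices=False)
k = 4                                            # rank of the reduced model
reduced = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # stored as the model
delta = field - reduced                          # residual for the compressor

# The delta has a far smaller dynamic range than the original field, so an
# error-bounded lossy compressor can encode it much more compactly.
print(np.abs(field).max(), np.abs(delta).max())
```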
{"title":"Identifying Latent Reduced Models to Precondition Lossy Compression","authors":"Huizhang Luo, Dan Huang, Qing Liu, Zhenbo Qiao, Hong Jiang, J. Bi, Haitao Yuan, Mengchu Zhou, Jinzhen Wang, Zhenlu Qin","doi":"10.1109/IPDPS.2019.00039","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00039","url":null,"abstract":"With the high volume and velocity of scientific data produced on high-performance computing systems, it has become increasingly critical to improve the compression performance. Leveraging the general tolerance of reduced accuracy in applications, lossy compressors can achieve much higher compression ratios with a user-prescribed error bound. However, they are still far from satisfying the reduction requirements from applications. In this paper, we propose and evaluate the idea that data need to be preconditioned prior to compression, such that they can better match the design philosophies of a compressor. In particular, we aim to identify a reduced model that can be utilized to transform the original data to a more compressible form. We begin with a case study of Heat3d as a proof of concept, in which we demonstrate that a reduced model can indeed reside in the full model output, and can be utilized to improve compression ratios. We further explore more general dimension reduction techniques to extract the reduced model, including principal component analysis, singular value decomposition, and discrete wavelet transform. After preconditioning, the reduced model in conjunction with delta is stored, which results in higher compression ratios. We evaluate the reduced models on nine scientific datasets, and the results show the effectiveness of our approaches.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130836567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the Impact of Dynamic Power Capping on Application Progress
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00088
Srinivasan Ramesh, Swann Perarnau, Sridutt Bhalachandra, A. Malony, P. Beckman
Electrical power has become an important design constraint in high-performance computing (HPC) systems. On future HPC machines, power is likely to be a budgeted resource and thus managed dynamically. Power management software needs to reliably measure application performance at runtime in order to respond effectively to changes in application behavior. Execution time tells us little about how the science in the application is progressing toward an application-defined end goal. To the best of our knowledge, no study has defined or categorized online application progress in the context of power management. Based on semi-structured interviews with HPC application specialists, we define an online notion of progress: an application-specific metric that can be monitored at runtime to provide a sense of the rate at which application science is being performed. Using instrumentation, we characterize and categorize the progress of various production scientific applications and benchmarks. We propose a model of the impact of dynamic power capping on application progress. Through experimental evaluation, we show that our model accurately captures the general behavior of the progress of different classes of applications under a power cap. We believe that such a model is an important first step toward the design of more dynamic power management policies for HPC systems.
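As a loose illustration of what such a model might look like (a toy sensitivity model with invented parameters, not the model proposed in the paper), one can relate the achieved progress rate to the power cap and a per-application compute-boundedness factor:

```python
# Toy model (not the paper's): fraction of uncapped progress rate achieved
# under a power cap, for codes of different compute-boundedness.

def progress_rate(cap_w, uncapped_w, floor_w, sensitivity):
    """sensitivity=1.0 models a fully compute-bound code, 0.0 a fully
    memory-bound one; floor_w is the idle/uncore draw. All values invented."""
    if cap_w >= uncapped_w:
        return 1.0
    usable = max(cap_w - floor_w, 0.0) / (uncapped_w - floor_w)
    return (1.0 - sensitivity) + sensitivity * usable

for cap in (100, 150, 200, 250):  # watts
    print(cap,
          round(progress_rate(cap, 250, 80, 0.9), 2),   # compute-bound code
          round(progress_rate(cap, 250, 80, 0.2), 2))   # memory-bound code
```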
{"title":"Understanding the Impact of Dynamic Power Capping on Application Progress","authors":"Srinivasan Ramesh, Swann Perarnau, Sridutt Bhalachandra, A. Malony, P. Beckman","doi":"10.1109/IPDPS.2019.00088","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00088","url":null,"abstract":"Electrical power has become an important design constraint in high-performance computing (HPC) systems. On future HPC machines, power is likely to be a budgeted resource and thus managed dynamically. Power management software needs to reliably measure application performance at runtime in order to respond effectively to changes in application behavior. Execution time tells us little about how the science in the application is progressing toward an application-defined end goal. To the best of our knowledge, no study has defined or categorized online application progress in the context of power management. Based on semi-structured interviews with HPC application-specialists, we define an online notion of progress—an application-specific metric that can be monitored at runtime to provide a sense of the rate at which application science is being performed. Using instrumentation, we characterize and categorize the progress of various production scientific applications and benchmarks. We propose a model of the impact of dynamic power capping on application progress. By experimental evaluation, we show that our model accurately captures the general behavior of the progress of different classes of applications under a power cap. We believe that such a model is an important first step toward the design of more dynamic power management policies for HPC systems.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122235227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting the I/O-Complexity of Fast Matrix Multiplication with Recomputations
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00058
Roy Nissim, O. Schwartz
Communication costs, between processors and across the memory hierarchy, often dominate the runtime of algorithms. Can we trade these costs for recomputations? Most algorithms do not utilize recomputation for this end, and most communication cost lower bounds assume no recomputation, hence do not address this fundamental question. Recently, Bilardi and De Stefani (2017), and Bilardi, Scquizzato, and Silvestri (2018) showed that recomputations cannot reduce communication costs in Strassen's fast matrix multiplication and in fast Fourier transform. We extend the former bound and show that recomputations cannot reduce communication costs for a few other fast matrix multiplication algorithms.
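For context, the Strassen bound referenced above can be stated as follows: this is the known bandwidth-cost lower bound for Strassen's algorithm on a machine with fast memory of size M, which the cited works show holds even when recomputation is allowed (sketched here from the literature, with log2 7 being Strassen's exponent):

```latex
% Bandwidth-cost lower bound for Strassen's algorithm with fast memory of
% size M (Ballard et al.); Bilardi and De Stefani showed it is not reduced
% by recomputation. \log_2 7 \approx 2.81 is Strassen's exponent.
\[
  \mathrm{IO}(n, M) \;=\; \Omega\!\left(
    \left(\frac{n}{\sqrt{M}}\right)^{\log_2 7} \cdot M
  \right)
\]
```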
{"title":"Revisiting the I/O-Complexity of Fast Matrix Multiplication with Recomputations","authors":"Roy Nissim, O. Schwartz","doi":"10.1109/IPDPS.2019.00058","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00058","url":null,"abstract":"Communication costs, between processors and across the memory hierarchy, often dominate the runtime of algorithms. Can we trade these costs for recomputations? Most algorithms do not utilize recomputation for this end, and most communication cost lower bounds assume no recomputation, hence do not address this fundamental question. Recently, Bilardi and De Stefani (2017), and Bilardi, Scquizzato, and Silvestri (2018) showed that recomputations cannot reduce communication costs in Strassen's fast matrix multiplication and in fast Fourier transform. We extend the former bound and show that recomputations cannot reduce communication costs for a few other fast matrix multiplication algorithms.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126840626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A High-Performance Distributed Relational Database System for Scalable OLAP Processing
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00083
Jason Arnold, Boris Glavic, I. Raicu
The scalability of systems such as Hive and Spark SQL, which are built on top of big data platforms, has enabled query processing over very large data sets. However, the per-node performance of these systems is typically low compared to traditional relational databases. Conversely, Massively Parallel Processing (MPP) databases do not scale as well as these systems. We present HRDBMS, a fully implemented distributed shared-nothing relational database developed with the goal of improving the scalability of OLAP queries. HRDBMS achieves high scalability through a principled combination of techniques from relational and big data systems with novel communication and work-distribution techniques. While we also support serializable transactions, the system has not been optimized for this use case. HRDBMS runs on a custom distributed and asynchronous execution engine that was built from the ground up to support highly parallelized operator implementations. Our experimental comparison with Hive, Spark SQL, and Greenplum confirms that HRDBMS's scalability is on par with Hive and Spark SQL (up to 96 nodes), while its per-node performance can compete with MPP databases like Greenplum.
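As background for the shared-nothing design, the general pattern (hash-partition rows across workers, pre-aggregate locally, and ship only small partial results) can be sketched as follows; this illustrates the generic technique, not HRDBMS's internals:

```python
# Generic shared-nothing OLAP pattern (illustrative, not HRDBMS internals):
# hash-partition rows across workers, pre-aggregate locally, merge partials.
from collections import defaultdict

def partition(rows, n_workers, key):
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts

def local_aggregate(rows, key, val):
    acc = defaultdict(int)
    for row in rows:
        acc[row[key]] += row[val]  # per-worker partial SUM ... GROUP BY
    return acc

rows = [{'region': r, 'sales': s}
        for r, s in [('eu', 10), ('us', 5), ('eu', 7), ('asia', 2), ('us', 1)]]
partials = [local_aggregate(p, 'region', 'sales')
            for p in partition(rows, 3, 'region')]
final = defaultdict(int)
for part in partials:              # only small partials cross the network
    for group, total in part.items():
        final[group] += total
print(dict(final))                 # totals: eu=17, us=6, asia=2
```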
{"title":"A High-Performance Distributed Relational Database System for Scalable OLAP Processing","authors":"Jason Arnold, Boris Glavic, I. Raicu","doi":"10.1109/IPDPS.2019.00083","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00083","url":null,"abstract":"The scalability of systems such as Hive and Spark SQL that are built on top of big data platforms have enabled query processing over very large data sets. However, the per-node performance of these systems is typically low compared to traditional relational databases. Conversely, Massively Parallel Processing (MPP) databases do not scale as well as these systems. We present HRDBMS, a fully implemented distributed shared-nothing relational database developed with the goal of improving the scalability of OLAP queries. HRDBMS achieves high scalability through a principled combination of techniques from relational and big data systems with novel communication and work-distribution techniques. While we also support serializable transactions, the system has not been optimized for this use case. HRDBMS runs on a custom distributed and asynchronous execution engine that was built from the ground up to support highly parallelized operator implementations. Our experimental comparison with Hive, Spark SQL, and Greenplum confirms that HRDBMS's scalability is on par with Hive and Spark SQL (up to 96 nodes) while its per-node performance can compete with MPP databases like Greenplum.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124622347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00057
I. Yamazaki, Z. Bai, Ding Lu, J. Dongarra
Some scientific and engineering applications need to compute a large number of eigenpairs of a large Hermitian matrix. Though the Lanczos method is effective for computing a few eigenvalues, it can be expensive, in terms of both computation and communication, for computing a large number of eigenpairs. To improve the performance of the method, in this paper, we study an s-step variant of thick-restart Lanczos (TRLan) combined with explicit external deflation (EED). The s-step method generates a set of s basis vectors at a time, reducing the communication costs of generating the basis vectors. We then design a specialized matrix powers kernel (MPK) that further reduces the communication and computational costs by taking advantage of the special properties of the deflation matrix. We conducted numerical experiments with the new TRLan eigensolver using synthetic matrices and matrices from electronic structure calculations. The performance results on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) demonstrate the potential of the specialized MPK to significantly reduce the execution time of the TRLan eigensolver: speedups of up to 3.1× and 5.3× were obtained in our sequential and parallel runs, respectively.
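To illustrate the kind of kernel involved (a minimal dense sketch, not the paper's optimized MPK), the deflated operator B = A - U diag(d) U^T can be applied s times while exploiting the low rank of the deflation term instead of forming B explicitly; the matrix sizes and deflation setup below are invented for the example:

```python
# Minimal dense sketch of a deflated matrix powers kernel (not the paper's
# optimized MPK): apply B = A - U diag(d) U^T repeatedly, exploiting the
# low rank of the deflation term instead of forming B.
import numpy as np

def deflated_mpk(A, U, d, v, s):
    """Return [v, Bv, ..., B^s v] as columns, for B = A - U diag(d) U^T."""
    V = [v / np.linalg.norm(v)]
    for _ in range(s):
        w = A @ V[-1] - U @ (d * (U.T @ V[-1]))  # cheap low-rank correction
        V.append(w)
    return np.column_stack(V)

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100))
A = (A + A.T) / 2              # Hermitian test matrix
w, Q = np.linalg.eigh(A)
U, d = Q[:, -2:], w[-2:]       # deflate the two largest ("converged") eigenpairs
V = deflated_mpk(A, U, d, rng.standard_normal(100), s=4)
print(V.shape)                 # (100, 5)
```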
{"title":"Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation","authors":"I. Yamazaki, Z. Bai, Ding Lu, J. Dongarra","doi":"10.1109/IPDPS.2019.00057","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00057","url":null,"abstract":"Some scientific and engineering applications need to compute a large number of eigenpairs of a large Hermitian matrix. Though the Lanczos method is effective for computing a few eigenvalues, it can be expensive for computing a large number of eigenpairs (e.g., in terms of computation and communication). To improve the performance of the method, in this paper, we study an s-step variant of thick-restart Lanczos (TRLan) combined with an explicit external deflation (EED). The s-step method generates a set of s basis vectors at a time and reduces the communication costs of generating the basis vectors. We then design a specialized matrix powers kernel (MPK) that further reduces the communication and computational costs by taking advantage of the special properties of the deflation matrix. We conducted numerical experiments of the new TRLan eigensolver using synthetic matrices and matrices from electronic structure calculations. The performance results on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) demonstrate the potential of the specialized MPK to significantly reduce the execution time of the TRLan eigensolver. The speedups of up to 3.1× and 5.3× were obtained in our sequential and parallel runs, respectively.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124334955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Live VM Migration Algorithms to Minimize Total Migration Time and Downtime
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00051
Nikos Tziritas, Thanasis Loukopoulos, S. Khan, Chengzhong Xu, Albert Y. Zomaya
Virtual machine (VM) migration is widely used in cloud computing systems to increase reliability, and VMs are migrated for many other reasons during their lifetime, such as reducing energy consumption, improving performance, and maintenance. During a live VM migration, the VM remains up while all or part of its data is transmitted from source to destination; the remaining data are transmitted offline by suspending the VM. The longer the offline transmission time, the worse the performance of the VM, because its service is down while the offline transmission takes place. Because a running VM's memory is subject to change, already transmitted pages may get dirtied and thus need re-transmission. Deciding when to suspend the VM is therefore non-trivial: suspending early may transmit a significant amount of data offline, degrading the VM's performance, while waiting too long may re-transmit a huge amount of dirty data, wasting resources. In this paper, we tackle the joint problem of minimizing both the total VM migration time (reflecting the resources spent during a migration) and the VM downtime (reflecting the performance degradation), with the two objectives weighted according to the needs of the underlying cloud provider/user. To tackle the problem, we propose an online deterministic algorithm with a strong competitive ratio, as well as a randomized online algorithm that achieves significantly better results than the deterministic algorithm.
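For intuition about the trade-off (a toy pre-copy model with invented parameters, not the paper's algorithms), each pre-copy round re-sends the pages dirtied during the previous round, so suspending after more rounds shrinks downtime but lengthens total migration time:

```python
# Toy pre-copy live-migration model with invented parameters: each round
# re-sends the pages dirtied during the previous round; suspending later
# shrinks downtime but lengthens total migration time.

def precopy(mem_mb, bw_mbps, dirty_mbps, rounds):
    total_time, to_send = 0.0, float(mem_mb)
    for _ in range(rounds):
        t = to_send / bw_mbps                  # stream this round's data
        total_time += t
        to_send = min(dirty_mbps * t, mem_mb)  # pages dirtied meanwhile
    downtime = to_send / bw_mbps               # final offline copy, VM down
    return total_time + downtime, downtime

for k in (0, 1, 3, 6):
    total, down = precopy(mem_mb=4096, bw_mbps=1000, dirty_mbps=200, rounds=k)
    print(f"rounds={k}: total={total:.2f}s downtime={down:.2f}s")
```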
{"title":"Online Live VM Migration Algorithms to Minimize Total Migration Time and Downtime","authors":"Nikos Tziritas, Thanasis Loukopoulos, S. Khan, Chengzhong Xu, Albert Y. Zomaya","doi":"10.1109/IPDPS.2019.00051","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00051","url":null,"abstract":"Virtual machine (VM) migration is a widely used technique in cloud computing systems to increase reliability. There are also many other reasons that a VM is migrated during its lifetime, such as reducing energy consumption, improving performance, maintenance, etc. During a live VM migration, the underlying VM continues being up until all or part of its data has been transmitted from source to destination. The remaining data are transmitted in an off-line manner by suspending the corresponding VM. The longer the off-line transmission time, the worse the performance of the respective VM. The above is because during the off-line data transmission, the VM service is down. Because a running VM's memory is subject to changes, already transmitted data pages may get dirtied and thus needing re-transmission. The decision of when suspending the VM is not a trivial task at all. The above is justified by the fact that when suspending the VM early we may result in transmitting off-line a significant amount of data degrading thus the VM's performance. On the other hand, a long waiting time to suspend the VM may result in re-transmitting a huge amount of dirty data, leading in that way to waste of resources. In this paper, we tackle the joint problem of minimizing both the total VM migration time (reflecting the resources spent during a migration) and the VM downtime (reflecting the performance degradation). The aforementioned objective functions are weighted according to the needs of the underlying cloud provider/user. To tackle the problem, we propose an online deterministic algorithm resulting in an strong competitive ratio, as well as a randomized online algorithm achieving significantly better results against the deterministic algorithm.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129547531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}