Main-Memory Requirements of Big Data Applications on Commodity Server Platform
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00097
Hosein Mohammadi Makrani, S. Rafatirad, A. Houmansadr, H. Homayoun
The emergence of big data frameworks requires computational and memory resources that can naturally scale to manage massive amounts of diverse data. It is currently unclear whether big data frameworks such as Hadoop, Spark, and MPI will require high-bandwidth, large-capacity memory to cope with this change. The primary purpose of this study is to answer this question through an empirical analysis of the memory configurations available for commodity servers and to assess the impact of these configurations on the performance of the Hadoop and Spark frameworks and of MPI-based applications. Our results show that neither DRAM capacity, frequency, nor the number of channels plays a critical role in the performance of any of the studied Hadoop applications or of most of the studied Spark applications. However, our results reveal that iterative tasks (e.g., machine learning) in Spark and MPI benefit from high-bandwidth, large-capacity memory.
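As a rough companion to the study's question, the sketch below measures sustained host memory copy bandwidth with NumPy. It is an assumed stand-in for checking how close a workload sits to the DRAM bandwidth ceiling; the paper's methodology relies on full Hadoop, Spark, and MPI runs on servers with different DIMM populations, none of which is reproduced here.

```python
# Rough memory-bandwidth probe (illustrative stand-in, not the paper's tooling).
import time

import numpy as np

def stream_copy_gbs(n=20_000_000, reps=5):
    """Best observed copy bandwidth in GB/s over `reps` trials."""
    a = np.ones(n, dtype=np.float64)
    b = np.empty_like(a)
    best = 0.0
    for _ in range(reps):
        t0 = time.perf_counter()
        np.copyto(b, a)  # streams n*8 bytes out of and into DRAM
        dt = time.perf_counter() - t0
        best = max(best, (2 * n * 8) / dt / 1e9)
    return best

print(f"~{stream_copy_gbs():.1f} GB/s sustained copy bandwidth")
```

A workload whose bandwidth demand sits far below this ceiling gains little from faster or wider DRAM, which matches the paper's conclusion for the studied Hadoop applications.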
Efficient Fault Tolerance Through Dynamic Node Replacement
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00031
Suraj Prabhakaran, M. Neumann, F. Wolf
The mean time between failures of upcoming exascale systems is expected to be one hour or less. To allow applications to complete successfully in such scenarios, several improved checkpoint/restart mechanisms, such as multi-level checkpointing, are being developed. Today, resource management systems handle job interruptions due to node failures by restarting the affected job from a checkpoint on a fresh set of nodes. This method, however, adds non-negligible overhead and does not allow taking full advantage of multi-level checkpointing in future systems. Alternatively, spare nodes can be allocated to each job so that only the processes that die on the failed nodes need to be restarted on the spares. However, given the magnitude of the expected failure rates, the number of spare nodes to be allocated for each job would be high, causing significant resource wastage. This work proposes a dynamic way of handling node failures by enabling on-the-fly replacement of failed nodes with healthy ones. We propose a dynamic node replacement algorithm that finds replacement nodes by exploiting the flexibility of moldable and malleable jobs. Our evaluation with a simulator shows that this approach can maintain high throughput even when a system experiences frequent node failures, making it an ideal complement to multi-level checkpointing.
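A minimal sketch of the replacement policy's flavor, assuming malleable jobs can release a node at runtime; the data layout and the shrink-the-widest-donor heuristic are illustrative choices, not the paper's algorithm.

```python
# Hypothetical node-replacement policy: prefer idle spares, otherwise
# shrink a running malleable job that is above its minimum width.
def replace_failed_node(failed, idle_nodes, jobs):
    """Return a healthy node to stand in for `failed`, or None."""
    if idle_nodes:
        return idle_nodes.pop()
    donors = [j for j in jobs
              if j["malleable"] and len(j["nodes"]) > j["min_nodes"]]
    if donors:
        donor = max(donors, key=lambda j: len(j["nodes"]))
        return donor["nodes"].pop()  # malleable jobs shrink without restart
    return None  # no replacement available; the job waits in the queue

# One idle spare exists, so the failed node is replaced immediately.
jobs = [{"malleable": True, "min_nodes": 2, "nodes": ["n3", "n4", "n5"]}]
print(replace_failed_node("n1", ["n9"], jobs))  # -> n9
```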
Optimizing Preconditioned Conjugate Gradient on TaihuLight for OpenFOAM
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00042
James Lin, Minhua Wen, Delong Meng, Xin Liu, Akira Nukada, S. Matsuoka
Porting the domain-specific software OpenFOAM to the TaihuLight supercomputer is a challenging task, due to the highly memory-bound nature of both the supercomputer's processor (SW26010) and the software's linear solvers. Our study tackles this technical challenge in three steps by optimizing the linear solvers, such as the Preconditioned Conjugate Gradient (PCG), on the SW26010. First, to minimize the all_reduce communication cost of PCG, we developed a new algorithm, RNPCG, a non-blocking PCG that leverages the on-chip register communication. Second, we optimized three key kernels of the PCG, including a localized version of the Diagonal-based Incomplete Cholesky (LDIC) preconditioner. Third, to scale the RNPCG on TaihuLight, we designed three-level non-blocking all_reduce operations. With these three steps, we implemented the RNPCG in OpenFOAM. Experimental results on TaihuLight show that 1) compared with the default implementations in OpenFOAM, the RNPCG and the LDIC on a single core group of the SW26010 achieve maximum speedups of 8.9X and 3.1X, respectively; and 2) the scalable RNPCG outperforms the standard PCG in both strong and weak scaling up to 66,560 cores.
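For orientation, the sketch below is a plain blocking PCG in NumPy with its two global reductions marked; RNPCG's contribution is hiding exactly these all_reduce latencies behind computation using on-chip register communication, which a sequential sketch cannot show. The Jacobi preconditioner here is a stand-in for DIC/LDIC.

```python
# Textbook preconditioned conjugate gradient; each dot product below
# becomes a global all_reduce when the vectors are distributed.
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=200):
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv @ r          # apply the preconditioner
    p = z.copy()
    rz = r @ z             # reduction 1: overlapped in a non-blocking PCG
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)   # reduction 2
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Tiny SPD system with a Jacobi (diagonal) preconditioner.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(pcg(A, b, np.diag(1.0 / np.diag(A))))  # -> [1/11, 7/11]
```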
Intelligently-Automated Facilities Expansion with the HEPCloud Decision Engine
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00053
P. Mhashilkar, Mine Altunay, W. Dagenhart, S. Fuess, B. Holzman, J. Kowalkowski, D. Litvintsev, Qiming Lu, A. Moibenko, M. Paterno, P. Spentzouris, S. Timm, A. Tiradani
The next generation of High Energy Physics experiments is expected to generate exabytes of data, two orders of magnitude more than the current generation. In order to reliably meet peak demands, facilities must either provision enough resources to cover the forecasted need or find ways to elastically expand their computational capabilities. Commercial cloud and allocation-based High Performance Computing (HPC) resources both have explicit and implicit costs that must be considered when deciding when to provision these resources and at what scale. To support such provisioning in a manner consistent with organizational business rules and budget constraints, we have developed a modular intelligent decision support system (IDSS) to aid in the automatic provisioning of resources spanning multiple cloud providers, multiple HPC centers, and grid computing federations.
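The toy function below conveys the flavor of a cost-aware provisioning decision across grid, HPC-allocation, and cloud pools: cover the demand cheapest-first without exceeding a budget. The pool names, prices, and greedy policy are invented for illustration and are not the HEPCloud Decision Engine's actual logic.

```python
# Illustrative greedy provisioning across resource pools under a budget.
def provision(demand_slots, pools, budget):
    """pools: dicts with 'name', 'slots', 'cost_per_slot'."""
    plan, spent = [], 0.0
    for pool in sorted(pools, key=lambda p: p["cost_per_slot"]):
        if demand_slots <= 0:
            break
        take = min(demand_slots, pool["slots"])
        cost = take * pool["cost_per_slot"]
        if spent + cost > budget:  # clip to the remaining budget
            take = int((budget - spent) / pool["cost_per_slot"])
            cost = take * pool["cost_per_slot"]
        if take > 0:
            plan.append((pool["name"], take))
            demand_slots -= take
            spent += cost
    return plan, demand_slots  # any unmet demand stays queued

pools = [
    {"name": "grid", "slots": 5000, "cost_per_slot": 0.0},
    {"name": "hpc_allocation", "slots": 2000, "cost_per_slot": 0.01},
    {"name": "cloud", "slots": 100000, "cost_per_slot": 0.05},
]
print(provision(10000, pools, budget=300.0))
```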
SAIDS: A Self-Adaptable Intrusion Detection System for IaaS Clouds
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00054
Anna Giannakou, Louis Rilling, C. Morin, Jean-Louis Pazat
IaaS clouds allow customers (called tenants) to deploy their IT as virtualized infrastructures. However, IaaS cloud features such as multi-tenancy and elasticity create new security vulnerabilities, for which security monitoring must be partly run by the cloud provider to give visibility at the virtualization-infrastructure level. Unfortunately, the same features make virtualized infrastructures frequently reconfigurable and thus affect the ability of a provider-run security monitoring system to detect attacks.
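The abstract stops at the problem statement, but the title points to self-adaptation on reconfiguration. The sketch below shows one plausible shape of such a loop, keeping each host's IDS probe scoped to the VMs it must watch; the event types and fields are purely illustrative, not SAIDS's API.

```python
# Hypothetical adaptation step: keep per-host IDS monitoring scopes in
# sync with VM lifecycle and migration events.
def adapt(ids_probes, event):
    if event["type"] == "vm_started":
        ids_probes.setdefault(event["host"], set()).add(event["vm"])
    elif event["type"] == "vm_stopped":
        ids_probes[event["host"]].discard(event["vm"])
    elif event["type"] == "vm_migrated":
        ids_probes[event["src_host"]].discard(event["vm"])
        ids_probes.setdefault(event["dst_host"], set()).add(event["vm"])
    # A real system would now push regenerated rules/filters to the
    # affected probes so the moved VM's traffic keeps being inspected.

probes = {"host1": {"vmA"}, "host2": set()}
adapt(probes, {"type": "vm_migrated", "vm": "vmA",
               "src_host": "host1", "dst_host": "host2"})
print(probes)  # {'host1': set(), 'host2': {'vmA'}}
```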
GPU-Accelerated Algorithms for Allocating Virtual Infrastructure in Cloud Data Centers
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00057
Lucas Leandro Nesi, M. A. Pillon, M. Assunção, G. Koslovski
Allocating IT resources to Virtual Infrastructures (VIs) (i.e., groups of VMs, virtual switches, and their network interconnections) is an NP-hard problem. Most allocation algorithms designed to run on CPUs face scalability issues on current cloud data centers comprising thousands of servers. This work proposes and evaluates a set of allocation algorithms refactored for Graphics Processing Units (GPUs). Experimental results demonstrate their ability to handle three large-scale data center topologies.
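To make the data-parallel idea concrete, the sketch below scores every server for one VM request in a single vectorized pass, with NumPy standing in for a GPU kernel; the best-fit scoring rule is an assumption, not the paper's algorithm.

```python
# Vectorized server scoring: evaluate all candidates at once rather than
# looping over thousands of servers on the CPU.
import numpy as np

def pick_server(cpu_free, ram_free, req_cpu, req_ram):
    feasible = (cpu_free >= req_cpu) & (ram_free >= req_ram)
    slack = (cpu_free - req_cpu) + (ram_free - req_ram)  # best fit: least slack
    slack[~feasible] = np.inf
    best = int(np.argmin(slack))
    return best if feasible[best] else None  # None: request cannot be placed

cpu_free = np.array([8.0, 2.0, 16.0])
ram_free = np.array([32.0, 4.0, 64.0])
print(pick_server(cpu_free, ram_free, req_cpu=4, req_ram=16))  # -> 0
```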
Accelerating Vertex Cover Optimization on a GPU Architecture
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00008
F. Abu-Khzam, DoKyung Kim, Matthew Perry, Kai Wang, Peter Shaw
Graphics Processing Units (GPUs) are gaining notable popularity due to their affordable high-performance multi-core architecture. They are particularly useful for massive computations that involve large data sets. In this paper, we present a highly scalable approach to the NP-hard Vertex Cover problem. Our method is based on an advanced data structure that reduces memory usage to expose more parallelism, and we propose a load balancing scheme that is effective for multi-GPU architectures. Our parallel algorithm was implemented on multiple AMD GPUs using OpenCL. Experimental results show that our approach achieves significant speedups on the hard instances of the DIMACS benchmarks as well as on the notoriously hard 120-Cell graph and its variants.
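For readers new to the problem, this tiny sequential branch-and-bound decides whether a cover of size at most k exists; the paper's contribution is exploring many such branches in parallel across GPUs with a compact data structure and load balancing, none of which this sketch attempts.

```python
# Classic bounded search for Vertex Cover: every edge (u, v) forces
# u or v into the cover, giving a branching factor of two per edge.
def vertex_cover(edges, k):
    """Return a cover of size <= k as a set, or None if none exists."""
    if not edges:
        return set()
    if k == 0:
        return None
    u, v = edges[0]
    for pick in (u, v):
        rest = [e for e in edges if pick not in e]
        sub = vertex_cover(rest, k - 1)
        if sub is not None:
            return sub | {pick}
    return None

# Triangle graph: the minimum cover has size 2.
print(vertex_cover([(1, 2), (2, 3), (1, 3)], k=2))  # e.g. {1, 2}
```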
Exposing Hidden Performance Opportunities in High Performance GPU Applications
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00045
Benjamin Welton, B. Miller
Leadership-class systems with nodes containing many-core accelerators, such as GPUs, have the potential to increase application performance. Effectively exploiting the parallelism provided by many-core accelerators requires developers to identify where accelerator parallelization would provide benefit and to ensure efficient interaction between the CPU and the accelerator. In the abstract, these issues appear straightforward and well understood. However, we have found that significant untapped performance opportunities exist in these areas even in well-known, heavily optimized, real-world applications created by experienced GPU developers. These opportunities exist because accelerated libraries can create unexpected synchronization delays and memory transfer requests, interactions between accelerated libraries can cause unexpected inefficiencies when they are combined, and vectorization opportunities can be hidden by the structure of the program. In the applications we studied (Qball, Qbox, HOOMD-blue, LAMMPS, and cuIBM), exploiting these opportunities reduced execution time by 18%-87%. In this work, we provide concrete evidence of the existence and impact of these performance issues in real-world applications today. We characterize the missed performance opportunities we identified by their underlying cause and describe a preliminary design of detection methods that performance tools can use to identify these missed opportunities.
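One simple way to surface the first class of issue, hidden synchronization, is host-side timing of supposedly asynchronous calls; the wrapper below is a hedged sketch with an assumed threshold, not the detection design the authors describe.

```python
# Flag "asynchronous" library calls that block the host suspiciously long,
# a symptom of hidden synchronization or implicit memory transfers.
import time

suspect_calls = []

def timed(fn, *args, threshold_ms=1.0):
    t0 = time.perf_counter()
    out = fn(*args)
    elapsed_ms = (time.perf_counter() - t0) * 1e3
    if elapsed_ms > threshold_ms:
        suspect_calls.append((fn.__name__, elapsed_ms))
    return out

def fake_kernel_launch():
    time.sleep(0.005)  # stand-in for an "async" call that secretly syncs

timed(fake_kernel_launch)
print(suspect_calls)  # [('fake_kernel_launch', ~5.0)]
```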
Process Affinity, Metrics and Impact on Performance: An Empirical Study
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00079
Cyril Bordage, E. Jeannot
Process placement, also called topology mapping, is a well-known strategy for improving parallel program execution by reducing the communication cost between processes. It requires two inputs: the topology of the target machine and a measure of the affinity between processes. In the literature, the dominant affinity measure is the communication matrix, which describes the amount of communication between processes. The goal of this paper is to study the accuracy of the communication matrix as a measure of affinity. We have performed an extensive set of tests on two fat-tree machines and a 3D-torus machine to evaluate several hypotheses that are often made in the literature and to discuss their validity. First, we check the correlation between algorithmic metrics and application performance. Then, we check whether a good generic process placement algorithm never degrades performance. Finally, we examine whether the structure of the communication matrix can be used to predict gains.
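A common algorithmic metric in this literature is hop-bytes: each pair's communication volume weighted by the topology distance between the cores their processes occupy. The sketch below computes it for a toy mapping so the paper's correlation question can be made concrete; the matrices are invented data, not the paper's.

```python
# Hop-bytes of a process-to-core mapping, given a communication matrix
# and a core-to-core hop-distance matrix.
import numpy as np

def hop_bytes(comm, dist, mapping):
    n = len(mapping)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += comm[i][j] * dist[mapping[i]][mapping[j]]
    return total

comm = np.array([[0, 100, 1], [100, 0, 1], [1, 1, 0]])   # bytes exchanged
dist = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]])       # hops on a line
print(hop_bytes(comm, dist, [0, 1, 2]))  # heavy pair adjacent -> 103.0
print(hop_bytes(comm, dist, [0, 2, 1]))  # heavy pair 2 hops apart -> 202.0
```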
RPPC: A Holistic Runtime System for Maximizing Performance Under Power Capping
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00019
Jinsu Park, Seongbeom Park, Woongki Baek
Maximizing performance in power-constrained computing environments is highly important in cloud and datacenter computing. To achieve the best possible performance of parallel applications under power capping, it is crucial to execute them with the optimal concurrency level and the optimal cross-component power allocation between CPUs and memory. Despite extensive prior work, efficient runtime support that maximizes the performance of parallel applications under power capping through the coordinated control of the concurrency level and cross-component power allocation remains unexplored. To bridge this gap, this work proposes RPPC, a holistic runtime system for maximizing performance under power capping. In contrast to the state-of-the-art techniques, RPPC robustly controls the two performance-critical knobs (i.e., concurrency level and cross-component power allocation) in a coordinated manner to maximize the performance of parallel applications under power capping. RPPC dynamically identifies the characteristics of the target parallel application and explores the system state space to find an efficient system state. Our experimental results demonstrate that RPPC significantly outperforms two state-of-the-art power-capping techniques, achieves performance comparable to a static best version that requires extensive per-application offline profiling, incurs small performance overheads, and provides a re-adaptation mechanism for external events such as changes in the total power budget.
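A hedged sketch of the coordinated-control idea: hill-climb jointly over (concurrency, CPU power, memory power) states while respecting the total cap, rather than tuning each knob in isolation. The toy performance model, step size, and starting point are invented for illustration; RPPC's actual search and measurement machinery are not detailed in the abstract.

```python
# Joint hill-climbing over concurrency and a CPU/memory power split
# under a total power cap.
def search(measure, cap_w, state=(16, 60, 20), step=4):
    best = measure(*state)
    improved = True
    while improved:
        improved = False
        t, c, m = state
        for cand in [(t + step, c, m), (max(1, t - step), c, m),
                     (t, c + step, m - step), (t, c - step, m + step)]:
            if cand[1] + cand[2] > cap_w or min(cand[1:]) <= 0:
                continue  # infeasible: violates the cap or zeroes a component
            perf = measure(*cand)
            if perf > best:
                best, state, improved = perf, cand, True
    return state, best

# Toy model standing in for real measurements under capping.
def toy_measure(threads, cpu_w, mem_w):
    return min(threads, 32) * min(cpu_w, 70) ** 0.5 * min(mem_w, 40) ** 0.3

print(search(toy_measure, cap_w=90))
```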