V. Elisseev, John Baker, Neil Morgan, L. Brochard, W. T. Hewitt
Power consumption of the world's leading supercomputers is of the order of tens of megawatts (MW). Energy efficiency and power management of High Performance Computing (HPC) systems are therefore among the main goals of the HPC community. This paper presents our study of managing the energy consumption of supercomputers with the energy-aware workload management software IBM Platform Load Sharing Facility (LSF). We analyze the energy consumption and workloads of the IBM NextScale cluster BlueWonder, located at the Daresbury Laboratory, STFC, UK. We describe power management algorithms implemented as Energy Aware Scheduling (EAS) policies in IBM Platform LSF. We show the effect of these policies on supercomputer efficiency and power consumption using experimental as well as simulated data from scientific workloads on BlueWonder. We observed energy savings of up to 12% from the EAS policies.
V. Elisseev, John Baker, Neil Morgan, L. Brochard, W. T. Hewitt, "Energy Aware Scheduling Study on BlueWonder," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.14.
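One family of energy-aware scheduling policies picks a per-job CPU frequency that minimizes predicted energy. The sketch below illustrates that idea only; it is not the actual LSF EAS algorithm, and the power/runtime models and all numbers are illustrative assumptions.

```python
# Hedged sketch: choose the CPU frequency minimizing predicted job energy.
# NOT the IBM Platform LSF EAS implementation; models are illustrative.

def predicted_runtime(t_nominal, f_nominal, f, compute_fraction):
    """Only the compute-bound fraction of the job slows down when the
    clock is lowered; the memory-bound remainder is unaffected."""
    return t_nominal * (compute_fraction * f_nominal / f + (1.0 - compute_fraction))

def predicted_power(p_base, p_dyn_nominal, f_nominal, f):
    """Static base power plus dynamic power scaling roughly with frequency
    (voltage scaling would make the dynamic term super-linear)."""
    return p_base + p_dyn_nominal * (f / f_nominal)

def best_frequency(frequencies, t_nominal, f_nominal, p_base, p_dyn, compute_fraction):
    """Return (frequency, energy) minimizing predicted job energy."""
    def energy(f):
        return predicted_power(p_base, p_dyn, f_nominal, f) * \
               predicted_runtime(t_nominal, f_nominal, f, compute_fraction)
    f = min(frequencies, key=energy)
    return f, energy(f)

# A memory-bound job (compute_fraction=0.3) favors a lower clock:
freqs = [1.2e9, 1.6e9, 2.0e9, 2.4e9]
f_opt, e_opt = best_frequency(freqs, t_nominal=100.0, f_nominal=2.4e9,
                              p_base=50.0, p_dyn=60.0, compute_fraction=0.3)
```

With these made-up coefficients the memory-bound job lands at an intermediate frequency rather than the lowest one, because static power makes running too slowly costly.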
The increasing importance of energy consumption in scientific computing has driven interest in developing energy-efficient high performance systems. The energy constraints of mobile computing have motivated the design and evolution of low-power computing systems capable of supporting a variety of compute-intensive user interfaces and applications. Others have observed that mobile devices are also evolving to provide high performance [14]. Their work has primarily examined the performance and efficiency of compute-intensive scientific programs executed either on mobile systems or on hybrids of mobile CPUs grafted into non-mobile (sometimes HPC) systems [6, 12, 14]. This report describes an investigation of the performance and energy consumption of a single scientific code on five high performance and mobile systems, with the objective of identifying the performance and energy-efficiency implications of a variety of architectural features. The results of this pilot study suggest that the instruction set architecture (ISA) is less significant than other aspects of system architecture in achieving high performance at high efficiency. The strategy employed in this study may be extended to other scientific applications with a variety of memory access, computation, and communication properties.
David D. Pruitt, E. Freudenthal, "Preliminary Investigation of Mobile System Features Potentially Relevant to HPC," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.13.
Because cooling accounts for a significant portion of the total operating cost of supercomputers, improving the efficiency of the cooling mechanisms can substantially reduce that cost. This paper discusses two sources of cooling inefficiency in existing computing systems: temperature variations and reactive fan speed control. To address these problems, we propose a learning-based approach that uses a neural network model to accurately predict core temperatures, a preemptive fan control mechanism, and a thermal-aware load balancing algorithm that uses the temperature prediction model. We demonstrate that temperature variations among cores can be reduced from 9°C to 2°C, and that peak fan power can be reduced by 61%. These savings are realized with minimal performance degradation.
Bilge Acun, Eun Kyung Lee, Yoonho Park, L. Kalé, "Neural Network-Based Task Scheduling with Preemptive Fan Control," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.6.
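The core of preemptive fan control is to drive the fan off a *predicted* temperature rather than the current reading. The paper trains a neural network for this; the sketch below substitutes a one-step linear predictor so the control loop is visible, and every coefficient is an invented placeholder.

```python
# Hedged sketch of preemptive fan control: predict the next core temperature
# from current load and temperature, and raise fan speed before the spike
# arrives. The paper uses a neural network; a linear predictor stands in
# here, and all coefficients below are made up for illustration.

def predict_next_temp(temp_now, load_now, ambient=25.0, a=0.9, b=12.0):
    """T[k+1] ≈ ambient + a*(T[k]-ambient) + b*load[k]  (toy model)."""
    return ambient + a * (temp_now - ambient) + b * load_now

def fan_speed_for(temp):
    """Map a (predicted) temperature to a fan duty cycle in [0.2, 1.0]."""
    if temp <= 50.0:
        return 0.2
    if temp >= 80.0:
        return 1.0
    return 0.2 + 0.8 * (temp - 50.0) / 30.0

def preemptive_duty(temp_now, load_now):
    """Drive the fan from the predicted temperature, so it ramps up one
    control step earlier than a purely reactive controller would."""
    return fan_speed_for(predict_next_temp(temp_now, load_now))
```

Under a heavy load the preemptive duty cycle exceeds the reactive one at the same instantaneous temperature, which is exactly the behavior that flattens temperature spikes and peak fan power.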
D. Ellsworth, Tapasya Patki, M. Schulz, B. Rountree, A. Malony
Power is quickly becoming a first-class resource management concern in HPC. Upcoming HPC systems will likely be hardware-overprovisioned, which will require enhanced power management subsystems to prevent service interruptions. To advance the state of the art in HPC power management research, we are implementing SLURM plugins to explore a range of power-aware scheduling strategies. Our goal is to develop a coherent platform that allows a direct comparison of various power-aware approaches on research as well as production clusters.
D. Ellsworth, Tapasya Patki, M. Schulz, B. Rountree, A. Malony, "A Unified Platform for Exploring Power Management Strategies," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.10.
Tapasya Patki, D. Lowenthal, B. Rountree, M. Schulz, B. Supinski
Recent research has established that hardware overprovisioning can significantly improve system power utilization as well as job throughput in power-constrained, high-performance computing environments. These benefits, however, may come with an additional infrastructure cost, making hardware-overprovisioned systems less economically viable. It is thus important to conduct a detailed cost-benefit analysis before investing in such systems at a large scale. In this paper, we develop a model to conduct this analysis and show that, for a given fixed infrastructure cost budget and system power budget, hardware-overprovisioned systems can yield a net performance benefit over traditional, worst-case-provisioned HPC systems.
Tapasya Patki, D. Lowenthal, B. Rountree, M. Schulz, B. Supinski, "Economic Viability of Hardware Overprovisioning in Power-Constrained High Performance Computing," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.12.
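The cost-benefit question above can be made concrete with back-of-the-envelope arithmetic: under a fixed hardware budget and a fixed system power budget, do more power-capped nodes beat fewer full-power nodes? The numbers and the concave performance model below are illustrative assumptions, not the paper's model.

```python
# Hedged sketch of the overprovisioning trade-off. All figures invented.

def node_perf(power_w, p_min=100.0, p_peak=300.0):
    """Concave per-node performance vs. power: steep gains just above the
    minimum operating power, diminishing returns near peak power."""
    if power_w < p_min:
        return 0.0
    p = min(power_w, p_peak)
    return 1.0 + 0.9 * (p - p_min) / (p_peak - p_min)   # 1.0 .. 1.9 "units"

def system_perf(num_nodes, power_budget_w):
    """Split the power budget evenly across nodes (a common simplification)."""
    return num_nodes * node_perf(power_budget_w / num_nodes)

cost_budget, cost_per_node = 1_000_000.0, 5_000.0   # can afford 200 nodes
power_budget = 45_000.0                             # watts, system-wide

worst_case_nodes = int(power_budget // 300.0)       # 150 nodes at full power
max_affordable = int(cost_budget // cost_per_node)  # 200 nodes, power-capped

baseline = system_perf(worst_case_nodes, power_budget)
overprov = max(system_perf(n, power_budget)
               for n in range(worst_case_nodes, max_affordable + 1))
```

Because the performance curve is concave, spreading the same 45 kW over more, slower nodes wins here; a different (less concave) curve or higher node cost would flip the conclusion, which is the point of doing the analysis.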
R. Schöne, T. Ilsche, Mario Bielert, Daniel Molka, D. Hackenberg
Current Intel processors implement a variety of power saving features like frequency scaling and idle states. These mechanisms limit the power draw and thereby decrease the thermal dissipation of the processors. However, they also have an impact on the achievable performance. The various mechanisms significantly differ regarding the amount of power savings, the latency of mode changes, and the associated overhead. In this paper, we describe and closely examine the so-called software controlled clock modulation mechanism for different processor generations. We present results that imply that the available documentation is not always correct and describe when this feature can be used to improve energy efficiency. We additionally compare it against the more popular feature of dynamic voltage and frequency scaling and develop a model to decide which feature should be used to optimize inter-process synchronizations on Intel Haswell-EP processors.
R. Schöne, T. Ilsche, Mario Bielert, Daniel Molka, D. Hackenberg, "Software Controlled Clock Modulation for Energy Efficiency Optimization on Intel Processors," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.15.
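A decision model like the one described has to weigh each mechanism's transition latency against its power reduction over the expected wait. The sketch below captures that shape only; the latency and relative-power figures are placeholders, not the paper's measured Haswell-EP values.

```python
# Hedged sketch: pick DVFS vs. clock modulation for a predicted
# synchronization wait. Numbers are invented placeholders.

MECHANISMS = {
    # name: (one-way transition latency in µs, relative power while waiting)
    "dvfs":      (50.0, 0.60),  # also lowers voltage -> bigger power cut
    "clock_mod": (1.0,  0.75),  # duty-cycle gating only, but switches fast
}

def waiting_energy(mechanism, wait_us, full_power=1.0):
    """Relative energy over the wait: two transitions charged at full power,
    plus the remaining time at the mechanism's reduced power."""
    latency, rel_power = MECHANISMS[mechanism]
    transition = 2.0 * latency
    if transition >= wait_us:          # too short to amortize the switch
        return wait_us * full_power
    return transition * full_power + (wait_us - transition) * rel_power

def pick_mechanism(wait_us):
    return min(MECHANISMS, key=lambda m: waiting_energy(m, wait_us))
```

Short waits favor the fast-switching mechanism, long waits the deeper power reduction, matching the paper's motivation for using clock modulation around fine-grained inter-process synchronization.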
This paper explores the potential benefits of asynchronous task-based execution for achieving high performance under a power cap. Compared to bulk-synchronous, step-by-step execution, task-graph schedulers can flexibly reorder tasks and assign compute resources to data-parallel (elastic) tasks to minimize execution time. Efficient utilization of the available cores becomes challenging when a power cap is imposed. This work characterizes the trade-offs between power and performance as a Pareto frontier, identifying the set of configurations that achieve the best performance for a given amount of power. We present a set of scheduling heuristics that leverage this information dynamically during execution to ensure that the processing cores are used efficiently when running under a power cap. This work examines the behavior of three HPC applications on a 57-core Intel Xeon Phi device, demonstrating a significant performance increase over the baseline.
E. Anger, Jeremiah J. Wilke, S. Yalamanchili, "Power-Constrained Performance Scheduling of Data Parallel Tasks," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.11.
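The Pareto-frontier step described above is straightforward to compute from measured (power, performance) pairs: discard any configuration that another configuration beats on both axes. The sample configurations below are invented for illustration.

```python
# Hedged sketch: Pareto frontier over (power, performance) configurations,
# plus the lookup a power-capped scheduler would perform against it.

def pareto_frontier(configs):
    """configs: list of (power_w, perf). Returns the non-dominated points
    sorted by power: a point survives only if every lower-power point
    has strictly lower performance."""
    frontier = []
    best_perf = float("-inf")
    for power, perf in sorted(configs):     # ascending power
        if perf > best_perf:                # beats all cheaper configurations
            frontier.append((power, perf))
            best_perf = perf
    return frontier

def best_under_cap(frontier, cap_w):
    """Best-performing frontier point whose power fits under the cap."""
    feasible = [(p, s) for p, s in frontier if p <= cap_w]
    return max(feasible, key=lambda ps: ps[1]) if feasible else None

configs = [(60, 1.0), (80, 1.5), (90, 1.4), (100, 1.9), (120, 1.9)]
front = pareto_frontier(configs)
```

At runtime a scheduler only ever needs the frontier, which is what makes consulting it on every decision cheap.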
Power is a major limiting factor for the future of HPC and for realizing exascale computing under a power budget. GPUs have become a mainstream parallel computing device in HPC, and optimizing their power usage is critical to achieving future goals. GPU memory is seldom studied, especially with respect to power usage. Nevertheless, memory accesses draw significant power and are critical to understanding and optimizing GPU power usage. In this work we investigate the power and performance characteristics of various GPU memory accesses. We take an empirical approach, experimentally examining how GPU power and performance vary with data access patterns and software parameters, including the GPU thread block size. In addition, we take into account dynamic voltage and frequency scaling (DVFS) of the GPU processing units and global memory. We analyze power and performance and suggest optimal parameters for applications that make heavy use of specific memory operations.
Tyler N. Allen, Rong Ge, "Characterizing Power and Performance of GPU Memory Access," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.8.
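An empirical study of this kind reduces to sweeping configurations, recording runtime and average power, and ranking by energy (or energy-delay product). The sketch below shows only that analysis step; the measurements in it are invented, not results from the paper.

```python
# Hedged sketch: rank (block size, frequency) configurations of a kernel
# by energy and energy-delay product. Sample measurements are invented.

def summarize(runs):
    """runs: list of dicts with block_size, freq_mhz, time_s, power_w.
    Adds energy_j (= time * avg power) and edp, sorted by energy."""
    out = []
    for r in runs:
        energy = r["time_s"] * r["power_w"]
        out.append({**r, "energy_j": energy, "edp": energy * r["time_s"]})
    return sorted(out, key=lambda r: r["energy_j"])

runs = [
    {"block_size": 128, "freq_mhz": 875, "time_s": 2.0, "power_w": 180.0},
    {"block_size": 256, "freq_mhz": 875, "time_s": 1.8, "power_w": 190.0},
    # memory-bound kernel: lowering the core clock barely slows it down
    {"block_size": 256, "freq_mhz": 705, "time_s": 1.9, "power_w": 150.0},
]
ranked = summarize(runs)
```

For the memory-bound case modeled here, the lower core frequency wins on energy despite a small runtime penalty, which is the kind of DVFS opportunity the study looks for.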
In this paper we introduce a novel, dense, system-on-chip, many-core Lenovo NeXtScale System® server based on the Cavium THUNDERX® ARMv8 processor, designed for performance, energy efficiency, and programmability. The THUNDERX processor scales up to 96 cores in a cache-coherent, shared memory architecture. Furthermore, this system has a power interface board (PIB) that measures power draw across the server board in the NeXtScale™ chassis with high accuracy. We use data from the PIB to measure the energy use of the PARSEC and Splash-2 benchmarks, and we demonstrate how to use the THUNDERX hardware counters to quantify the energy used by different aspects of shared memory programming, such as cache-coherent communication. We show that the energy required to keep caches coherent is negligible, and we demonstrate that the shared memory programming paradigm is a viable candidate for future energy-aware HPC designs.
Milos Puzovic, Srilatha Manne, Shay GalOn, M. Ono, "Quantifying Energy Use in Dense Shared Memory HPC Node," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.7.
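Counter-based energy attribution of the kind described above apportions measured board energy to activities in proportion to event counts weighted by an assumed per-event cost. The sketch below shows the arithmetic; the event names and per-event energies are invented, not the actual THUNDERX counters.

```python
# Hedged sketch of counter-based energy attribution. Events and per-event
# energy costs are hypothetical placeholders.

def attribute_energy(total_energy_j, event_counts, energy_per_event_j):
    """Each activity gets count * per-event cost; whatever measured energy
    is left unaccounted for goes into an 'other' bucket."""
    shares = {name: event_counts[name] * energy_per_event_j[name]
              for name in event_counts}
    accounted = sum(shares.values())
    shares["other"] = max(total_energy_j - accounted, 0.0)
    return shares

counts = {"coherence_msgs": 2_000_000, "dram_accesses": 50_000_000}
costs  = {"coherence_msgs": 1e-9, "dram_accesses": 2e-8}  # joules/event (assumed)
shares = attribute_energy(100.0, counts, costs)
```

Even with generous assumed per-message costs, the coherence share comes out as a tiny fraction of total energy, echoing the paper's conclusion that keeping caches coherent is cheap.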
Gary Lawson, Vaibhav Sundriyal, M. Sosonkina, Yuzhong Shen
Energy-efficient computing is crucial to achieving exascale performance. Power capping and dynamic voltage/frequency scaling may be used to achieve energy savings. The Intel Xeon Phi implements a power capping strategy in which power thresholds are employed to dynamically set the voltage/frequency at runtime. By default, these power limits are much higher than most applications would ever reach. Hence, this work aims to set the power limits according to workload characteristics and application performance. Models originally developed for CPU performance and power have been adapted here to determine power-limit thresholds on the Xeon Phi. A procedure to select these thresholds dynamically is then proposed, and its limitations are outlined. When this runtime procedure, along with static power-threshold assignment, was compared with the default execution, energy savings ranging from 5% to 49% were observed, mostly for memory-intensive applications.
Gary Lawson, Vaibhav Sundriyal, M. Sosonkina, Yuzhong Shen, "Runtime Power Limiting of Parallel Applications on Intel Xeon Phi Processors," 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC), 13 Nov. 2016. DOI: 10.1109/E2SC.2016.9.
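A runtime threshold-selection procedure of the general shape described above can be sketched as a search: lower the power limit step by step and keep the lowest limit whose measured slowdown stays within a tolerance. Everything here is a toy stand-in; `run_workload` is a hypothetical placeholder for "apply the limit, run a phase, return its runtime," not a real Xeon Phi interface.

```python
# Hedged sketch of a dynamic power-limit selection loop. The workload
# model and the 200 W "knee" are invented for illustration.

def run_workload(limit_w):
    """Toy stand-in: runtime is flat until the cap bites, then grows."""
    knee = 200.0                      # watts at which this workload saturates
    return 10.0 if limit_w >= knee else 10.0 * knee / limit_w

def pick_power_limit(limits, slowdown_tolerance=0.05):
    """Try limits from highest to lowest; keep the lowest limit whose
    runtime stays within (1 + tolerance) of the uncapped baseline."""
    baseline = run_workload(max(limits))
    chosen = max(limits)
    for limit in sorted(limits, reverse=True):
        if run_workload(limit) <= baseline * (1.0 + slowdown_tolerance):
            chosen = limit
        else:
            break                     # cap is now hurting performance
    return chosen

limits = [260.0, 240.0, 220.0, 200.0, 180.0]
```

For this toy workload the loop walks the limit down to the knee and stops just before the cap starts costing runtime, which is the trade-off the paper's 5-49% savings exploit.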