PowerCoord: A Coordinated Power Capping Controller for Multi-CPU/GPU Servers

2018 Ninth International Green and Sustainable Computing Conference (IGSC), Pub Date: 2018-10-01, DOI: 10.1109/IGCC.2018.8752132

R. Azimi, Chao Jing, S. Reda

Citations: 8

Abstract

Modern supercomputers and cloud providers rely on server nodes that are equipped with multiple CPU sockets and general-purpose GPUs (GPGPUs) to handle the high demand for intensive computations. These servers consume much more power than commodity servers, and integrating them with the power capping systems used in modern clusters presents new challenges. In this paper, we propose a new power capping controller, PowerCoord, that is specifically designed for servers with multiple CPU and GPU sockets that run multiple jobs at a time. PowerCoord coordinates among the various power domains (e.g., CPU sockets and GPUs) inside a server to meet target power caps while seeking to maximize throughput. Our approach also takes job deadlines and priorities into consideration. Because performance modeling for co-located jobs is error-prone, PowerCoord uses a learning method to adapt to various workloads. PowerCoord has a number of heuristic policies for allocating power among the various CPUs and GPUs, and it uses reinforcement learning to select the best policy at runtime. Based on the observed state of the system, PowerCoord shifts the distribution of selected policies. We implement our power cap controller on a real multi-CPU/GPU server with low overhead, and we demonstrate that it is able to meet target power caps while maximizing throughput and balancing other demands such as priorities and deadlines. Compared to prior published techniques, our results show that PowerCoord improves throughput by an average of 14.4% under power caps.
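The abstract does not specify the heuristic policies or the learning rule, so the following is only a minimal sketch of the general idea of learning a distribution over power-allocation policies. Everything in it is an assumption made for illustration: the two policies (`equal_share`, `demand_proportional`), the `PolicySelector` class, the multiplicative-weights update, and the reward in [0, 1] (for example, normalized throughput) are hypothetical and are not taken from the paper.

```python
import math
import random

# Hypothetical heuristic policies (not from the paper). Each one splits a
# server-level power cap, in watts, across the server's power domains
# (e.g., CPU sockets and GPUs) given each domain's recent power demand.
def equal_share(cap, demand):
    return [cap / len(demand)] * len(demand)

def demand_proportional(cap, demand):
    total = sum(demand)
    if total == 0:
        return equal_share(cap, demand)
    return [cap * d / total for d in demand]

POLICIES = [equal_share, demand_proportional]

class PolicySelector:
    """Keeps a probability distribution over policies and shifts it
    toward policies that earned higher rewards (assumed update rule)."""

    def __init__(self, n_policies, learning_rate=0.5, seed=0):
        self.weights = [1.0 / n_policies] * n_policies
        self.learning_rate = learning_rate
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def choose(self):
        # Sample a policy index according to the current distribution.
        indices = range(len(self.weights))
        return self.rng.choices(indices, weights=self.weights)[0]

    def update(self, chosen, reward):
        # Multiplicative-weights step: reward is assumed to lie in [0, 1].
        self.weights[chosen] *= math.exp(self.learning_rate * reward)
        # Renormalize so the weights stay bounded over long runs.
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

# One control interval: pick a policy, apply its allocation, then feed
# back a reward measured after the interval (here a made-up constant).
selector = PolicySelector(len(POLICIES))
index = selector.choose()
allocation = POLICIES[index](300.0, [100.0, 50.0, 50.0])
selector.update(index, reward=0.8)
```

A real controller would replace the constant reward with a measured signal and would also have to account for the deadlines and priorities mentioned in the abstract, which this sketch ignores.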
