Adaptive demand response: Online learning of restless and controlled bandits

Qingsi Wang, M. Liu, J. Mathieu
{"title":"自适应需求响应:在线学习躁动与控制的土匪","authors":"Qingsi Wang, M. Liu, J. Mathieu","doi":"10.1109/SmartGridComm.2014.7007738","DOIUrl":null,"url":null,"abstract":"The capabilities of electric loads participating in load curtailment programs are often unknown until the loads have been told to curtail (i.e., deployed) and observed. In programs in which payments are made each time a load is deployed, we aim to pick the “best” loads to deploy in each time step. Our choice is a tradeoff between exploration and exploitation, i.e., curtailing poorly characterized loads in order to better characterize them in the hope of benefiting in the future versus curtailing well-characterized loads so that we benefit now. We formulate this problem as a multi-armed restless bandit problem with controlled bandits. In contrast to past work that has assumed all load parameters are known allowing the use of optimization approaches, we assume the parameters of the controlled system are unknown and develop an online learning approach. Our problem has two features not commonly addressed in the bandit literature: the arms/processes evolve according to different probabilistic laws depending on the control, and the reward/feedback observed by the decision-maker is the total realized curtailment, not the curtailment of each load. We develop an adaptive demand response learning algorithm and an extended version that works with aggregate feedback, both aimed at approximating the Whittle index policy. We show numerically that the regret of our algorithms with respect to the Whittle index policy is of logarithmic order in time, and significantly outperforms standard learning algorithms like UCB1.","PeriodicalId":6499,"journal":{"name":"2014 IEEE International Conference on Smart Grid Communications (SmartGridComm)","volume":"47 1","pages":"752-757"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":"{\"title\":\"Adaptive demand response: Online learning of restless and controlled bandits\",\"authors\":\"Qingsi Wang, M. Liu, J. Mathieu\",\"doi\":\"10.1109/SmartGridComm.2014.7007738\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The capabilities of electric loads participating in load curtailment programs are often unknown until the loads have been told to curtail (i.e., deployed) and observed. In programs in which payments are made each time a load is deployed, we aim to pick the “best” loads to deploy in each time step. Our choice is a tradeoff between exploration and exploitation, i.e., curtailing poorly characterized loads in order to better characterize them in the hope of benefiting in the future versus curtailing well-characterized loads so that we benefit now. We formulate this problem as a multi-armed restless bandit problem with controlled bandits. In contrast to past work that has assumed all load parameters are known allowing the use of optimization approaches, we assume the parameters of the controlled system are unknown and develop an online learning approach. Our problem has two features not commonly addressed in the bandit literature: the arms/processes evolve according to different probabilistic laws depending on the control, and the reward/feedback observed by the decision-maker is the total realized curtailment, not the curtailment of each load. 
We develop an adaptive demand response learning algorithm and an extended version that works with aggregate feedback, both aimed at approximating the Whittle index policy. We show numerically that the regret of our algorithms with respect to the Whittle index policy is of logarithmic order in time, and significantly outperforms standard learning algorithms like UCB1.\",\"PeriodicalId\":6499,\"journal\":{\"name\":\"2014 IEEE International Conference on Smart Grid Communications (SmartGridComm)\",\"volume\":\"47 1\",\"pages\":\"752-757\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"19\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 IEEE International Conference on Smart Grid Communications (SmartGridComm)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SmartGridComm.2014.7007738\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE International Conference on Smart Grid Communications (SmartGridComm)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SmartGridComm.2014.7007738","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 19

Abstract

The capabilities of electric loads participating in load curtailment programs are often unknown until the loads have been told to curtail (i.e., deployed) and observed. In programs in which payments are made each time a load is deployed, we aim to pick the “best” loads to deploy in each time step. Our choice is a tradeoff between exploration and exploitation, i.e., curtailing poorly characterized loads in order to better characterize them in the hope of benefiting in the future versus curtailing well-characterized loads so that we benefit now. We formulate this problem as a multi-armed restless bandit problem with controlled bandits. In contrast to past work that has assumed all load parameters are known allowing the use of optimization approaches, we assume the parameters of the controlled system are unknown and develop an online learning approach. Our problem has two features not commonly addressed in the bandit literature: the arms/processes evolve according to different probabilistic laws depending on the control, and the reward/feedback observed by the decision-maker is the total realized curtailment, not the curtailment of each load. We develop an adaptive demand response learning algorithm and an extended version that works with aggregate feedback, both aimed at approximating the Whittle index policy. We show numerically that the regret of our algorithms with respect to the Whittle index policy is of logarithmic order in time, and significantly outperforms standard learning algorithms like UCB1.
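
The abstract's baseline, UCB1, is easy to make concrete. Below is a minimal Python sketch of a UCB1-style policy that deploys m loads per time step; every quantity in it (n_loads, true_mean, the Gaussian observation noise) is an illustrative assumption, not a value from the paper. Note that UCB1 assumes stationary rewards and per-load feedback, both of which the restless, aggregate-feedback setting described above violates; that mismatch is why the authors' index-based algorithms outperform it.

```python
import numpy as np

# Minimal, hypothetical UCB1-style baseline for demand response:
# each step, deploy the m loads with the highest UCB index and
# observe each deployed load's curtailment individually.
# (The paper's harder setting has restless, control-dependent load
# dynamics and only aggregate feedback; this sketch ignores both.)

rng = np.random.default_rng(0)
n_loads, m, horizon = 10, 3, 5000
true_mean = rng.uniform(0.2, 1.0, n_loads)  # assumed per-load mean curtailment

counts = np.zeros(n_loads)  # number of times each load has been deployed
means = np.zeros(n_loads)   # empirical mean curtailment per load

for t in range(1, horizon + 1):
    # UCB1 index: empirical mean plus exploration bonus; undeployed loads first.
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1.0))
    index = np.where(counts == 0, np.inf, means + bonus)
    chosen = np.argsort(index)[-m:]  # deploy the m highest-index loads

    for i in chosen:
        # Noisy realized curtailment for load i (illustrative model).
        reward = rng.normal(true_mean[i], 0.1)
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]

print("estimated mean curtailment:", np.round(means, 2))
print("true mean curtailment:     ", np.round(true_mean, 2))
```

Approximating the Whittle index policy instead, as the paper proposes, would replace the index computation above with per-load Whittle indices derived from estimated transition parameters under each control; learning those parameters online from (possibly aggregate) curtailment feedback is the problem the paper addresses.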