Understanding the Role of Feedback in Online Learning with Switching Costs

Duo Cheng, Xingyu Zhou, Bo Ji
{"title":"理解反馈在有转换成本的在线学习中的作用","authors":"Duo Cheng, Xingyu Zhou, Bo Ji","doi":"10.48550/arXiv.2306.09588","DOIUrl":null,"url":null,"abstract":"In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\\widetilde{\\Theta}(T^{2/3})$ under bandit feedback and improves to $\\widetilde{\\Theta}(\\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\\mathrm{ex}} = O(T^{2/3})$, the regret remains $\\widetilde{\\Theta}(T^{2/3})$, but when $B_{\\mathrm{ex}} = \\Omega(T^{2/3})$, it becomes $\\widetilde{\\Theta}(T/\\sqrt{B_{\\mathrm{ex}}})$, which improves as the budget $B_{\\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\\widetilde{\\Theta}(T/\\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.","PeriodicalId":74529,"journal":{"name":"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning","volume":"172 1","pages":"5521-5543"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Understanding the Role of Feedback in Online Learning with Switching Costs\",\"authors\":\"Duo Cheng, Xingyu Zhou, Bo Ji\",\"doi\":\"10.48550/arXiv.2306.09588\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\\\\widetilde{\\\\Theta}(T^{2/3})$ under bandit feedback and improves to $\\\\widetilde{\\\\Theta}(\\\\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\\\\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\\\\mathrm{ex}} = O(T^{2/3})$, the regret remains $\\\\widetilde{\\\\Theta}(T^{2/3})$, but when $B_{\\\\mathrm{ex}} = \\\\Omega(T^{2/3})$, it becomes $\\\\widetilde{\\\\Theta}(T/\\\\sqrt{B_{\\\\mathrm{ex}}})$, which improves as the budget $B_{\\\\mathrm{ex}}$ increases. 
To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\\\\widetilde{\\\\Theta}(T/\\\\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.\",\"PeriodicalId\":74529,\"journal\":{\"name\":\"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning\",\"volume\":\"172 1\",\"pages\":\"5521-5543\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2306.09588\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2306.09588","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\widetilde{\Theta}(T^{2/3})$ under bandit feedback and improves to $\widetilde{\Theta}(\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\mathrm{ex}} = O(T^{2/3})$, the regret remains $\widetilde{\Theta}(T^{2/3})$, but when $B_{\mathrm{ex}} = \Omega(T^{2/3})$, it becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$, which improves as the budget $B_{\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\widetilde{\Theta}(T/\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.