Understanding the Role of Feedback in Online Learning with Switching Costs

Duo Cheng, Xingyu Zhou, Bo Ji
{"title":"理解反馈在有转换成本的在线学习中的作用","authors":"Duo Cheng, Xingyu Zhou, Bo Ji","doi":"10.48550/arXiv.2306.09588","DOIUrl":null,"url":null,"abstract":"In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\\widetilde{\\Theta}(T^{2/3})$ under bandit feedback and improves to $\\widetilde{\\Theta}(\\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\\mathrm{ex}} = O(T^{2/3})$, the regret remains $\\widetilde{\\Theta}(T^{2/3})$, but when $B_{\\mathrm{ex}} = \\Omega(T^{2/3})$, it becomes $\\widetilde{\\Theta}(T/\\sqrt{B_{\\mathrm{ex}}})$, which improves as the budget $B_{\\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\\widetilde{\\Theta}(T/\\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.","PeriodicalId":74529,"journal":{"name":"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning","volume":"172 1","pages":"5521-5543"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Understanding the Role of Feedback in Online Learning with Switching Costs\",\"authors\":\"Duo Cheng, Xingyu Zhou, Bo Ji\",\"doi\":\"10.48550/arXiv.2306.09588\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\\\\widetilde{\\\\Theta}(T^{2/3})$ under bandit feedback and improves to $\\\\widetilde{\\\\Theta}(\\\\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\\\\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\\\\mathrm{ex}} = O(T^{2/3})$, the regret remains $\\\\widetilde{\\\\Theta}(T^{2/3})$, but when $B_{\\\\mathrm{ex}} = \\\\Omega(T^{2/3})$, it becomes $\\\\widetilde{\\\\Theta}(T/\\\\sqrt{B_{\\\\mathrm{ex}}})$, which improves as the budget $B_{\\\\mathrm{ex}}$ increases. 
To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\\\\widetilde{\\\\Theta}(T/\\\\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.\",\"PeriodicalId\":74529,\"journal\":{\"name\":\"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning\",\"volume\":\"172 1\",\"pages\":\"5521-5543\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2306.09588\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2306.09588","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\widetilde{\Theta}(T^{2/3})$ under bandit feedback and improves to $\widetilde{\Theta}(\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\mathrm{ex}} = O(T^{2/3})$, the regret remains $\widetilde{\Theta}(T^{2/3})$, but when $B_{\mathrm{ex}} = \Omega(T^{2/3})$, it becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$, which improves as the budget $B_{\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\widetilde{\Theta}(T/\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.