Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning

IF 1.4 | CAS Tier 4 (Mathematics) | JCR Q2 (Mathematics, Applied) | Information and Inference: A Journal of the IMA | Pub Date: 2022-08-01 | DOI: 10.1093/imaiai/iaac034
Gen Li, Laixi Shi, Yuxin Chen, Yuejie Chi
{"title":"打破样本复杂性障碍后悔最优无模型强化学习","authors":"Gen Li;Laixi Shi;Yuxin Chen;Yuejie Chi","doi":"10.1093/imaiai/iaac034","DOIUrl":null,"url":null,"abstract":"Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with \n<tex>$S$</tex>\n states, \n<tex>$A$</tex>\n actions and horizon length \n<tex>$H$</tex>\n, substantial progress has been achieved toward characterizing the minimax-optimal regret, which scales on the order of \n<tex>$\\sqrt{H^2SAT}$</tex>\n (modulo log factors) with \n<tex>$T$</tex>\n the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. \n<tex>$S^6A^4 \\,\\mathrm{poly}(H)$</tex>\n for existing model-free methods).To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity \n<tex>$O(SAH)$</tex>\n, that achieves near-optimal regret as soon as the sample size exceeds the order of \n<tex>$SA\\,\\mathrm{poly}(H)$</tex>\n. In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves—by at least a factor of \n<tex>$S^5A^3$</tex>\n—upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.","PeriodicalId":45437,"journal":{"name":"Information and Inference-A Journal of the Ima","volume":"12 2","pages":"969-1043"},"PeriodicalIF":1.4000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8016800/10058586/10058618.pdf","citationCount":"35","resultStr":"{\"title\":\"Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning\",\"authors\":\"Gen Li;Laixi Shi;Yuxin Chen;Yuejie Chi\",\"doi\":\"10.1093/imaiai/iaac034\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with \\n<tex>$S$</tex>\\n states, \\n<tex>$A$</tex>\\n actions and horizon length \\n<tex>$H$</tex>\\n, substantial progress has been achieved toward characterizing the minimax-optimal regret, which scales on the order of \\n<tex>$\\\\sqrt{H^2SAT}$</tex>\\n (modulo log factors) with \\n<tex>$T$</tex>\\n the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. 
\\n<tex>$S^6A^4 \\\\,\\\\mathrm{poly}(H)$</tex>\\n for existing model-free methods).To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity \\n<tex>$O(SAH)$</tex>\\n, that achieves near-optimal regret as soon as the sample size exceeds the order of \\n<tex>$SA\\\\,\\\\mathrm{poly}(H)$</tex>\\n. In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves—by at least a factor of \\n<tex>$S^5A^3$</tex>\\n—upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.\",\"PeriodicalId\":45437,\"journal\":{\"name\":\"Information and Inference-A Journal of the Ima\",\"volume\":\"12 2\",\"pages\":\"969-1043\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2022-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/iel7/8016800/10058586/10058618.pdf\",\"citationCount\":\"35\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information and Inference-A Journal of the Ima\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10058618/\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICS, APPLIED\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Inference-A Journal of the Ima","FirstCategoryId":"100","ListUrlMain":"https://ieeexplore.ieee.org/document/10058618/","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
Citations: 35

Abstract

Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. For a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been made toward characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors), with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4\,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves upon any prior memory-efficient algorithm that is asymptotically regret-optimal by at least a factor of $S^5A^3$. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.
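To make the abstract's description concrete, the sketch below shows a minimal tabular setup with two Q-learning sequences, one tracking an upper confidence bound and one a lower confidence bound on the optimal Q-function, together with a rule that freezes a reference value once the two bounds are close. This is only an illustration of the general idea, not the paper's algorithm or analysis: the synthetic MDP, the bonus scale `c_b`, the learning rate, and the settling tolerance `tol` are all assumptions made for this example.

```python
import numpy as np

# Minimal illustrative sketch (not the paper's exact algorithm or constants):
# optimistic tabular Q-learning on a tiny synthetic episodic MDP that keeps
# an upper confidence bound (Q_ucb) and a lower confidence bound (Q_lcb) on
# the optimal Q-function, and "settles" a reference value once the two
# bounds agree, loosely mirroring the early-settled reference update rule.

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5                              # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))     # transition kernel: P[s, a] is a distribution over S
R = rng.uniform(size=(S, A))                   # mean rewards in [0, 1]

Q_ucb = np.full((H, S, A), float(H))           # optimistic estimate of Q*
Q_lcb = np.zeros((H, S, A))                    # pessimistic estimate of Q*
Q_ref = np.full((H, S, A), float(H))           # settled reference values
settled = np.zeros((H, S, A), dtype=bool)
N = np.zeros((H, S, A))                        # visit counts
c_b, tol = 1.0, 1.0                            # bonus scale and settling tolerance (illustrative)

def run_episode():
    s = 0
    for h in range(H):
        a = int(np.argmax(Q_ucb[h, s]))        # act greedily w.r.t. the optimistic Q
        r = R[s, a]
        s_next = rng.choice(S, p=P[s, a])
        N[h, s, a] += 1
        n = N[h, s, a]
        eta = (H + 1) / (H + n)                # learning rate used in optimistic Q-learning analyses
        bonus = c_b * np.sqrt(H ** 3 / n)      # Hoeffding-style exploration bonus
        v_ucb = Q_ucb[h + 1, s_next].max() if h + 1 < H else 0.0
        v_lcb = Q_lcb[h + 1, s_next].max() if h + 1 < H else 0.0
        # Keep the two confidence sequences monotone: the UCB only decreases, the LCB only increases.
        Q_ucb[h, s, a] = min(Q_ucb[h, s, a],
                             (1 - eta) * Q_ucb[h, s, a] + eta * (r + v_ucb + bonus))
        Q_lcb[h, s, a] = max(Q_lcb[h, s, a],
                             (1 - eta) * Q_lcb[h, s, a] + eta * (r + v_lcb - bonus))
        # Early settlement: once the bounds agree up to tol, freeze the reference value.
        if not settled[h, s, a] and Q_ucb[h, s, a] - Q_lcb[h, s, a] <= tol:
            Q_ref[h, s, a] = Q_ucb[h, s, a]
            settled[h, s, a] = True
        s = s_next

for _ in range(2000):
    run_episode()
print("fraction of (h, s, a) triples with a settled reference:", settled.mean())
```

The sketch only covers the UCB/LCB bookkeeping and the settlement rule; the paper's algorithm additionally plugs the settled reference into a variance-reduced (reference-advantage) Q-update, which is what yields the $SA\,\mathrm{poly}(H)$ burn-in cost and the $O(SAH)$ memory footprint.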