{"title":"Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation","authors":"Woojin Chae, Dabeen Lee","doi":"arxiv-2409.10772","DOIUrl":null,"url":null,"abstract":"This paper proposes a computationally tractable algorithm for learning\ninfinite-horizon average-reward linear Markov decision processes (MDPs) and\nlinear mixture MDPs under the Bellman optimality condition. While guaranteeing\ncomputational efficiency, our algorithm for linear MDPs achieves the best-known\nregret upper bound of\n$\\widetilde{\\mathcal{O}}(d^{3/2}\\mathrm{sp}(v^*)\\sqrt{T})$ over $T$ time steps\nwhere $\\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$\nis the dimension of the feature mapping. For linear mixture MDPs, our algorithm\nattains a regret bound of\n$\\widetilde{\\mathcal{O}}(d\\cdot\\mathrm{sp}(v^*)\\sqrt{T})$. The algorithm\napplies novel techniques to control the covering number of the value function\nclass and the span of optimistic estimators of the value function, which is of\nindependent interest.","PeriodicalId":501286,"journal":{"name":"arXiv - MATH - Optimization and Control","volume":"3 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - MATH - Optimization and Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10772","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps, where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d\cdot\mathrm{sp}(v^*)\sqrt{T})$. The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which are of independent interest.
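
For reference, the quantities in the regret bounds follow the standard conventions of the average-reward linear MDP literature; the sketch below states those usual definitions for orientation and is not a restatement of the paper's exact assumptions. The regret over $T$ steps compares the reward collected by the learner with the optimal long-run average reward $J^*$, and the span of the optimal bias function $v^*$ measures its range:
$$\mathrm{Regret}(T) = \sum_{t=1}^{T}\bigl(J^* - r(s_t, a_t)\bigr), \qquad \mathrm{sp}(v^*) = \max_{s} v^*(s) - \min_{s} v^*(s),$$
where $(J^*, v^*)$ satisfy the Bellman optimality equation $J^* + v^*(s) = \max_{a}\bigl\{ r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[v^*(s')] \bigr\}$. In the standard formulations of the two model classes, a known $d$-dimensional feature map parameterizes the model: a linear MDP assumes $r(s,a) = \langle \phi(s,a), \theta \rangle$ and $P(s' \mid s,a) = \langle \phi(s,a), \mu(s') \rangle$ for a feature map $\phi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$, while a linear mixture MDP assumes $P(s' \mid s,a) = \langle \psi(s,a,s'), \theta^* \rangle$ for a known feature map $\psi$ and an unknown parameter $\theta^* \in \mathbb{R}^d$.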