{"title":"Minimax weight learning for absorbing MDPs","authors":"Fengying Li, Yuqiang Li, Xianyi Wu","doi":"10.1007/s00362-023-01491-4","DOIUrl":null,"url":null,"abstract":"<p>Reinforcement learning policy evaluation problems are often modeled as finite or discounted/averaged infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given the dataset consisting of i.i.d episodes under a given truncation level, we propose an algorithm (referred to as MWLA in the text) to directly estimate the expected return via the importance ratio of the state-action occupancy measure. The Mean Square Error (MSE) bound of the MWLA method is provided and the dependence of statistical errors on the data size and the truncation level are analyzed. The performance of the algorithm is illustrated by means of computational experiments under an episodic taxi environment</p>","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"43 1","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Papers","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1007/s00362-023-01491-4","RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
引用次数: 0
Abstract
Reinforcement learning policy evaluation problems are often modeled as finite or discounted/averaged infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given the dataset consisting of i.i.d episodes under a given truncation level, we propose an algorithm (referred to as MWLA in the text) to directly estimate the expected return via the importance ratio of the state-action occupancy measure. The Mean Square Error (MSE) bound of the MWLA method is provided and the dependence of statistical errors on the data size and the truncation level are analyzed. The performance of the algorithm is illustrated by means of computational experiments under an episodic taxi environment
期刊介绍:
The journal Statistical Papers addresses itself to all persons and organizations that have to deal with statistical methods in their own field of work. It attempts to provide a forum for the presentation and critical assessment of statistical methods, in particular for the discussion of their methodological foundations as well as their potential applications. Methods that have broad applications will be preferred. However, special attention is given to those statistical methods which are relevant to the economic and social sciences. In addition to original research papers, readers will find survey articles, short notes, reports on statistical software, problem section, and book reviews.