{"title":"Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games","authors":"Yuling Yan, Gen Li, Yuxin Chen, Jianqing Fan","doi":"10.1287/opre.2022.0342","DOIUrl":null,"url":null,"abstract":"<p>This paper makes progress toward learning Nash equilibria in two-player, zero-sum Markov games from offline data. Specifically, consider a <i>γ</i>-discounted, infinite-horizon Markov game with <i>S</i> states, in which the max-player has <i>A</i> actions and the min-player has <i>B</i> actions. We propose a pessimistic model–based algorithm with Bernstein-style lower confidence bounds—called the value iteration with lower confidence bounds for zero-sum Markov games—that provably finds an <i>ε</i>-approximate Nash equilibrium with a sample complexity no larger than <span><math altimg=\"eq-00001.gif\" display=\"inline\" overflow=\"scroll\"><mrow><mfrac><mrow><msubsup><mrow><mi>C</mi></mrow><mrow><mtext mathvariant=\"sans-serif\">clipped</mtext></mrow><mi>⋆</mi></msubsup><mi>S</mi><mo stretchy=\"false\">(</mo><mi>A</mi><mo>+</mo><mi>B</mi><mo stretchy=\"false\">)</mo></mrow><mrow><msup><mrow><mo stretchy=\"false\">(</mo><mn>1</mn><mo>−</mo><mi>γ</mi><mo stretchy=\"false\">)</mo></mrow><mn>3</mn></msup><msup><mrow><mi>ε</mi></mrow><mn>2</mn></msup></mrow></mfrac></mrow></math></span><span></span> (up to some log factor). Here, <span><math altimg=\"eq-00002.gif\" display=\"inline\" overflow=\"scroll\"><mrow><msubsup><mrow><mi>C</mi></mrow><mrow><mtext mathvariant=\"sans-serif\">clipped</mtext></mrow><mi>⋆</mi></msubsup></mrow></math></span><span></span> is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy <i>ε</i> can be any value within <span><math altimg=\"eq-00003.gif\" display=\"inline\" overflow=\"scroll\"><mrow><mrow><mo>(</mo><mrow><mn>0</mn><mo>,</mo><mfrac><mn>1</mn><mrow><mn>1</mn><mo>−</mo><mi>γ</mi></mrow></mfrac></mrow><mo>]</mo></mrow></mrow></math></span><span></span>. Our sample complexity bound strengthens prior art by a factor of <span><math altimg=\"eq-00004.gif\" display=\"inline\" overflow=\"scroll\"><mrow><mi>min</mi><mo stretchy=\"false\">{</mo><mi>A</mi><mo>,</mo><mi>B</mi><mo stretchy=\"false\">}</mo></mrow></math></span><span></span>, achieving minimax optimality for a broad regime of interest. An appealing feature of our result lies in its algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.</p><p><b>Funding:</b> Y. Yan is supported in part by the Charlotte Elizabeth Procter Honorific Fellowship from Princeton University and the Norbert Wiener Postdoctoral Fellowship from MIT. Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the Air Force Office of Scientific Research [Grant FA9550-22-1-0198], the Office of Naval Research [Grant N00014-22-1-2354], and the National Science Foundation [Grants CCF-2221009, CCF-1907661, IIS-2218713, DMS-2014279, and IIS-2218773]. J. 
Fan is supported in part by the National Science Foundation [Grants DMS-1712591, DMS-2052926, DMS-2053832, and DMS-2210833] and Office of Naval Research [Grant N00014-22-1-2340].</p><p><b>Supplemental Material:</b> The online appendix is available at https://doi.org/10.1287/opre.2022.0342.</p>","PeriodicalId":54680,"journal":{"name":"Operations Research","volume":"1 1","pages":""},"PeriodicalIF":2.2000,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Operations Research","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1287/opre.2022.0342","RegionNum":3,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MANAGEMENT","Score":null,"Total":0}
Citations: 0
Abstract
This paper makes progress toward learning Nash equilibria in two-player, zero-sum Markov games from offline data. Specifically, consider a γ-discounted, infinite-horizon Markov game with S states, in which the max-player has A actions and the min-player has B actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called value iteration with lower confidence bounds for zero-sum Markov games, that provably finds an ε-approximate Nash equilibrium with a sample complexity no larger than $\frac{C^{\star}_{\mathsf{clipped}}\, S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C^{\star}_{\mathsf{clipped}}$ is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy ε can be any value within $\big(0, \tfrac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A, B\}$, achieving minimax optimality for a broad regime of interest. An appealing feature of our result is its algorithmic simplicity, which reveals that neither variance reduction nor sample splitting is needed to achieve sample optimality.
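The abstract describes the method only at a high level. The following Python sketch illustrates the general flavor of pessimistic, model-based value iteration with Bernstein-style lower confidence bounds for a zero-sum Markov game; it is a simplified illustration under assumptions (offline data summarized as transition counts, rewards in [0, 1], a generic bonus constant `c_bonus`, and a linear-programming matrix-game solver), not a faithful implementation of the paper's VI-LCB algorithm or its exact bonuses and clipping.

```python
# Minimal sketch: pessimistic value iteration with Bernstein-style lower confidence
# bounds on an empirical model of a two-player zero-sum Markov game.
# Illustrative only; constants and the bonus form are simplifications.

import numpy as np
from scipy.optimize import linprog


def matrix_game_value(M):
    """Value of the zero-sum matrix game max_x min_y x^T M y, solved as an LP."""
    A, B = M.shape
    # Decision variables: (v, x_1, ..., x_A); maximize v  <=>  minimize -v.
    c = np.zeros(A + 1)
    c[0] = -1.0
    # For every min-player action b: v - (M^T x)_b <= 0.
    A_ub = np.hstack([np.ones((B, 1)), -M.T])
    b_ub = np.zeros(B)
    # x lies on the probability simplex: sum_a x_a = 1, x_a >= 0.
    A_eq = np.hstack([[0.0], np.ones(A)]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * A
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun


def vi_lcb_zero_sum(counts, rewards, gamma, n_iters=200, c_bonus=1.0):
    """Pessimistic (from the max-player's view) value iteration on an empirical model.

    counts : (S, A, B, S) array of transition counts from the offline dataset
    rewards: (S, A, B) array of rewards in [0, 1]
    Returns a lower-confidence estimate of the game value at each state.
    """
    S, A, B, _ = counts.shape
    n_sab = np.maximum(counts.sum(axis=-1), 1)        # visit counts N(s, a, b)
    P_hat = counts / n_sab[..., None]                 # empirical transition kernel
    V = np.zeros(S)
    for _ in range(n_iters):
        EV = P_hat @ V                                # (S, A, B) expected next value
        VarV = np.maximum(P_hat @ (V ** 2) - EV ** 2, 0.0)   # empirical variance of V
        # Bernstein-style penalty: larger when the value is noisier or data are scarce.
        bonus = c_bonus * (np.sqrt(VarV / n_sab) + 1.0 / ((1 - gamma) * n_sab))
        Q_lcb = np.clip(rewards + gamma * EV - bonus, 0.0, 1.0 / (1 - gamma))
        # One step of value iteration: Nash value of the pessimistic matrix game per state.
        V = np.array([matrix_game_value(Q_lcb[s]) for s in range(S)])
    return V
```

The pessimism enters through the subtracted bonus, which shrinks the Q-estimates at state-action pairs that the offline data cover poorly; the per-state matrix-game solve replaces the max over actions used in single-agent value iteration.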
Funding: Y. Yan is supported in part by the Charlotte Elizabeth Procter Honorific Fellowship from Princeton University and the Norbert Wiener Postdoctoral Fellowship from MIT. Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the Air Force Office of Scientific Research [Grant FA9550-22-1-0198], the Office of Naval Research [Grant N00014-22-1-2354], and the National Science Foundation [Grants CCF-2221009, CCF-1907661, IIS-2218713, DMS-2014279, and IIS-2218773]. J. Fan is supported in part by the National Science Foundation [Grants DMS-1712591, DMS-2052926, DMS-2053832, and DMS-2210833] and Office of Naval Research [Grant N00014-22-1-2340].
Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.0342.
About the Journal
Operations Research publishes quality operations research and management science works of interest to the OR practitioner and researcher in three substantive categories: methods, data-based operational science, and the practice of OR. The journal seeks papers reporting underlying data-based principles of operational science, observations and modeling of operating systems, contributions to the methods and models of OR, case histories of applications, review articles, and discussions of the administrative environment, history, policy, practice, future, and arenas of application of operations research.