Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

IF 2.2 | Zone 3, Management | JCR Q3, MANAGEMENT | Operations Research | Pub Date: 2024-04-02 | DOI: 10.1287/opre.2022.0342
Yuling Yan, Gen Li, Yuxin Chen, Jianqing Fan
Citations: 0

Abstract

This paper makes progress toward learning Nash equilibria in two-player, zero-sum Markov games from offline data. Specifically, consider a γ-discounted, infinite-horizon Markov game with S states, in which the max-player has A actions and the min-player has B actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called the value iteration with lower confidence bounds for zero-sum Markov games, that provably finds an ε-approximate Nash equilibrium with a sample complexity no larger than $\frac{C^{\star}_{\mathsf{clipped}} S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C^{\star}_{\mathsf{clipped}}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy ε can be any value within $\left(0, \frac{1}{1-\gamma}\right]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for a broad regime of interest. An appealing feature of our result lies in its algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.
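
To make the scaling of the bound concrete, the following is a minimal sketch that evaluates only the leading-order term $C^{\star}_{\mathsf{clipped}} S(A+B) / ((1-\gamma)^{3}\varepsilon^{2})$ stated in the abstract. The function name `sample_complexity_bound`, its parameter names, and the numeric inputs are hypothetical and chosen for illustration. Log factors and absolute constants are dropped, so the result indicates an order of magnitude rather than a sample count prescribed by the paper.

```python
import math


def sample_complexity_bound(c_clipped, S, A, B, gamma, eps):
    """Leading-order term C * S * (A + B) / ((1 - gamma)^3 * eps^2).

    Illustrative only: log factors and constants are omitted, and all
    names here are assumptions rather than notation from the paper's code.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    horizon = 1.0 / (1.0 - gamma)  # effective horizon 1 / (1 - gamma)
    # The abstract allows any target accuracy eps in (0, 1 / (1 - gamma)].
    if not 0.0 < eps <= horizon:
        raise ValueError("eps must lie in (0, 1 / (1 - gamma)]")
    return c_clipped * S * (A + B) * horizon**3 / eps**2


# Made-up inputs: 10 states, 4 max-player actions, 6 min-player actions.
bound = sample_complexity_bound(c_clipped=2.0, S=10, A=4, B=6, gamma=0.9, eps=0.5)
halved = sample_complexity_bound(c_clipped=2.0, S=10, A=4, B=6, gamma=0.9, eps=0.25)

print(f"leading-order bound at eps=0.5:  {bound:.3g}")
print(f"leading-order bound at eps=0.25: {halved:.3g}")
print(f"ratio after halving eps: {halved / bound:.1f}")
```

The sketch shows the two dependencies the abstract emphasizes: the term grows with A + B rather than with the product of the two action-set sizes, and halving ε multiplies it by four.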

Funding: Y. Yan is supported in part by the Charlotte Elizabeth Procter Honorific Fellowship from Princeton University and the Norbert Wiener Postdoctoral Fellowship from MIT. Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the Air Force Office of Scientific Research [Grant FA9550-22-1-0198], the Office of Naval Research [Grant N00014-22-1-2354], and the National Science Foundation [Grants CCF-2221009, CCF-1907661, IIS-2218713, DMS-2014279, and IIS-2218773]. J. Fan is supported in part by the National Science Foundation [Grants DMS-1712591, DMS-2052926, DMS-2053832, and DMS-2210833] and Office of Naval Research [Grant N00014-22-1-2340].

Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.0342.

Source journal: Operations Research (Management Science: Operations Research and Management Science)
CiteScore: 4.80
Self-citation rate: 14.80%
Articles published: 237
Review time: 15 months
Journal description: Operations Research publishes quality operations research and management science works of interest to the OR practitioner and researcher in three substantive categories: methods, data-based operational science, and the practice of OR. The journal seeks papers reporting underlying data-based principles of operational science, observations and modeling of operating systems, contributions to the methods and models of OR, case histories of applications, review articles, and discussions of the administrative environment, history, policy, practice, future, and arenas of application of operations research.