Purified Policy Space Response Oracles for Symmetric Zero-Sum Games

IF 8.9; Zone 1, Computer Science; JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; IEEE Transactions on Neural Networks and Learning Systems; Pub Date: 2024-12-12; DOI: 10.1109/TNNLS.2024.3457509

Zhengdao Shao; Liansheng Zhuang; Yihong Huang; Houqiang Li; Shafei Wang

Volume 36, issue 6, pages 11258-11270. Publication type: Journal Article. Open access: no. Source: https://ieeexplore.ieee.org/document/10795441/

Citations: 0

Abstract

Policy space response oracles (PSRO) is a promising tool for finding an approximate Nash equilibrium (NE) in a two-player zero-sum game. It solves for the equilibrium by iteratively expanding a small-scale meta-game formed by a restricted strategy population, which consists of historical approximate best responses from earlier meta-games. However, because these best responses are strongly correlated with each other, existing PSRO and its variants often exhibit slow growth in the diversity of the strategy population, and thus suffer from poor exploration efficiency and a slow convergence rate. To address this problem, this article proposes Purified PSRO, which deliberately maintains a pure strategy population formed by pure strategy bases of approximate best responses. A novel module, non-best response suppression (NBRS), is introduced to calculate a pure strategy base with better orthogonality, which is used to expand the strategy population at each epoch. In this way, Purified PSRO can quickly increase the diversity of the strategy population, thus greatly enhancing the efficiency of exploration. Theoretically, we prove the convergence of Purified PSRO. Moreover, we introduce an early stop module to reduce computation cost, and give an upper bound on the exploitability when the algorithm stops early. Extensive experiments on random games of skill (RGoS) and real-world meta-games show that Purified PSRO consistently outperforms existing state-of-the-art (SOTA) methods, sometimes by a large margin.
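For orientation, the basic PSRO loop that the abstract builds on can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Purified PSRO: the abstract gives no details of NBRS or the early stop module, so neither is implemented here. The sketch assumes the game is given as an explicit antisymmetric payoff matrix `A` (a symmetric zero-sum game), solves each meta-game with fictitious play, and uses exact enumeration as the best-response oracle. The function names `payoff_vs`, `fictitious_play`, and `psro` are hypothetical.

```python
def payoff_vs(A, rows, support, sigma):
    """Expected payoff of each pure strategy in `rows` against the
    mixture `sigma` placed on the pure strategies in `support`."""
    return [sum(A[i][j] * p for j, p in zip(support, sigma)) for i in rows]


def fictitious_play(A, pop, iters=2000):
    """Approximate a symmetric NE of the meta-game restricted to `pop`.
    Assumption: fictitious play is the meta-solver (any NE solver would do)."""
    counts = [0] * len(pop)
    counts[0] = 1
    for _ in range(iters):
        total = sum(counts)
        sigma = [c / total for c in counts]
        values = payoff_vs(A, pop, pop, sigma)
        # Play a best response to the empirical mixture so far.
        counts[values.index(max(values))] += 1
    total = sum(counts)
    return [c / total for c in counts]


def psro(A, start=0, max_epochs=50, tol=1e-2):
    """Vanilla PSRO on a symmetric zero-sum matrix game.
    Returns the population, the meta-strategy over it, and its exploitability."""
    pop = [start]
    for _ in range(max_epochs):
        sigma = fictitious_play(A, pop)
        # Oracle step: evaluate every pure strategy of the full game.
        values = payoff_vs(A, range(len(A)), pop, sigma)
        # The value of a symmetric zero-sum game is 0, so the best
        # deviation payoff is the exploitability of the meta-strategy.
        exploitability = max(values)
        best_response = values.index(exploitability)
        if exploitability <= tol or best_response in pop:
            break
        pop.append(best_response)
    return pop, sigma, exploitability


if __name__ == "__main__":
    # Rock-paper-scissors: A[i][j] is the payoff of row i against column j.
    rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
    population, meta_strategy, expl = psro(rps)
    print(population, meta_strategy, expl)
```

In this sketch each epoch adds one best response to the current meta-strategy, which is the population-growth step whose slow diversity gain the abstract identifies as the bottleneck.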




Source journal

IEEE Transactions on Neural Networks and Learning Systems
Categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

CiteScore: 23.80
Self-citation rate: 9.60%
Articles published: 2102
Review time: 3-8 weeks

Journal description: The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.