Social Learning in Multi Agent Multi Armed Bandits

Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems. Pub Date: 2020-06-08. DOI: 10.1145/3393691.3394217

Abishek Sankararaman, A. Ganesh, S. Shakkottai



Abstract

We introduce a novel decentralized, multi-agent version of the classical Multi-Armed Bandit (MAB) problem, in which n agents collaboratively and simultaneously solve the same instance of a K-armed MAB to minimize individual regret. The agents can communicate and collaborate with each other only through a pairwise, asynchronous, gossip-based protocol that exchanges a limited number of bits. In our model, agents at each point decide (i) which arm to play, (ii) whether to communicate, and if so, (iii) what to communicate and with whom. We develop a novel algorithm in which agents, whenever they choose to communicate, send only arm-ids and not samples to another agent chosen uniformly and independently at random. The per-agent regret achieved by our algorithm is O((⌈K/n⌉ + log(n))/Δ · log(T)), where Δ is the difference between the means of the best and second-best arms. Furthermore, any agent in our algorithm communicates (arm-ids to a uniformly and independently chosen agent) only a total of Θ(log(T)) times over a time interval of T. We compare our results to two benchmarks: one with no communication among agents, and one corresponding to complete interaction, where each agent has access to the entire system history of arms played and rewards obtained by all agents. We show, both theoretically and empirically, that our algorithm achieves a significant reduction both in per-agent regret, compared to the case where agents do not collaborate and each plays the standard MAB problem (where regret would scale linearly in K), and in communication complexity, compared to the full-interaction setting, which requires T communication attempts by an agent over T arm pulls. Our result thus demonstrates that even a minimal level of collaboration among the agents enables a significant reduction in per-agent regret.
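
The setting in the abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm: it is a simplified stand-in that assumes Bernoulli rewards, UCB1 play over a per-agent active set of arms, a round-robin initial split of the K arms, and gossip at doubling time steps (2, 4, 8, ...) so that each agent sends roughly log2(T) messages. The names `simulate` and `ucb_index`, and the rule of recommending the most-played arm, are hypothetical choices made for illustration. Unlike a scheme designed for the stated regret bound, this sketch never drops arms from an active set.

```python
import math
import random


def ucb_index(mean, pulls, t):
    """UCB1 index; arms not yet played get +inf so they are tried first."""
    if pulls == 0:
        return float("inf")
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def simulate(arm_means, n_agents, horizon, seed=0):
    """Illustrative gossip bandit; returns (regret per agent, messages sent per agent)."""
    rng = random.Random(seed)
    k = len(arm_means)
    best = max(arm_means)
    # Assumption: round-robin split, so each agent starts with about ceil(K/n) arms.
    active = [set(range(a, k, n_agents)) or {a % k} for a in range(n_agents)]
    pulls = [[0] * k for _ in range(n_agents)]
    sums = [[0.0] * k for _ in range(n_agents)]
    regret = [0.0] * n_agents
    messages = [0] * n_agents
    next_gossip = 2  # Assumption: gossip at t = 2, 4, 8, ... (about log2(T) rounds).

    for t in range(1, horizon + 1):
        for a in range(n_agents):
            # Play the active arm with the largest UCB index.
            arm = max(
                sorted(active[a]),
                key=lambda i: ucb_index(
                    sums[a][i] / pulls[a][i] if pulls[a][i] else 0.0,
                    pulls[a][i],
                    t,
                ),
            )
            reward = 1.0 if rng.random() < arm_means[arm] else 0.0
            pulls[a][arm] += 1
            sums[a][arm] += reward
            regret[a] += best - arm_means[arm]

        if t == next_gossip:
            # Each agent recommends its most-played active arm (an arm id only).
            recs = [
                max(sorted(active[a]), key=lambda i: pulls[a][i])
                for a in range(n_agents)
            ]
            for a in range(n_agents):
                if n_agents > 1:
                    # Recipient chosen uniformly at random among the other agents.
                    peer = rng.choice([b for b in range(n_agents) if b != a])
                    active[peer].add(recs[a])
                    messages[a] += 1
            next_gossip *= 2

    return regret, messages


if __name__ == "__main__":
    regret, messages = simulate([0.9] + [0.5] * 9, n_agents=5, horizon=2000, seed=1)
    print("per-agent regret:", [round(r, 1) for r in regret])
    print("messages sent per agent:", messages)
```

With a horizon of 2000, the doubling schedule fires at 2, 4, ..., 1024, so each agent sends 10 arm-ids in total, compared with 2000 communication attempts in a full-interaction setting. Only the identity of an arm crosses a link; no reward samples are shared.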
