Malcolm: Multi-agent Learning for Cooperative Load Management at Rack Scale

Proceedings of the ACM on Measurement and Analysis of Computing Systems. Pub Date: 2022-12-01. DOI: 10.1145/3570611

Ali Hossein Abbasi Abyaneh, Maizi Liao, S. Zahedi


Citations: 1

Abstract

We consider the problem of balancing the load among servers in dense racks for microsecond-scale workloads. To balance the load in such settings, tens of millions of scheduling decisions have to be made per second. Achieving this throughput while providing microsecond-scale latency and high availability is extremely challenging. To address this challenge, we design a fully decentralized load-balancing framework in which servers collectively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. To find the game's parametric Nash equilibrium, we design and implement a decentralized algorithm based on multi-agent-learning theory. We empirically show that our proposed algorithm is adaptive and scalable while outperforming state-of-the-art alternatives. In homogeneous settings, Malcolm performs as well as the best alternative among other baselines. In heterogeneous settings at lower loads, Malcolm improves tail latency by up to a factor of four compared to other baselines. For the same tail latency, Malcolm achieves up to 60% more throughput than the best alternative among other baselines.
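To make the throughput requirement concrete, the short Python sketch below works out the average time available per scheduling decision. It is only back-of-the-envelope arithmetic, not part of Malcolm: the function name `decision_budget_ns`, the rate of 10 million decisions per second (one reading of "tens of millions"), and the rack size of 16 servers are illustrative assumptions, and the sketch assumes decisions are spread evenly across the deciders.

```python
def decision_budget_ns(decisions_per_second: int, deciders: int = 1) -> float:
    """Average time per scheduling decision, in nanoseconds.

    Assumes (hypothetically) that the total decision rate is split evenly
    across `deciders` independent schedulers working in parallel.
    """
    if decisions_per_second <= 0 or deciders <= 0:
        raise ValueError("rates and decider counts must be positive")
    return deciders * 1e9 / decisions_per_second


# A single central scheduler handling an assumed 10 million decisions/s.
print(decision_budget_ns(10_000_000))               # 100.0

# The same rate shared by an assumed 16 servers deciding for themselves.
print(decision_budget_ns(10_000_000, deciders=16))  # 1600.0
```

Under these assumptions a single scheduler has roughly 100 ns per decision, while each of 16 cooperating servers has about 1.6 µs, which suggests why a fully decentralized design is attractive at this scale.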




Source journal

Proceedings of the ACM on Measurement and Analysis of Computing Systems

CiteScore

3.20

Self-citation rate

0.00%

Number of publications