求解大型图优化问题的分布式gpu深度强化学习系统

IF 1.7 Q3 COMPUTER SCIENCE, THEORY & METHODS ACM Transactions on Parallel Computing Pub Date : 2023-03-23 DOI:10.1145/3589188

Weijian Zheng, Dali Wang, Fengguang Song

{"title":"求解大型图优化问题的分布式gpu深度强化学习系统","authors":"Weijian Zheng, Dali Wang, Fengguang Song","doi":"10.1145/3589188","DOIUrl":null,"url":null,"abstract":"Graph optimization problems (such as minimum vertex cover, maximum cut, traveling salesman problems) appear in many fields including social sciences, power systems, chemistry, and bioinformatics. Recently, deep reinforcement learning (DRL) has shown success in automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised the ability of reinforcement learning in solving large-scale graph optimization problems due to lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop RL4GO, a high-performance distributed-GPU DRL framework for solving graph optimization problems. RL4GO focuses on a class of computationally demanding RL problems, where both the RL environment and policy model are highly computation intensive. Traditional reinforcement learning systems often assume either the RL environment is of low time complexity or the policy model is small. In this work, we distribute large-scale graphs across distributed GPUs and use the spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of the spatial parallelism and data parallelism and show their differences. To support graph neural network (GNN) layers that take as input data samples partitioned across distributed GPUs, we design parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design a parallel graph environment to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high-performance RL4GO training and inference algorithms in parallel. Furthermore, we propose two optimization techniques—replay buffer on-the-fly graph generation and adaptive multiple-node selection—to minimize the spatial cost and accelerate reinforcement learning. This work also conducts in-depth analyses of parallel efficiency and memory cost and shows that the designed RL4GO algorithms are scalable on numerous distributed GPUs. Evaluations on large-scale graphs show that (1) RL4GO training and inference can achieve good parallel efficiency on 192 GPUs, (2) its training time can be 18 times faster than the state-of-the-art Gorila distributed RL framework [34], and (3) its inference performance achieves a 26 times improvement over Gorila.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":"1 - 23"},"PeriodicalIF":1.7000,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems\",\"authors\":\"Weijian Zheng, Dali Wang, Fengguang Song\",\"doi\":\"10.1145/3589188\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graph optimization problems (such as minimum vertex cover, maximum cut, traveling salesman problems) appear in many fields including social sciences, power systems, chemistry, and bioinformatics. Recently, deep reinforcement learning (DRL) has shown success in automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised the ability of reinforcement learning in solving large-scale graph optimization problems due to lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop RL4GO, a high-performance distributed-GPU DRL framework for solving graph optimization problems. RL4GO focuses on a class of computationally demanding RL problems, where both the RL environment and policy model are highly computation intensive. Traditional reinforcement learning systems often assume either the RL environment is of low time complexity or the policy model is small. In this work, we distribute large-scale graphs across distributed GPUs and use the spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of the spatial parallelism and data parallelism and show their differences. To support graph neural network (GNN) layers that take as input data samples partitioned across distributed GPUs, we design parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design a parallel graph environment to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high-performance RL4GO training and inference algorithms in parallel. Furthermore, we propose two optimization techniques—replay buffer on-the-fly graph generation and adaptive multiple-node selection—to minimize the spatial cost and accelerate reinforcement learning. This work also conducts in-depth analyses of parallel efficiency and memory cost and shows that the designed RL4GO algorithms are scalable on numerous distributed GPUs. Evaluations on large-scale graphs show that (1) RL4GO training and inference can achieve good parallel efficiency on 192 GPUs, (2) its training time can be 18 times faster than the state-of-the-art Gorila distributed RL framework [34], and (3) its inference performance achieves a 26 times improvement over Gorila.\",\"PeriodicalId\":42115,\"journal\":{\"name\":\"ACM Transactions on Parallel Computing\",\"volume\":\" \",\"pages\":\"1 - 23\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2023-03-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Parallel Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3589188\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Parallel Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3589188","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

摘要

图优化问题（如最小顶点覆盖、最大割、旅行商问题）出现在许多领域，包括社会科学、电力系统、化学和生物信息学。最近，深度强化学习（DRL）在自动学习良好的启发式方法以解决图优化问题方面取得了成功。然而，现有的RL系统要么不支持图形RL环境，要么不支持分布式设置中的多个或多个GPU。由于缺乏并行性和高可扩展性，这削弱了强化学习在解决大规模图优化问题中的能力。为了应对并行化和可扩展性的挑战，我们开发了RL4GO，这是一个用于解决图形优化问题的高性能分布式GPU DRL框架。RL4GO专注于一类计算要求很高的RL问题，其中RL环境和策略模型都是高度计算密集型的。传统的强化学习系统通常认为RL环境的时间复杂度较低，或者策略模型较小。在这项工作中，我们将大规模的图分布在分布式GPU中，并使用空间并行性和数据并行性来实现可扩展的性能。我们对空间并行和数据并行的性能进行了比较和分析，并展示了它们的差异。为了支持图神经网络（GNN）层，该层将在分布式GPU上分割的数据样本作为输入数据，我们设计了并行数学内核来对分布式3D稀疏张量和3D密集张量执行操作。为了处理成本高昂的RL环境，我们设计了一个并行图环境来扩展所有与RL环境相关的操作。通过将可扩展的GNN层与可扩展的RL环境相结合，我们能够并行开发高性能的RL4GO训练和推理算法。此外，我们提出了两种优化技术——回放缓冲动态图生成和自适应多节点选择——以最小化空间成本并加速强化学习。本文还对并行效率和内存成本进行了深入分析，并表明所设计的RL4GO算法在众多分布式GPU上是可扩展的。对大型图的评估表明，（1）RL4GO训练和推理可以在192个GPU上实现良好的并行效率，（2）其训练时间可以比最先进的Gorila分布式RL框架快18倍[34]，（3）其推理性能比Gorila提高了26倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems

Graph optimization problems (such as minimum vertex cover, maximum cut, traveling salesman problems) appear in many fields including social sciences, power systems, chemistry, and bioinformatics. Recently, deep reinforcement learning (DRL) has shown success in automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised the ability of reinforcement learning in solving large-scale graph optimization problems due to lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop RL4GO, a high-performance distributed-GPU DRL framework for solving graph optimization problems. RL4GO focuses on a class of computationally demanding RL problems, where both the RL environment and policy model are highly computation intensive. Traditional reinforcement learning systems often assume either the RL environment is of low time complexity or the policy model is small. In this work, we distribute large-scale graphs across distributed GPUs and use the spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of the spatial parallelism and data parallelism and show their differences. To support graph neural network (GNN) layers that take as input data samples partitioned across distributed GPUs, we design parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design a parallel graph environment to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high-performance RL4GO training and inference algorithms in parallel. Furthermore, we propose two optimization techniques—replay buffer on-the-fly graph generation and adaptive multiple-node selection—to minimize the spatial cost and accelerate reinforcement learning. This work also conducts in-depth analyses of parallel efficiency and memory cost and shows that the designed RL4GO algorithms are scalable on numerous distributed GPUs. Evaluations on large-scale graphs show that (1) RL4GO training and inference can achieve good parallel efficiency on 192 GPUs, (2) its training time can be 18 times faster than the state-of-the-art Gorila distributed RL framework [34], and (3) its inference performance achieves a 26 times improvement over Gorila.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊