Towards scaling community detection on distributed-memory heterogeneous systems

IF 2.1 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Parallel Computing | Pub Date: 2022-07-01 | Epub Date: 2022-02-22 | DOI: 10.1016/j.parco.2022.102898

Nitin Gawande, Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman

Parallel Computing, volume 111, Article 102898. Journal article, not open access. Article page: https://www.sciencedirect.com/science/article/pii/S0167819122000060

Citations: 2

Abstract

In most real-world networks, nodes (vertices) tend to be organized into tightly knit modules known as communities or clusters, such that nodes within a community are more likely to be connected or related to one another than to the rest of the network. Community detection in a network (graph) aims to find a partitioning of the vertices into communities. The goodness of the partitioning is commonly measured using modularity. Maximizing modularity is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, multi-iteration heuristic for modularity maximization called the Louvain method. Owing to its speed and ability to yield high-quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection.
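To make the two ideas above concrete, the following is a minimal serial sketch of modularity and of one Louvain-style local-moving sweep. It assumes the standard Newman definition of modularity for an undirected, unweighted graph, Q = Σ_c [L_c/m − (d_c/2m)²], where L_c is the number of edges inside community c, d_c is the sum of degrees in c, and m is the total edge count. The function names `modularity` and `local_move_pass` are hypothetical, and the sweep naively recomputes Q for every trial move; it is not the authors' distributed multi-GPU implementation.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to a community label.
    """
    m = len(edges)
    intra = defaultdict(int)   # edges with both endpoints in community c
    degree = defaultdict(int)  # sum of vertex degrees in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def local_move_pass(edges, community):
    """One sweep of Louvain-style local moving (naive version).

    Each vertex is tentatively moved into the community of each neighbor;
    the move giving the largest strict increase in Q is kept.
    Returns the updated assignment and whether any vertex moved.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    community = dict(community)  # work on a copy
    moved = False
    for v in sorted(neighbors):
        best_c = community[v]
        best_q = modularity(edges, community)
        for c in sorted({community[w] for w in neighbors[v]}):
            trial = dict(community)
            trial[v] = c
            q = modularity(edges, trial)
            if q > best_q + 1e-12:  # accept strict improvements only
                best_c, best_q = c, q
        if best_c != community[v]:
            community[v] = best_c
            moved = True
    return community, moved


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(edges, by_triangle))  # 5/14, about 0.357

# Start from singleton communities and sweep until no vertex moves.
assignment = {v: v for v in range(6)}
moved = True
while moved:
    assignment, moved = local_move_pass(edges, assignment)
print(assignment, modularity(edges, assignment))
```

Splitting the toy graph along the bridge gives Q = 5/14, while placing all six vertices in one community gives Q = 0, which matches the intuition that modularity rewards dense groups with few edges between them. The full Louvain method additionally collapses each community into a single vertex after the sweeps converge and repeats the process on the coarsened graph; that phase is omitted here.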

Distributed multi-GPU systems pose significant challenges and opportunities for efficient execution of parallel applications. Graph algorithms, in particular, are known to be harder to parallelize on such platforms because of irregular memory accesses, low computation-to-communication ratios, and load-balancing problems that are especially hard to address on multi-GPU systems.

In this paper, we present our ongoing work on a distributed-memory implementation of the Louvain method on heterogeneous systems. We build on our prior work parallelizing the Louvain method for community detection on traditional CPU-only distributed systems. Corroborated by an extensive set of experiments on multi-GPU systems, we demonstrate performance competitive with an existing distributed-memory CPU-based implementation, up to $3.2\times$ speedup using 16 nodes of OLCF Summit relative to two nodes, and up to $19\times$ speedup relative to the NVIDIA RAPIDS® cuGraph® implementation on a single NVIDIA V100 GPU from the DGX-2 platform, while achieving high-quality solutions comparable to the original Louvain method. To the best of our knowledge, this work represents the first effort for community detection on distributed multi-GPU systems. Our approach and related findings can be extended to numerous other iterative graph algorithms on multi-GPU systems.
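As a rough reading aid for the scaling figure, and assuming the usual definition of relative parallel efficiency (a definition the abstract itself does not state), the best-case speedup $S = 3.2$ obtained when going from $p_0 = 2$ to $p = 16$ nodes corresponds to:

$$E = \frac{S}{p / p_0} = \frac{3.2}{16 / 2} = \frac{3.2}{8} = 0.4$$

Under that assumption, an eightfold increase in node count yields at most about 40% of the ideal linear speedup, which is consistent with the low computation-to-communication ratios the abstract attributes to graph algorithms.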




Source journal: Parallel Computing (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 3.50

Self-citation rate: 7.10%

Review time: 4.5 months

About the journal: Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing, from single homogeneous or heterogeneous computing nodes to large-scale multi-node systems. Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. The journal also welcomes studies reproducing prior publications that either confirm or disprove prior published results. Particular technical areas of interest include, but are not limited to:
- System software for parallel computer systems, including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing).
- Enabling software, including debuggers, performance tools, and system and numeric libraries.
- General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems.
- Software engineering and productivity as it relates to parallel computing.
- Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism.
- Performance measurement results on state-of-the-art systems.
- Approaches to effectively utilize large-scale parallel computing, including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next-generation parallel computer architectures.
- Parallel I/O systems, both hardware and software.
- Networking technology for support of high-speed computing, demonstrating the impact of high-speed computation on parallel applications.