Towards scaling community detection on distributed-memory heterogeneous systems

Parallel Computing (IF 2.0; JCR Q2, Computer Science, Theory & Methods; Zone 4, Computer Science). Pub Date: 2022-07-01. DOI: 10.1016/j.parco.2022.102898
Nitin Gawande, Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman
Parallel Computing, volume 111, Article 102898. Citations: 2

Abstract

In most real-world networks, nodes/vertices tend to be organized into tightly knit modules known as communities or clusters, such that nodes within a community are more likely to be connected or related to one another than they are to the rest of the network. Community detection in a network (graph) is aimed at finding a partitioning of the vertices into communities. The goodness of the partitioning is commonly measured using modularity. Maximizing modularity is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, multi-iteration heuristic for modularity maximization called the Louvain method. Owing to its speed and ability to yield high-quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection.
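To make the modularity measure concrete, the following is a minimal sketch, assuming the standard Newman-Girvan definition for an undirected, unweighted graph: Q is the sum over communities c of e_c/m − (d_c/(2m))², where m is the number of edges, e_c the number of edges inside c, and d_c the total degree of the vertices in c. The abstract does not state the formula, and the function name `modularity` and its arguments are illustrative, not taken from the paper or its implementation.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to its community label.
    (Both names are hypothetical, chosen for this illustration.)
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # summed vertex degree per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Q = sum over communities of e_c/m - (d_c / 2m)^2
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {v: "a" for v in range(6)}

q_split = modularity(edges, two_communities)
q_merged = modularity(edges, one_community)
```

In this toy graph, placing each triangle in its own community scores higher than placing all six vertices in one community, which is the kind of improvement a modularity-maximizing heuristic such as Louvain searches for.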

Distributed multi-GPU systems pose significant challenges and opportunities for efficient execution of parallel applications. Graph algorithms, in particular, are known to be harder to parallelize on such platforms because of irregular memory accesses, low computation-to-communication ratios, and load-balancing problems that are especially hard to address on multi-GPU systems.

In this paper, we present our ongoing work on a distributed-memory implementation of the Louvain method on heterogeneous systems. We build on our prior work parallelizing the Louvain method for community detection on traditional CPU-only distributed systems. Corroborated by an extensive set of experiments on multi-GPU systems, we demonstrate performance competitive with an existing distributed-memory CPU-based implementation, up to 3.2× speedup using 16 nodes of OLCF Summit relative to two nodes, and up to 19× speedup relative to the NVIDIA RAPIDS® cuGraph® implementation on a single NVIDIA V100 GPU of a DGX-2 platform, while achieving high-quality solutions comparable to those of the original Louvain method. To the best of our knowledge, this work represents the first effort for community detection on distributed multi-GPU systems. Our approach and related findings can be extended to numerous other iterative graph algorithms on multi-GPU systems.
