Scalable Fine-Grained Parallel Cycle Enumeration Algorithms

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures. Pub Date: 2022-02-19. DOI: 10.1145/3490148.3538585

J. Blanuša, P. Ienne, K. Atasu


Citations: 5



Scalable Fine-Grained Parallel Cycle Enumeration Algorithms

Enumerating simple cycles has important applications in computational biology, network science, and financial crime analysis. In this work, we focus on parallelising the state-of-the-art simple cycle enumeration algorithms by Johnson and Read-Tarjan along with their applications to temporal graphs. To our knowledge, we are the first ones to parallelise these two algorithms in a fine-grained manner. We are also the first to demonstrate experimentally a linear performance scaling. Such a scaling is made possible by our decomposition of long sequential searches into fine-grained tasks, which are then dynamically scheduled across CPU cores, enabling an optimal load balancing. Furthermore, we show that coarse-grained parallel versions of the Johnson and the Read-Tarjan algorithms that exploit edge- or vertex-level parallelism are not scalable. On a cluster of four multi-core CPUs with 256 physical cores, our fine-grained parallel algorithms are, on average, an order of magnitude faster than their coarse-grained parallel counterparts. The performance gap between the fine-grained and the coarse-grained parallel algorithms widens as we use more CPU cores. When using all 256 CPU cores, our parallel algorithms enumerate temporal cycles, on average, 260x faster than the serial algorithm of Kumar and Calders.
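The contrast the abstract draws between coarse-grained and fine-grained parallelism can be illustrated with a minimal sketch. It is not the authors' method: it uses a plain brute-force depth-first search rather than the pruning of the Johnson or Read-Tarjan algorithms, and a single `deque` stands in for a dynamic scheduler, so nothing here actually runs in parallel. The function name `enumerate_cycles_task_based` and the dictionary graph representation are assumptions made for illustration. The point is only the shape of the decomposition: one root task per start vertex corresponds to vertex-level (coarse-grained) parallelism, whereas turning every path extension into its own task yields the many small units of work that a scheduler could balance across cores.

```python
from collections import deque


def enumerate_cycles_task_based(graph):
    """Enumerate the simple cycles of a directed graph as a pool of small tasks.

    graph: dict mapping each vertex to an iterable of successor vertices
    (hypothetical representation; vertices must be mutually comparable).

    Each task is one partial path. Expanding a task spawns one child task per
    admissible out-edge, so a long search from a single start vertex breaks up
    into many independent pieces. Brute-force DFS, without Johnson's blocking.
    """
    # Coarse-grained roots: one task per start vertex.
    tasks = deque((v,) for v in sorted(graph))
    cycles = []
    while tasks:
        # In a parallel runtime, any idle worker could take any task here.
        path = tasks.pop()
        start, last = path[0], path[-1]
        for nxt in graph.get(last, ()):
            if nxt == start:
                # The edge last -> start closes a simple cycle.
                cycles.append(path)
            elif nxt > start and nxt not in path:
                # Fine-grained child task. Requiring nxt > start reports each
                # cycle once, rooted at its smallest vertex.
                tasks.append(path + (nxt,))
    return cycles


if __name__ == "__main__":
    example = {0: [1], 1: [2, 0], 2: [0]}
    for cycle in sorted(enumerate_cycles_task_based(example)):
        print(cycle)
```

Each returned tuple lists the vertices of one cycle in order, starting from its smallest vertex; the closing edge back to that vertex is implied. A real parallel implementation would also have to handle how concurrent tasks share or copy search state, which this sketch sidesteps by carrying the whole path inside each task.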


Source journal

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Self-citation rate

0.00%

Publication count