RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning

Hwanjun Song, Jae-Gil Lee
Proceedings of the 2018 International Conference on Management of Data, published 2018-05-27. DOI: 10.1145/3183713.3196887 (https://doi.org/10.1145/3183713.3196887)
Citations: 50

Abstract

In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition so that the density of each point's neighborhood can be computed locally. This partitioning scheme causes several critical problems, including load imbalance between data partitions, especially on skewed data sets. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning, that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary, to supplement random partitions. We then develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (RP-DBSCAN for short), that uses pseudo random partitioning together with a two-level cell dictionary. The algorithm finds the local clusters in each data partition in parallel and then merges these local clusters to obtain the global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very lightweight, and clustering on each split is not held up by a single slow worker. As a result, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms by up to 180 times.
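The cell-based partitioning idea described in the abstract can be illustrated with a minimal, single-machine Python sketch. It is not the authors' Spark implementation. The cell size is an assumption made here: each cell is a hypercube whose diagonal equals eps, so any two points in the same cell lie within eps of each other. The function names `cell_id` and `pseudo_random_partition` are hypothetical, and the two-level cell dictionary and the cluster-merging step are not covered.

```python
import math
import random
from collections import defaultdict


def cell_id(point, eps):
    """Return the grid cell containing `point`.

    Assumption for this sketch: cells are hypercubes with diagonal eps,
    i.e. side length eps / sqrt(d) for d-dimensional points.
    """
    side = eps / math.sqrt(len(point))
    return tuple(math.floor(x / side) for x in point)


def pseudo_random_partition(points, eps, num_partitions, seed=0):
    """Group points into cells, then scatter whole cells over partitions.

    Each partition is a dict mapping cell id -> list of points. A cell is
    never split, so points that share a cell stay together even though
    neighboring cells may land in different partitions.
    """
    cells = defaultdict(list)
    for p in points:
        cells[cell_id(p, eps)].append(p)

    rng = random.Random(seed)  # seeded so the sketch is reproducible
    partitions = [dict() for _ in range(num_partitions)]
    for cid in sorted(cells):
        # Cells, not individual points, are the unit of random assignment.
        partitions[rng.randrange(num_partitions)][cid] = cells[cid]
    return partitions


if __name__ == "__main__":
    data = [(0.1, 0.1), (0.15, 0.12), (5.0, 5.0), (5.05, 5.02), (9.0, 1.0)]
    for i, part in enumerate(pseudo_random_partition(data, 1.0, 3, seed=42)):
        print(i, part)
```

Because whole cells are assigned at random, a dense region made of many small cells is spread over many partitions, which is the intuition behind the load-balance claim. Uniform random assignment is only one possible choice; the sketch makes no attempt to equalize partition sizes exactly.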