RP-DBSCAN:一种基于随机分区的超高速并行DBSCAN算法

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI:10.1145/3183713.3196887

Hwanjun Song, Jae-Gil Lee

{"title":"RP-DBSCAN:一种基于随机分区的超高速并行DBSCAN算法","authors":"Hwanjun Song, Jae-Gil Lee","doi":"10.1145/3183713.3196887","DOIUrl":null,"url":null,"abstract":"In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems including load imbalance between data partitions, especially in a skewed data set. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning , that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary , to supplement random partitions. Then, we develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (shortly, RP-DBSCAN), that uses pseudo random partitioning together with a two-level cell dictionary. The algorithm simultaneously finds the local clusters to each data partition and then merges these local clusters to obtain global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very light, and clustering on each split is not dragged out by a specific worker. Therefore, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms by up to 180 times.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"147 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":"{\"title\":\"RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning\",\"authors\":\"Hwanjun Song, Jae-Gil Lee\",\"doi\":\"10.1145/3183713.3196887\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems including load imbalance between data partitions, especially in a skewed data set. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning , that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary , to supplement random partitions. Then, we develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (shortly, RP-DBSCAN), that uses pseudo random partitioning together with a two-level cell dictionary. The algorithm simultaneously finds the local clusters to each data partition and then merges these local clusters to obtain global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very light, and clustering on each split is not dragged out by a specific worker. Therefore, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms by up to 180 times.\",\"PeriodicalId\":20430,\"journal\":{\"name\":\"Proceedings of the 2018 International Conference on Management of Data\",\"volume\":\"147 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"50\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3183713.3196887\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3183713.3196887","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 50

摘要

在大多数并行DBSCAN算法中，将相邻点分配到同一数据分区进行并行处理，以方便计算相邻点的密度。这种数据分区方案会导致一些关键问题，包括数据分区之间的负载不平衡，特别是在倾斜数据集中。为了解决这些问题，我们提出了一种基于单元的数据分区方案，即伪随机分区，它随机分布小单元而不是点本身。无论数据偏度如何，它都能实现高负载平衡，同时保留DBSCAN所需的数据连续性。此外，我们构建并传播整个数据集的高度紧凑的摘要，我们称之为两级单元字典，以补充随机分区。然后，我们开发了一种新的并行DBSCAN算法，随机分区-DBSCAN(简称RP-DBSCAN)，它使用伪随机分区和两级单元字典。该算法同时找到每个数据分区的局部聚类，然后将这些局部聚类合并得到全局聚类。为了验证我们的方法的优点，我们在Spark上实现了RP-DBSCAN，并在12台Microsoft Azure机器(48核)上使用各种实际数据集进行了广泛的实验。在RP-DBSCAN中，数据分区和集群合并非常简单，并且每个分割上的集群不会被特定的工作人员拖出。因此，性能结果表明，RP-DBSCAN显著优于最先进的算法高达180倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning

In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems including load imbalance between data partitions, especially in a skewed data set. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning , that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary , to supplement random partitions. Then, we develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (shortly, RP-DBSCAN), that uses pseudo random partitioning together with a two-level cell dictionary. The algorithm simultaneously finds the local clusters to each data partition and then merges these local clusters to obtain global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very light, and clustering on each split is not dragged out by a specific worker. Therefore, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms by up to 180 times.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助