A Communication Efficient Parallel DBSCAN Algorithm based on Parameter Server
Xu Hu, Jun Huang, Minghui Qiu
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017-11-06
DOI: 10.1145/3132847.3133112 (https://doi.org/10.1145/3132847.3133112)
Recent benchmark studies show that MPI-based distributed implementations of DBSCAN, such as PDSDBSCAN, outperform other implementations, including those built on Apache Spark. However, the communication cost of MPI-based DBSCAN increases drastically with the number of processors, which makes it inefficient for large-scale problems. In this paper, we propose PS-DBSCAN, a parallel DBSCAN algorithm that combines the disjoint-set data structure with the Parameter Server framework to minimize communication cost. Data points within the same cluster may be distributed over different workers, which results in several disjoint-sets, and merging them incurs large communication costs. Our algorithm employs a fast global union approach to merge the disjoint-sets and alleviate this communication burden. Experiments on datasets of different scales demonstrate that PS-DBSCAN outperforms PDSDBSCAN, achieving a 2 to 10 times speedup in communication efficiency. We have released PS-DBSCAN on Platform of AI (PAI), an algorithm platform in Alibaba Cloud.
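To make the merging problem concrete, the following is a minimal single-process sketch of how a disjoint-set (union-find) structure can combine partial clusters reported by different workers. It is not the paper's fast global union approach and involves no Parameter Server communication; the names `DisjointSet` and `global_clusters`, and the representation of each worker's output as a list of density-connected point-id pairs, are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple


class DisjointSet:
    """Plain union-find with path compression and union by size."""

    def __init__(self) -> None:
        self.parent: Dict[int, int] = {}
        self.size: Dict[int, int] = {}

    def find(self, x: int) -> int:
        # Lazily register unseen points as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # First pass: locate the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def global_clusters(worker_edges: List[List[Tuple[int, int]]]) -> List[List[int]]:
    """Merge per-worker connectivity pairs into global clusters.

    Assumption: each worker reports pairs of point ids that it found to be
    density-connected. A point id that appears in pairs from two workers
    links their partial clusters together.
    """
    ds = DisjointSet()
    for edges in worker_edges:
        for a, b in edges:
            ds.union(a, b)
    groups: Dict[int, List[int]] = {}
    for point in list(ds.parent):
        groups.setdefault(ds.find(point), []).append(point)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    # Worker 0 sees points 1-3; worker 1 sees 3-4 and a separate pair 5-6.
    print(global_clusters([[(1, 2), (2, 3)], [(3, 4), (5, 6)]]))
    # prints [[1, 2, 3, 4], [5, 6]]
```

In this toy example, point 3 appears in both workers' pairs, so the two partial clusters collapse into one. In a distributed setting, each such cross-worker union would require exchanging messages, which is the cost the abstract describes.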