A Rapid Prototyping Approach for High Performance Density-Based Clustering
Saiyedul Islam, S. Balasubramaniam, Poonam Goyal, Ankit Sultana, Lakshit Bhutani, S. Raje, Navneet Goyal
2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), published 2019-10-01
DOI: 10.1109/DSAA.2019.00041 (https://doi.org/10.1109/DSAA.2019.00041)
Citations: 2
Abstract
Big Data has significantly increased the data analytics community's dependence on High Performance Computing (HPC) systems. However, programming an HPC system efficiently is still a tedious task that requires specialized skills in parallelization and in the use of platform-specific languages and mechanisms. We present a framework for quickly prototyping new or existing density-based clustering algorithms while obtaining low running times and high speedups via automatic parallelization. The user only needs to specify the sequential algorithm in a Domain Specific Language (DSL) for clustering, at a very high level of abstraction. The parallelizing compiler for the DSL does the rest to leverage distributed systems, in particular typical scale-out clusters made of commodity hardware. Our approach is based on recurring, parallelizable programming patterns known as Kernels, which are identified and parallelized by the compiler. We demonstrate the ease of programming and scalable performance for the DBSCAN, SNN, and RECOME algorithms. We also establish that the proposed approach can achieve performance comparable to state-of-the-art manually parallelized implementations while requiring minimal programming effort, several orders of magnitude smaller than that required on other parallel platforms such as MPI and Spark.
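The abstract does not show the DSL, so the following is neither the authors' language nor their compiler's output. As a rough illustration of what a sequential specification of one of the named algorithms looks like, here is a minimal plain-Python DBSCAN sketch. The neighborhood search is isolated in a hypothetical `region_query` function. Treating that function as an example of a recurring, parallelizable pattern is an assumption made here; the abstract does not say which patterns the compiler identifies as Kernels.

```python
from math import dist

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i.

    Hypothetical helper: each call is independent of the others, which is
    why it is singled out here as a candidate for parallel execution.
    """
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Sequential DBSCAN; returns one cluster label per point."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                frontier.extend(j_neighbors)  # j is a core point: expand
    return labels


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(sample, eps=1.5, min_pts=3))
```

In this sketch the quadratic cost sits almost entirely in `region_query`, which reads shared data and writes nothing, while the cluster-expansion loop carries the sequential dependencies. That split is one plausible reading of why density-based clustering lends itself to pattern-based automatic parallelization.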