A Rapid Prototyping Approach for High Performance Density-Based Clustering

Saiyedul Islam, S. Balasubramaniam, Poonam Goyal, Ankit Sultana, Lakshit Bhutani, S. Raje, Navneet Goyal
{"title":"A Rapid Prototyping Approach for High Performance Density-Based Clustering","authors":"Saiyedul Islam, S. Balasubramaniam, Poonam Goyal, Ankit Sultana, Lakshit Bhutani, S. Raje, Navneet Goyal","doi":"10.1109/DSAA.2019.00041","DOIUrl":null,"url":null,"abstract":"Big Data has significantly increased the dependence of data analytics community on High Performance Computing (HPC) systems. However, efficiently programming an HPC system is still a tedious task requiring specialized skills in parallelization and the use of platform-specific languages as well as mechanisms. We present a framework for quickly prototyping new/existing density-based clustering algorithms while obtaining low running times and high speedups via automatic parallelization. The user is required only to specify the sequential algorithm in a Domain Specific Language (DSL) for clustering at a very high level of abstraction. The parallelizing compiler for the DSL does the rest to leverage distributed systems - in particular, typical scale-out clusters made of commodity hardware. Our approach is based on recurring, parallelizable programming patterns known as Kernels, which are identified and parallelized by the compiler. We demonstrate the ease of programming and scalable performance for DBSCAN, SNN, and RECOME algorithms. We also establish that the proposed approach can achieve performance comparable to state-of-the-art manually parallelized implementations while requiring minimal programming effort that is several orders of magnitude smaller than those required on other parallel platforms like MPI/Spark.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSAA.2019.00041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Big Data has significantly increased the dependence of data analytics community on High Performance Computing (HPC) systems. However, efficiently programming an HPC system is still a tedious task requiring specialized skills in parallelization and the use of platform-specific languages as well as mechanisms. We present a framework for quickly prototyping new/existing density-based clustering algorithms while obtaining low running times and high speedups via automatic parallelization. The user is required only to specify the sequential algorithm in a Domain Specific Language (DSL) for clustering at a very high level of abstraction. The parallelizing compiler for the DSL does the rest to leverage distributed systems - in particular, typical scale-out clusters made of commodity hardware. Our approach is based on recurring, parallelizable programming patterns known as Kernels, which are identified and parallelized by the compiler. We demonstrate the ease of programming and scalable performance for DBSCAN, SNN, and RECOME algorithms. We also establish that the proposed approach can achieve performance comparable to state-of-the-art manually parallelized implementations while requiring minimal programming effort that is several orders of magnitude smaller than those required on other parallel platforms like MPI/Spark.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
高性能基于密度的聚类的快速原型方法
大数据极大地增加了数据分析社区对高性能计算(HPC)系统的依赖。然而,高效地为HPC系统编程仍然是一项繁琐的任务,需要在并行化和使用特定于平台的语言以及机制方面的专业技能。我们提出了一个框架,用于快速原型化新的/现有的基于密度的聚类算法,同时通过自动并行化获得低运行时间和高速度。用户只需要用领域特定语言(Domain Specific Language, DSL)指定序列算法,以便在非常高的抽象级别上进行聚类。DSL的并行编译器会完成其余的工作,以利用分布式系统——特别是由普通硬件组成的典型横向扩展集群。我们的方法基于循环的、可并行的编程模式,称为内核,它由编译器识别和并行化。我们演示了DBSCAN、SNN和RECOME算法的编程便利性和可扩展性能。我们还确定,所提出的方法可以达到与最先进的手动并行实现相当的性能,同时需要最少的编程工作,比MPI/Spark等其他并行平台所需的编程工作小几个数量级。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
A Rapid Prototyping Approach for High Performance Density-Based Clustering Automating Big Data Analysis Based on Deep Learning Generation by Automatic Service Composition Detecting Sensitive Content in Spoken Language Improving the Personalized Recommendation in the Cold-start Scenarios Colorwall: An Embedded Temporal Display of Bibliographic Data
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1