Efficient Big Data Clustering

Proceedings of the 22nd International Database Engineering & Applications Symposium. Pub Date: 2018-06-18. DOI: 10.1145/3216122.3216154

M. Ianni, E. Masciari, G. Mazzeo, C. Zaniolo


Citations: 1

Abstract


The need to support advanced analytics on Big Data is driving data scientists' interest toward massively parallel distributed systems and toward software platforms, such as Map-Reduce and Spark, that make the scalable use of those systems possible. However, when complex data mining algorithms are required, their fully scalable deployment on such platforms faces a number of technical challenges that grow with the complexity of the algorithms involved. Thus, algorithms that were originally designed for sequential execution must often be redesigned to use distributed computational resources effectively. In this paper, we explore these problems and then propose a solution that has proven very effective on the complex hierarchical clustering algorithm CLUBS+. Through four stages of successive refinement, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a fully unsupervised fashion. Experimental results confirm the accuracy and scalability of CLUBS+.
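To make the redesign problem concrete, the sketch below restates a sequential centroid computation in map/reduce form. It is an illustration only and is not the CLUBS+ algorithm or any of its four stages, which the abstract does not describe. The function names (`partial_sums`, `merge`, `centroids`) and the toy partitions are assumptions introduced here, and plain Python stands in for a platform such as Map-Reduce or Spark.

```python
from functools import reduce


def partial_sums(partition):
    """Map step: per-cluster (coordinate sums, count) for one data partition."""
    acc = {}
    for label, point in partition:
        sums, count = acc.get(label, ([0.0] * len(point), 0))
        acc[label] = ([s + x for s, x in zip(sums, point)], count + 1)
    return acc


def merge(a, b):
    """Reduce step: combine two partial results (associative and commutative)."""
    out = dict(a)
    for label, (sums, count) in b.items():
        if label in out:
            prev_sums, prev_count = out[label]
            out[label] = ([p + s for p, s in zip(prev_sums, sums)], prev_count + count)
        else:
            out[label] = (sums, count)
    return out


def centroids(partitions):
    """Merge the per-partition results, then divide sums by counts."""
    totals = reduce(merge, map(partial_sums, partitions), {})
    return {label: [s / count for s in sums] for label, (sums, count) in totals.items()}


# Hypothetical data: two partitions of (cluster label, point) pairs.
partitions = [
    [("a", (0.0, 0.0)), ("b", (10.0, 10.0))],
    [("a", (2.0, 4.0)), ("b", (12.0, 14.0))],
]
result = centroids(partitions)
```

Because each partition is reduced to sums and counts rather than to a local mean, the partial results can be merged in any order, which is the property a distributed runtime relies on when it schedules the map and reduce steps independently.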
