Approximate Clustering Ensemble Method for Big Data
Mohammad Sultan Mahmud; Joshua Zhexue Huang; Rukhsana Ruby; Alladoumbaye Ngueilbaye; Kaishun Wu
IEEE Transactions on Big Data, vol. 9, no. 4, pp. 1142-1155, 2023. DOI: 10.1109/TBDATA.2023.3255003. https://ieeexplore.ieee.org/document/10066202/
Citations: 2
Abstract
Clustering a big distributed dataset of hundreds of gigabytes or more is a challenging task in distributed computing. A popular method for tackling this problem is to use a random sample of the big dataset to compute an approximate result as an estimate of the true result computed from the entire dataset. In this paper, instead of using a single random sample, we use multiple random samples to compute an ensemble result as the estimate of the true result of the big dataset. We propose a distributed computing framework to compute the ensemble result. In this framework, a big dataset is represented in the RSP data model as random sample data blocks managed in a distributed file system. To compute the ensemble clustering result, a set of RSP data blocks is randomly selected as random samples and clustered independently, in parallel, on the nodes of a computing cluster to generate the component clustering results. The component results are transferred to the master node, which computes the ensemble result. Since the random samples are disjoint, traditional consensus functions cannot be used; we therefore propose two new methods to integrate the component clustering results into the final ensemble result. The first method uses the component cluster centers to build a graph and the METIS algorithm to cut the graph into subgraphs, from which a set of candidate cluster centers is found; a hierarchical clustering method is then used to generate the final set of $k$ cluster centers. The second method uses the clustering-by-passing-messages method to generate the final set of $k$ cluster centers. Finally, the $k$-means algorithm is used to allocate the entire dataset into $k$ clusters. Experiments were conducted on both synthetic and real-world datasets. The results show that the new ensemble clustering methods performed better than the comparison methods and that the distributed computing framework is efficient and scalable in clustering big datasets.
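The pipeline sketched in the abstract can be made concrete with a short example. The code below is a minimal single-machine sketch, not the authors' distributed implementation: k-means over disjoint random sample blocks stands in for clustering RSP data blocks in parallel on worker nodes, and agglomerative clustering over the pooled component centers stands in for the METIS-based graph-cutting method. The synthetic dataset, block sizes, and all variable names are illustrative assumptions.

```python
# Minimal sketch of sample-based ensemble clustering, assuming scikit-learn.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))      # stand-in for a big dataset
k, n_blocks, block_size = 5, 10, 5_000  # illustrative sizes

# Step 1: cluster disjoint random sample blocks independently
# (in the paper, RSP data blocks clustered in parallel on worker nodes).
idx = rng.permutation(len(X))
blocks = np.array_split(idx[: n_blocks * block_size], n_blocks)
component_centers = np.vstack([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[b]).cluster_centers_
    for b in blocks
])

# Step 2: integrate the component results on the master node. Here,
# hierarchical clustering over the pooled centers stands in for the
# METIS-based graph method of the paper.
labels = AgglomerativeClustering(n_clusters=k).fit_predict(component_centers)
final_centers = np.vstack(
    [component_centers[labels == c].mean(axis=0) for c in range(k)]
)

# Step 3: allocate the entire dataset into k clusters, seeding k-means
# with the ensemble centers.
km = KMeans(n_clusters=k, init=final_centers, n_init=1).fit(X)
ensemble_labels = km.labels_
```

For the second method, "clustering by passing messages" corresponds to affinity propagation; a rough stand-in using scikit-learn's AffinityPropagation is sketched below. Note that affinity propagation selects the number of exemplars itself, so its preference parameter would need tuning to yield exactly $k$ candidate centers.

```python
# Variant for the second integration method (affinity propagation over
# the pooled component centers); parameters are illustrative.
from sklearn.cluster import AffinityPropagation

ap = AffinityPropagation(random_state=0).fit(component_centers)
candidate_centers = ap.cluster_centers_
```

A likely source of the framework's scalability, as the abstract suggests, is that only the component cluster centers (a few dozen vectors) travel to the master node, so the integration step stays cheap regardless of the size of the full dataset.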
About the Journal
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Covered research areas include big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields that generate massive datasets.