Efficient Parallel Algorithms for k-Center Clustering

2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-04-12 DOI:10.1109/ICPP.2016.22

J. McClintock, Anthony Wirth


Citations: 6

Abstract

The k-center problem is a classic NP-hard clustering question. For contemporary massive data sets, RAM-based algorithms become impractical. Although there exist good algorithms for k-center, they are all inherently sequential. In this paper, we design and implement parallel approximation algorithms for k-center. We observe that Gonzalez's greedy algorithm can be efficiently parallelized in several MapReduce rounds; in practice, we find that two rounds are sufficient, leading to a 4-approximation. This parallel scheme is about 100 times faster than the sequential Gonzalez algorithm, and barely compromises solution quality. We contrast this with an existing parallel algorithm for k-center that offers a 10-approximation. Our analysis reveals that this scheme is often slow, and that its sampling procedure only runs if k is sufficiently small relative to the input size. In practice, it is slightly more effective than Gonzalez's approach, but is slow. To trade off runtime for approximation guarantee, we parameterize this sampling algorithm. We prove a lower bound on the parameter for effectiveness, and find experimentally that with values even lower than the bound, the algorithm is not only faster, but sometimes more effective.
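To make the two-round idea concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes Euclidean points, simulates the machines with list slices rather than a real MapReduce framework, and uses hypothetical names (`gonzalez`, `two_round_gonzalez`, `radius`) chosen for illustration. Round one runs Gonzalez's farthest-first traversal on each partition; round two runs it again on the union of the per-partition centers.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def gonzalez(points: Sequence[Point], k: int) -> List[Point]:
    """Sequential farthest-first traversal (a 2-approximation for k-center)."""
    if not points or k <= 0:
        return []
    centers = [points[0]]  # the first center is arbitrary
    # dist[i] holds the distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # pick the point that is currently farthest from every center
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]
    return centers


def radius(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """k-center objective: largest distance from a point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


def two_round_gonzalez(points: Sequence[Point], k: int, machines: int) -> List[Point]:
    """Simulated two-round scheme; the partitioning rule here is an assumption."""
    # Round 1 ("map"): each simulated machine picks k centers from its own slice.
    parts = [points[i::machines] for i in range(machines)]
    candidates = [c for part in parts for c in gonzalez(part, k)]
    # Round 2 ("reduce"): one machine picks k centers among the candidates.
    return gonzalez(candidates, k)


# Illustrative usage on synthetic data (the data and sizes are made up).
rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(2000)]
seq_radius = radius(data, gonzalez(data, 10))
par_radius = radius(data, two_round_gonzalez(data, 10, machines=8))
print(f"sequential radius: {seq_radius:.4f}, two-round radius: {par_radius:.4f}")
```

In this sketch each round loses at most a factor of 2 against the optimal radius, which is where the factor of 4 comes from; only the round-one calls are independent and could actually be distributed.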

