Local Search Yields a PTAS for k-Means in Doubling Metrics

2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS) · Pub Date: 2016-03-29 · DOI: 10.1109/FOCS.2016.47

Zachary Friggstad, M. Rezapour, M. Salavatipour


Citations: 120

Abstract

The most well known and ubiquitous clustering problem encountered in nearly every branch of science is undoubtedly k-MEANS: given a set of data points and a parameter k, select k centres and partition the data points into k clusters around these centres so that the sum of squares of distances of the points to their cluster centre is minimized. Typically these data points lie in Euclidean space R^d for some d ≥ 2. k-MEANS and the first algorithms for it were introduced in the 1950s. Over the last six decades, hundreds of papers have studied this problem and different algorithms have been proposed for it. The most commonly used algorithm in practice is known as Lloyd-Forgy, which is also referred to as "the" k-MEANS algorithm, and various extensions of it often work very well in practice. However, they may produce solutions whose cost is arbitrarily large compared to the optimum solution. Kanungo et al. [2004] analyzed a very simple local search heuristic to get a polynomial-time algorithm with approximation ratio 9 + ε for any fixed ε > 0 for k-MEANS in Euclidean space. Finding an algorithm with a better worst-case approximation guarantee has remained one of the biggest open questions in this area, in particular whether one can get a true PTAS for fixed-dimension Euclidean space. We settle this problem by showing that a simple local search algorithm provides a PTAS for k-MEANS in R^d for any fixed d. More precisely, for any error parameter ε > 0, the local search algorithm that considers swaps of up to ρ = d^{O(d)} · ε^{-O(d/ε)} centres at a time will produce a solution using exactly k centres whose cost is at most a (1+ε)-factor greater than the optimum solution.
Our analysis extends very easily to the more general settings where we want to minimize the sum of q-th powers of the distances between data points and their cluster centres (instead of the sum of squares of distances as in k-MEANS) for any fixed q ≥ 1 and where the metric may not be Euclidean but still has fixed doubling dimension.


