Local Search Yields a PTAS for k-Means in Doubling Metrics

Zachary Friggstad, M. Rezapour, M. Salavatipour
{"title":"Local Search Yields a PTAS for k-Means in Doubling Metrics","authors":"Zachary Friggstad, M. Rezapour, M. Salavatipour","doi":"10.1109/FOCS.2016.47","DOIUrl":null,"url":null,"abstract":"The most well known and ubiquitous clustering problem encountered in nearly every branch of science is undoubtedly k-MEANS: given a set of data points and a parameter k, select k centres and partition the data points into k clusters around these centres so that the sum of squares of distances of the points to their cluster centre is minimized. Typically these data points lie in Euclidean space Rd for some d ≥ 2. k-MEANS and the first algorithms for it were introduced in the 1950's. Over the last six decades, hundreds of papers have studied this problem and different algorithms have been proposed for it. The most commonly used algorithm in practice is known as Lloyd-Forgy, which is also referred to as \"the\" k-MEANS algorithm, and various extensions of it often work very well in practice. However, they may produce solutions whose cost is arbitrarily large compared to the optimum solution. Kanungo et al. [2004] analyzed a very simple local search heuristic to get a polynomial-time algorithm with approximation ratio 9 + ε for any fixed ε > 0 for k-Umeans in Euclidean space. Finding an algorithm with a better worst-case approximation guarantee has remained one of the biggest open questions in this area, in particular whether one can get a true PTAS for fixed dimension Euclidean space. We settle this problem by showing that a simple local search algorithm provides a PTAS for k-MEANS for Rd for any fixed d. More precisely, for any error parameter ε > 0, the local search algorithm that considers swaps of up to ρ = dO(d) · ε-O(d/ε) centres at a time will produce a solution using exactly k centres whose cost is at most a (1+ε)-factor greater than the optimum solution. Our analysis extends very easily to the more general settings where we want to minimize the sum of q'th powers of the distances between data points and their cluster centres (instead of sum of squares of distances as in k-MEANS) for any fixed q ≥ 1 and where the metric may not be Euclidean but still has fixed doubling dimension.","PeriodicalId":414001,"journal":{"name":"2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"120","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FOCS.2016.47","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 120

Abstract

The most well known and ubiquitous clustering problem encountered in nearly every branch of science is undoubtedly k-MEANS: given a set of data points and a parameter k, select k centres and partition the data points into k clusters around these centres so that the sum of squares of distances of the points to their cluster centre is minimized. Typically these data points lie in Euclidean space Rd for some d ≥ 2. k-MEANS and the first algorithms for it were introduced in the 1950's. Over the last six decades, hundreds of papers have studied this problem and different algorithms have been proposed for it. The most commonly used algorithm in practice is known as Lloyd-Forgy, which is also referred to as "the" k-MEANS algorithm, and various extensions of it often work very well in practice. However, they may produce solutions whose cost is arbitrarily large compared to the optimum solution. Kanungo et al. [2004] analyzed a very simple local search heuristic to get a polynomial-time algorithm with approximation ratio 9 + ε for any fixed ε > 0 for k-Umeans in Euclidean space. Finding an algorithm with a better worst-case approximation guarantee has remained one of the biggest open questions in this area, in particular whether one can get a true PTAS for fixed dimension Euclidean space. We settle this problem by showing that a simple local search algorithm provides a PTAS for k-MEANS for Rd for any fixed d. More precisely, for any error parameter ε > 0, the local search algorithm that considers swaps of up to ρ = dO(d) · ε-O(d/ε) centres at a time will produce a solution using exactly k centres whose cost is at most a (1+ε)-factor greater than the optimum solution. Our analysis extends very easily to the more general settings where we want to minimize the sum of q'th powers of the distances between data points and their cluster centres (instead of sum of squares of distances as in k-MEANS) for any fixed q ≥ 1 and where the metric may not be Euclidean but still has fixed doubling dimension.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
局部搜索产生k-Means加倍度量的PTAS
在几乎所有科学分支中遇到的最著名和最普遍的聚类问题无疑是k- means:给定一组数据点和参数k,选择k个中心,并将数据点围绕这些中心划分为k个聚类,从而使点到聚类中心的距离平方和最小化。通常这些数据点位于欧几里德空间Rd中,且d≥2。k-MEANS及其第一个算法是在20世纪50年代引入的。在过去的60年里,有数百篇论文研究了这个问题,并提出了不同的算法。在实践中最常用的算法被称为Lloyd-Forgy,它也被称为“k-MEANS”算法,它的各种扩展在实践中通常工作得很好。然而,与最优解决方案相比,它们可能产生成本任意大的解决方案。Kanungo等人[2004]分析了一种非常简单的局部搜索启发式算法,得到了欧几里德空间中k- u均值的近似比为9 + ε的任意固定ε > 0的多项式时间算法。寻找一种具有更好的最坏情况近似保证的算法一直是该领域最大的开放问题之一,特别是是否可以在固定维欧几里德空间中获得真正的PTAS。我们通过证明一个简单的局部搜索算法为任意固定d的Rd提供k- means的PTAS来解决这个问题。更准确地说,对于任何误差参数ε > 0,考虑一次至多ρ = dO(d)·ε- o (d/ε)中心交换的局部搜索算法将产生一个恰好使用k个中心的解,其代价最多大于最优解的(1+ε)因子。我们的分析很容易扩展到更一般的设置,我们想要最小化数据点和它们的簇中心之间的距离的q次方的总和(而不是k-MEANS中的距离的平方和)对于任何固定的q≥1,其中度量可能不是欧几里得,但仍然有固定的加倍维。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Exponential Lower Bounds for Monotone Span Programs Truly Sub-cubic Algorithms for Language Edit Distance and RNA-Folding via Fast Bounded-Difference Min-Plus Product Polynomial-Time Tensor Decompositions with Sum-of-Squares Decremental Single-Source Reachability and Strongly Connected Components in Õ(m√n) Total Update Time NP-Hardness of Reed-Solomon Decoding and the Prouhet-Tarry-Escott Problem
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1