{"title":"稳定性产生k-Median和k-Means聚类的PTAS","authors":"Pranjal Awasthi, Avrim Blum, Or Sheffet","doi":"10.1109/FOCS.2010.36","DOIUrl":null,"url":null,"abstract":"We consider $k$-median clustering in finite metric spaces and $k$-means clustering in Euclidean spaces, in the setting where $k$ is part of the input (not a constant). For the $k$-means problem, Ostrovsky et al. show that if the optimal $(k-1)$-means clustering of the input is more expensive than the optimal $k$-means clustering by a factor of $1/\\epsilon^2$, then one can achieve a $(1+f(\\epsilon))$-approximation to the $k$-means optimal in time polynomial in $n$ and $k$ by using a variant of Lloyd's algorithm. In this work we substantially improve this approximation guarantee. We show that given only the condition that the $(k-1)$-means optimal is more expensive than the $k$-means optimal by a factor $1+\\alpha$ for {\\em some} constant $\\alpha>0$, we can obtain a PTAS. In particular, under this assumption, for any $\\eps>0$ we achieve a $(1+\\eps)$-approximation to the $k$-means optimal in time polynomial in $n$ and $k$, and exponential in $1/\\eps$ and $1/\\alpha$. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the $k$-median problem in finite metrics under the analogous assumption as well. For $k$-means, we in addition give a randomized algorithm with improved running time of $n^{O(1)}(k \\log n)^{\\poly(1/\\epsilon,1/\\alpha)}$. Our technique also obtains a PTAS under the assumption of Balcan et al. that all $(1+\\alpha)$ approximations are $\\delta$-close to a desired target clustering, in the case that all target clusters have size greater than $\\delta n$ and $\\alpha>0$ is constant. Note that the motivation of Balcan et al. is that for many clustering problems, the objective function is only a proxy for the true goal of getting close to the target. From this perspective, our improvement is that for $k$-means in Euclidean spaces we reduce the distance of the clustering found to the target from $O(\\delta)$ to $\\delta$ when all target clusters are large, and for $k$-median we improve the ``largeness'' condition needed in the work of Balcan et al. to get exactly $\\delta$-close from $O(\\delta n)$ to $\\delta n$. Our results are based on a new notion of clustering stability.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"93","resultStr":"{\"title\":\"Stability Yields a PTAS for k-Median and k-Means Clustering\",\"authors\":\"Pranjal Awasthi, Avrim Blum, Or Sheffet\",\"doi\":\"10.1109/FOCS.2010.36\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider $k$-median clustering in finite metric spaces and $k$-means clustering in Euclidean spaces, in the setting where $k$ is part of the input (not a constant). For the $k$-means problem, Ostrovsky et al. show that if the optimal $(k-1)$-means clustering of the input is more expensive than the optimal $k$-means clustering by a factor of $1/\\\\epsilon^2$, then one can achieve a $(1+f(\\\\epsilon))$-approximation to the $k$-means optimal in time polynomial in $n$ and $k$ by using a variant of Lloyd's algorithm. In this work we substantially improve this approximation guarantee. 
We show that given only the condition that the $(k-1)$-means optimal is more expensive than the $k$-means optimal by a factor $1+\\\\alpha$ for {\\\\em some} constant $\\\\alpha>0$, we can obtain a PTAS. In particular, under this assumption, for any $\\\\eps>0$ we achieve a $(1+\\\\eps)$-approximation to the $k$-means optimal in time polynomial in $n$ and $k$, and exponential in $1/\\\\eps$ and $1/\\\\alpha$. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the $k$-median problem in finite metrics under the analogous assumption as well. For $k$-means, we in addition give a randomized algorithm with improved running time of $n^{O(1)}(k \\\\log n)^{\\\\poly(1/\\\\epsilon,1/\\\\alpha)}$. Our technique also obtains a PTAS under the assumption of Balcan et al. that all $(1+\\\\alpha)$ approximations are $\\\\delta$-close to a desired target clustering, in the case that all target clusters have size greater than $\\\\delta n$ and $\\\\alpha>0$ is constant. Note that the motivation of Balcan et al. is that for many clustering problems, the objective function is only a proxy for the true goal of getting close to the target. From this perspective, our improvement is that for $k$-means in Euclidean spaces we reduce the distance of the clustering found to the target from $O(\\\\delta)$ to $\\\\delta$ when all target clusters are large, and for $k$-median we improve the ``largeness'' condition needed in the work of Balcan et al. to get exactly $\\\\delta$-close from $O(\\\\delta n)$ to $\\\\delta n$. Our results are based on a new notion of clustering stability.\",\"PeriodicalId\":228365,\"journal\":{\"name\":\"2010 IEEE 51st Annual Symposium on Foundations of Computer Science\",\"volume\":\"13 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-10-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"93\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2010 IEEE 51st Annual Symposium on Foundations of Computer Science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FOCS.2010.36\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FOCS.2010.36","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Stability Yields a PTAS for k-Median and k-Means Clustering
We consider $k$-median clustering in finite metric spaces and $k$-means clustering in Euclidean spaces, in the setting where $k$ is part of the input (not a constant). For the $k$-means problem, Ostrovsky et al. show that if the optimal $(k-1)$-means clustering of the input is more expensive than the optimal $k$-means clustering by a factor of $1/\epsilon^2$, then one can achieve a $(1+f(\epsilon))$-approximation to the $k$-means optimum in time polynomial in $n$ and $k$ by using a variant of Lloyd's algorithm. In this work we substantially improve this approximation guarantee. We show that given only the condition that the optimal $(k-1)$-means cost is more expensive than the optimal $k$-means cost by a factor of $1+\alpha$ for {\em some} constant $\alpha>0$, we can obtain a PTAS. In particular, under this assumption, for any $\epsilon>0$ we achieve a $(1+\epsilon)$-approximation to the $k$-means optimum in time polynomial in $n$ and $k$, and exponential in $1/\epsilon$ and $1/\alpha$. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the $k$-median problem in finite metrics under the analogous assumption. For $k$-means, we additionally give a randomized algorithm with an improved running time of $n^{O(1)}(k \log n)^{\mathrm{poly}(1/\epsilon,1/\alpha)}$. Our technique also yields a PTAS under the assumption of Balcan et al. that all $(1+\alpha)$-approximations are $\delta$-close to a desired target clustering, in the case that all target clusters have size greater than $\delta n$ and $\alpha>0$ is constant. Note that the motivation of Balcan et al. is that for many clustering problems, the objective function is only a proxy for the true goal of getting close to the target. From this perspective, our improvement is that for $k$-means in Euclidean spaces we reduce the distance between the clustering found and the target from $O(\delta)$ to $\delta$ when all target clusters are large, and for $k$-median we improve the "largeness" condition needed in the work of Balcan et al. to get exactly $\delta$-close from $O(\delta n)$ to $\delta n$. Our results are based on a new notion of clustering stability.
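To make the separation condition concrete, the following is a minimal sketch (not the paper's algorithm) that empirically estimates the ratio between the optimal $(k-1)$-means and $k$-means costs on a dataset. It assumes scikit-learn's KMeans, whose Lloyd-style iterations find only local optima, so the computed ratio is a heuristic estimate of the stability factor $1+\alpha$, not an exact test of the condition.

import numpy as np
from sklearn.cluster import KMeans

def separation_ratio(X, k, n_init=50, seed=0):
    # Heuristically estimate OPT_{k-1} / OPT_k for the k-means objective on X.
    # inertia_ is the sum of squared distances of points to their nearest center;
    # multiple restarts (n_init) reduce, but do not eliminate, local-optimum error.
    cost_k = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(X).inertia_
    cost_km1 = KMeans(n_clusters=k - 1, n_init=n_init, random_state=seed).fit(X).inertia_
    return cost_km1 / cost_k

# Example: three well-separated Gaussian blobs in the plane. The estimated
# OPT_2 / OPT_3 ratio should be large, i.e. the instance is (1+alpha)-stable
# for a sizable alpha, which is the regime where the PTAS applies.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(100, 2)) for c in centers])
ratio = separation_ratio(X, k=3)
print(f"estimated OPT_2/OPT_3 ratio: {ratio:.2f} (alpha ~ {ratio - 1:.2f})")

On data with no natural $k$-cluster structure the ratio approaches 1, i.e. $\alpha$ is close to 0 and the stability assumption fails.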