{"title":"具有Bregman散度的k-均值聚类的最坏情况和平滑分析","authors":"B. Manthey, Heiko Röglin","doi":"10.20382/jocg.v4i1a5","DOIUrl":null,"url":null,"abstract":"The k-means algorithm is the method of choice for clustering large-scale data sets and it performs exceedingly well in practice. Most of the theoretical work is restricted to the case that squared Euclidean distances are used as similarity measure. In many applications, however, data is to be clustered with respect to other measures like, e.g., relative entropy, which is commonly used to cluster web pages. In this paper, we analyze the running-time of the k-means method for Bregman divergences, a very general class of similarity measures including squared Euclidean distances and relative entropy. We show that the exponential lower bound known for the Euclidean case carries over to almost every Bregman divergence. To narrow the gap between theory and practice, we also study k-means in the semi-random input model of smoothed analysis. For the case that n data points in ? d are perturbed by noise with standard deviation ?, we show that for almost arbitrary Bregman divergences the expected running-time is bounded by ${\\rm poly}(n^{\\sqrt k}, 1/\\sigma)$ and k kd ·poly(n, 1/?).","PeriodicalId":43044,"journal":{"name":"Journal of Computational Geometry","volume":"16 1","pages":"1024-1033"},"PeriodicalIF":0.4000,"publicationDate":"2009-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":"{\"title\":\"Worst-Case and Smoothed Analysis of k-Means Clustering with Bregman Divergences\",\"authors\":\"B. Manthey, Heiko Röglin\",\"doi\":\"10.20382/jocg.v4i1a5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The k-means algorithm is the method of choice for clustering large-scale data sets and it performs exceedingly well in practice. Most of the theoretical work is restricted to the case that squared Euclidean distances are used as similarity measure. In many applications, however, data is to be clustered with respect to other measures like, e.g., relative entropy, which is commonly used to cluster web pages. In this paper, we analyze the running-time of the k-means method for Bregman divergences, a very general class of similarity measures including squared Euclidean distances and relative entropy. We show that the exponential lower bound known for the Euclidean case carries over to almost every Bregman divergence. To narrow the gap between theory and practice, we also study k-means in the semi-random input model of smoothed analysis. For the case that n data points in ? 
d are perturbed by noise with standard deviation ?, we show that for almost arbitrary Bregman divergences the expected running-time is bounded by ${\\\\rm poly}(n^{\\\\sqrt k}, 1/\\\\sigma)$ and k kd ·poly(n, 1/?).\",\"PeriodicalId\":43044,\"journal\":{\"name\":\"Journal of Computational Geometry\",\"volume\":\"16 1\",\"pages\":\"1024-1033\"},\"PeriodicalIF\":0.4000,\"publicationDate\":\"2009-12-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"28\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computational Geometry\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.20382/jocg.v4i1a5\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"MATHEMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computational Geometry","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.20382/jocg.v4i1a5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MATHEMATICS","Score":null,"Total":0}
Worst-Case and Smoothed Analysis of k-Means Clustering with Bregman Divergences

B. Manthey, Heiko Röglin · Journal of Computational Geometry, pp. 1024–1033 · doi:10.20382/jocg.v4i1a5
The k-means algorithm is the method of choice for clustering large-scale data sets, and it performs exceedingly well in practice. Most of the theoretical work is restricted to the case that squared Euclidean distances are used as the similarity measure. In many applications, however, data are to be clustered with respect to other measures, such as relative entropy, which is commonly used to cluster web pages. In this paper, we analyze the running time of the k-means method for Bregman divergences, a very general class of similarity measures that includes squared Euclidean distances and relative entropy. We show that the exponential lower bound known for the Euclidean case carries over to almost every Bregman divergence. To narrow the gap between theory and practice, we also study k-means in the semi-random input model of smoothed analysis. For the case that $n$ data points in $\mathbb{R}^d$ are perturbed by noise with standard deviation $\sigma$, we show that for almost arbitrary Bregman divergences the expected running time is bounded by ${\rm poly}(n^{\sqrt{k}}, 1/\sigma)$ and $k^{kd} \cdot {\rm poly}(n, 1/\sigma)$.
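The algorithm under analysis is the classical k-means (Lloyd) method with the assignment step carried out under a Bregman divergence $D_\varphi(x, c) = \varphi(x) - \varphi(c) - \langle \nabla\varphi(c), x - c \rangle$ for a strictly convex $\varphi$. As a point of reference, the following is a minimal sketch of that method, not code from the paper: for every Bregman divergence the arithmetic mean minimizes the summed divergence of a cluster to its center, so the update step is identical to classical k-means and only the assignment step changes. The function names and the NumPy realization are illustrative assumptions.

```python
import numpy as np

def squared_euclidean(x, c):
    # Bregman divergence generated by phi(x) = ||x||^2 (classical k-means).
    return np.sum((x - c) ** 2, axis=-1)

def relative_entropy(x, c):
    # Generalized Kullback-Leibler divergence, the Bregman divergence
    # generated by phi(x) = sum_i x_i log x_i; requires positive coordinates.
    return np.sum(x * np.log(x / c) - x + c, axis=-1)

def bregman_kmeans(points, k, divergence, max_iter=100, seed=0):
    """Lloyd-style k-means for an arbitrary Bregman divergence.

    The arithmetic mean minimizes the summed divergence of a cluster to
    its center for every Bregman divergence, so the update step is the
    same as in classical k-means; only the assignment step differs.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins its closest center w.r.t. D.
        dists = np.stack([divergence(points, c) for c in centers])
        labels = dists.argmin(axis=0)
        # Update step: each center becomes the mean of its cluster
        # (empty clusters keep their previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # no center moved: a local optimum has been reached
        centers = new_centers
    return centers, labels
```

With `squared_euclidean` this is exactly classical k-means; plugging in `relative_entropy` clusters, for example, term-frequency vectors of web pages, matching the motivation given in the abstract.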