Kasper Overgaard Mortensen, Fatemeh Zardbani, M. A. Haque, S. Agustsson, D. Mottin, Philip Hofmann, Panagiotis Karras
Proc. VLDB Endow., published 2023-03-01
DOI: 10.14778/3587136.3587147 (https://doi.org/10.14778/3587136.3587147)
Citations: 2
Marigold: Efficient k-means Clustering in High Dimensions
How can we efficiently and scalably cluster high-dimensional data? The k-means algorithm clusters data by iteratively reducing intra-cluster Euclidean distances until convergence. While it finds applications from recommendation engines to image segmentation, its application to high-dimensional data is hindered by the need to repeatedly compute Euclidean distances among points and centroids. In this paper, we propose Marigold (k-means for high-dimensional data), a scalable algorithm for k-means clustering in high dimensions. Marigold prunes distance calculations by means of (i) a tight distance-bounding scheme; (ii) a stepwise calculation over a multiresolution transform; and (iii) exploiting the triangle inequality. To our knowledge, such an arsenal of pruning techniques has not been hitherto applied to k-means. Our work is motivated by time-critical Angle-Resolved Photoemission Spectroscopy (ARPES) experiments, where it is vital to detect clusters among high-dimensional spectra in real time. In a thorough experimental study with real-world data sets we demonstrate that Marigold efficiently clusters high-dimensional data, achieving approximately one order of magnitude improvement over prior art.
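To make the idea of pruning distance calculations concrete, here is a minimal sketch of technique (iii) alone, the triangle inequality, applied to the assignment step of k-means. It is not the Marigold implementation and it omits the distance-bounding scheme and the multiresolution transform. The function names (dist, assign_pruned, assign_brute) are hypothetical and chosen for illustration only.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_pruned(points, centroids):
    """Assign each point to its nearest centroid, skipping distance
    computations that the triangle inequality proves unnecessary.

    Returns (labels, n_computed), where n_computed counts the
    point-to-centroid distances that were actually evaluated.
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment pass.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]
    labels, n_computed = [], 0
    for x in points:
        best, best_d = 0, dist(x, centroids[0])
        n_computed += 1
        for j in range(1, k):
            # Triangle inequality:
            #   d(c_best, c_j) <= d(x, c_best) + d(x, c_j).
            # If d(c_best, c_j) >= 2 * d(x, c_best), then
            #   d(x, c_j) >= d(x, c_best),
            # so c_j cannot be strictly closer and can be skipped.
            if cc[best][j] >= 2.0 * best_d:
                continue
            d = dist(x, centroids[j])
            n_computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, n_computed


def assign_brute(points, centroids):
    """Reference assignment that evaluates every point-to-centroid distance."""
    return [min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
            for x in points]
```

On well-separated data, assign_pruned returns the same labels as assign_brute while evaluating fewer point-to-centroid distances. The saving grows with the dimensionality, since each skipped evaluation avoids a pass over all coordinates.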