Sparse Partitioning Around Medoids

Mach. Learn. under Resour. Constraints Vol. 1, Pub Date: 2023-09-05, DOI: 10.1515/9783110785944-005

Sibylle Hess

Citations: 0

Abstract

Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most flexible clustering method, because it can be used with many distances, similarities, and linkage strategies. It is often used when the number of clusters in the data set is unknown and some sort of hierarchy in the data is plausible. Most algorithms for HAC operate on a full distance matrix and therefore require quadratic memory. The standard algorithm also has cubic runtime to produce a full hierarchy. Both memory and runtime are especially problematic on embedded or otherwise very resource-constrained systems. In this section, we present how data aggregation with BETULA, a numerically stable version of the well-known BIRCH data aggregation algorithm, can make HAC viable on systems with constrained resources with only small losses in clustering quality, and hence allow exploratory data analysis of very large data sets.
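
The aggregate-then-cluster idea described above can be illustrated with a minimal sketch. It is not the chapter's implementation, and it rests on several assumptions: each summary stores a count, a mean, and a sum of squared deviations (the kind of numerically stable feature associated with BETULA, as opposed to raw sums of squares), a flat list with a distance `threshold` stands in for the CF-tree, and a naive centroid-linkage loop stands in for a real HAC implementation. The names `Feature`, `aggregate`, and `hac_centroid` are hypothetical.

```python
import math

class Feature:
    """Summary of a group of points: count, mean, summed squared deviations."""
    def __init__(self, point):
        self.n = 1
        self.mean = list(point)
        self.ssd = 0.0  # sum of squared deviations from the mean

    def merge(self, other):
        # Pairwise update: combines two summaries without ever forming
        # large raw sums of squares that would later be subtracted.
        n = self.n + other.n
        delta = [b - a for a, b in zip(self.mean, other.mean)]
        self.ssd += other.ssd + sum(d * d for d in delta) * self.n * other.n / n
        self.mean = [a + d * other.n / n for a, d in zip(self.mean, delta)]
        self.n = n

def aggregate(points, threshold):
    """Flat stand-in for a CF-tree (assumption): absorb each point into the
    nearest feature if it lies within `threshold`, else start a new feature."""
    features = []
    for p in points:
        best = min(features, key=lambda f: math.dist(f.mean, p), default=None)
        if best is not None and math.dist(best.mean, p) <= threshold:
            best.merge(Feature(p))
        else:
            features.append(Feature(p))
    return features

def hac_centroid(features, k):
    """Naive agglomerative clustering over the features (not the raw points)
    using centroid linkage, stopping when k clusters remain."""
    clusters = list(features)
    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: math.dist(clusters[ab[0]].mean,
                                                   clusters[ab[1]].mean))
        clusters[i].merge(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
features = aggregate(points, threshold=0.5)
clusters = hac_centroid(features, k=2)
```

In this sketch the pairwise distance search runs over the features rather than the original points, so its memory and runtime depend on the number of summaries, which the threshold controls, instead of the size of the data set.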
