A subquadratic algorithm for cluster and outlier detection in massive metric data

Proceedings Eighth Symposium on String Processing and Information Retrieval. Pub Date: 1900-01-01. DOI: 10.1109/SPIRE.2001.10018

Edgar Chávez



Abstract

The problem of cluster and outlier detection is a classic problem of non-parametric statistics. In recent times, the need for cluster analysis in massive multimedia data sets (terabytes of data sampled from a metric space) has demonstrated the need for solutions that can cluster metric data automatically and at reasonable speed. Since cluster properties involve the relationship between each pair of data set elements, a good clustering algorithm must examine (in principle) every distance pair and hence has quadratic complexity. An appealing trend to achieve subquadratic complexity is either a) to use an approximation of a classic clustering algorithm or b) to design a new algorithm for clustering. This paper presents a new clustering algorithm performing O(n^(1+α)) distance computations (the operation of leading complexity), with 0 ≤ α ≤ 1 a constant depending on the intrinsic dimension of the sample data. The algorithm can detect outliers in the sample data and, if desired, it can produce a hierarchical structure (a dendrogram) pointing to clusters at different resolutions.
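To make the cost argument in the abstract concrete, the sketch below counts distance computations for the brute-force baseline that examines every pair, and evaluates n^(1+α) for a chosen α. It is not the paper's algorithm, which the abstract does not describe. The names `CountingMetric`, `all_pairs_distances`, and `subquadratic_budget`, the one-dimensional sample points, and the value α = 0.5 are all assumptions made for illustration only.

```python
from itertools import combinations
from typing import Any, Callable, Dict, Sequence, Tuple


class CountingMetric:
    """Wraps a distance function and counts how many times it is evaluated."""

    def __init__(self, dist: Callable[[Any, Any], float]) -> None:
        self._dist = dist
        self.calls = 0

    def __call__(self, a: Any, b: Any) -> float:
        self.calls += 1
        return self._dist(a, b)


def all_pairs_distances(
    points: Sequence[Any], metric: CountingMetric
) -> Dict[Tuple[int, int], float]:
    """Brute-force baseline: evaluate the metric on every unordered pair.

    This costs n * (n - 1) / 2 distance computations, i.e. quadratic in n.
    """
    return {
        (i, j): metric(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def subquadratic_budget(n: int, alpha: float) -> float:
    """Value of n^(1 + alpha), ignoring the constant hidden by the O-notation."""
    return n ** (1.0 + alpha)


if __name__ == "__main__":
    n = 1000
    # Hypothetical data: points on a line with the absolute-difference metric.
    points = [float(x) for x in range(n)]
    metric = CountingMetric(lambda a, b: abs(a - b))
    all_pairs_distances(points, metric)
    print("brute-force distance computations:", metric.calls)
    # alpha = 0.5 is an arbitrary illustrative value, not a figure from the paper.
    print("n^(1+alpha) with alpha = 0.5:", round(subquadratic_budget(n, 0.5)))
```

Counting calls to the metric, rather than measuring wall-clock time, mirrors the abstract's choice of distance computations as the operation of leading complexity. At α = 1 the bound n^(1+α) coincides with the quadratic baseline, so any saving comes from data whose intrinsic dimension yields a smaller α.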

