A subquadratic algorithm for cluster and outlier detection in massive metric data
Edgar Chávez
Proceedings Eighth Symposium on String Processing and Information Retrieval
DOI: 10.1109/SPIRE.2001.10018 (https://doi.org/10.1109/SPIRE.2001.10018)
Citations: 0
Abstract
Cluster and outlier detection is a classic problem of non-parametric statistics. In recent times, the need for cluster analysis in massive multimedia data sets (terabytes of data sampled from a metric space) has created demand for solutions that cluster metric data automatically and at reasonable speed. Since cluster properties involve the relationship between each pair of data set elements, a good clustering algorithm must (in principle) examine every pairwise distance and hence has quadratic complexity. An appealing trend for achieving subquadratic complexity is either (a) to use an approximation of a classic clustering algorithm or (b) to design a new clustering algorithm. This paper presents a new clustering algorithm performing O(n^{1+α}) distance computations (the operation of leading complexity), where 0 ≤ α ≤ 1 is a constant depending on the intrinsic dimension of the sample data. The algorithm can detect outliers in the sample data and, if desired, can produce a hierarchical structure (a dendrogram) pointing to clusters at different resolutions.
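To make the cost argument concrete, the following minimal sketch counts distance evaluations for the quadratic all-pairs baseline and compares that count with the n^{1+α} bound quoted in the abstract. It does not implement the paper's algorithm, which the abstract does not describe. The names `CountingMetric`, `all_pairs_distances`, and `subquadratic_budget` are hypothetical, the Euclidean metric and the sample points are assumptions chosen for illustration, and constant factors in the bound are ignored.

```python
import math
from itertools import combinations


class CountingMetric:
    """Wraps a distance function and counts how many times it is called."""

    def __init__(self, dist):
        self.dist = dist
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.dist(a, b)


def all_pairs_distances(points, metric):
    """Quadratic baseline: evaluate the metric once per unordered pair."""
    return {
        (i, j): metric(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def subquadratic_budget(n, alpha):
    """n^(1+alpha), the bound from the abstract with constants ignored."""
    return n ** (1 + alpha)


# Assumed toy data: 200 points in the plane under the Euclidean metric.
points = [(float(i), float(i % 7)) for i in range(200)]
metric = CountingMetric(math.dist)
all_pairs_distances(points, metric)
baseline_calls = metric.calls  # n * (n - 1) / 2 evaluations

for alpha in (0.0, 0.5, 1.0):
    budget = subquadratic_budget(len(points), alpha)
    print(f"alpha={alpha}: n^(1+alpha) = {budget:.0f}, all pairs = {baseline_calls}")
```

At the extremes, α = 0 gives a budget linear in n and α = 1 recovers the quadratic case, so the saving over the all-pairs baseline depends entirely on how small α is for the data at hand.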