Title: A Criterion for Deciding the Number of Clusters in a Dataset Based on Data Depth
Authors: Ishwar Baidari, Channamma Patil
Journal: Vietnam. J. Comput. Sci.
Published: 2020-11-01
DOI: https://doi.org/10.1142/s2196888820500232
Citations: 2
Abstract
Clustering is a key method in unsupervised learning, with applications in data mining, pattern recognition, and intelligent information processing. However, the number of groups to be formed, usually denoted k, is a vital parameter for most existing clustering algorithms, because their results depend heavily on it. Finding the optimal value of k is very challenging. This paper proposes a novel idea, based on data depth, for finding the correct number of groups in a dataset. The idea is to avoid the traditional process of running a clustering algorithm over the dataset many times, and to find the value of k without setting any specific search range for it. We compare the proposed method with the CH, KL, Silhouette, Gap, and CSP indices on real and synthetic datasets to estimate the correct number of groups. The experimental results indicate good performance of the proposed method.
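For readers unfamiliar with data depth: a depth function scores how central a point is within a sample, with larger values nearer the center. The abstract does not say which depth function the paper uses, so the sketch below uses Mahalanobis depth purely as an assumed illustration. The function name `mahalanobis_depth`, the restriction to 2-D points, and the toy data are all hypothetical; this is not the paper's criterion for choosing the number of clusters.

```python
from statistics import mean


def mahalanobis_depth(point, data):
    """Mahalanobis depth of a 2-D point with respect to a 2-D sample.

    depth = 1 / (1 + d^2), where d^2 is the squared Mahalanobis distance
    from the point to the sample mean. Values lie in (0, 1]; the sample
    mean itself has depth 1.
    """
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    mx, my = mean(xs), mean(ys)
    n = len(data)

    # Entries of the sample covariance matrix (denominator n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")

    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form (dx, dy) S^{-1} (dx, dy)^T via the closed-form 2x2 inverse.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return 1.0 / (1.0 + d2)


# Toy sample (assumed data): four corners of a square plus its center.
sample = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
deepest = max(sample, key=lambda p: mahalanobis_depth(p, sample))
```

Here `deepest` is the center of the square, while the corner points receive a lower depth. Ranking points by centrality in this way is the general notion the abstract refers to.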