The Exploitation of Distance Distributions for Clustering

Int. J. Comput. Intell. Appl. Pub Date : 2021-08-12 DOI:10.1142/S1469026821500164

Michael C. Thrun


Citations: 5

Abstract

Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited in the sense that prior knowledge is used. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods based on error probabilities, implicitly setting the goal of reproducing predefined partitions in data. Such studies use clusters of data that are often based on the context of the data as well as the custom goal of the specific study. Depending on the data context, different properties of distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding similar partitions of data, then the intrapartition distances should be smaller than the interpartition distances. By systematically investigating this specification using distribution analysis through the mirrored-density plot (MD plot), it is shown that multimodal distance distributions are preferable in cluster analysis. As a consequence, it is advantageous to model distance distributions with Gaussian mixtures prior to the evaluation phase of unsupervised methods. Experiments are performed on several artificial and natural datasets for the task of clustering.
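The requirement that intrapartition distances be smaller than interpartition distances can be illustrated with a minimal sketch. It is not the paper's procedure (it uses neither the MD plot nor a Gaussian mixture); it only splits the pairwise Euclidean distances of a small, hypothetical, labeled toy dataset into the two groups and compares them. The function name `split_distances`, the data points, and the labels are all assumptions made for illustration.

```python
import math
from itertools import combinations
from statistics import mean


def split_distances(points, labels):
    """Split all pairwise Euclidean distances into intra- and interpartition lists."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        # Same label: the pair lies inside one partition; otherwise it spans two.
        (intra if lp == lq else inter).append(d)
    return intra, inter


# Hypothetical toy data: two well-separated groups in 2D (not from the paper).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = split_distances(points, labels)
print(f"mean intrapartition distance: {mean(intra):.2f}")
print(f"mean interpartition distance: {mean(inter):.2f}")
print("all intra < all inter:", max(intra) < min(inter))
```

When the two groups of distances separate this cleanly, the pooled distance distribution has more than one mode, which is the property the abstract describes as preferable. In a real unsupervised setting the labels are unknown, so this check is only available on data with a known partition.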

