Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

IF 3.4 | Zone 2, Computer Science | JCR Q2, Computer Science, Artificial Intelligence | Evolutionary Computation | Pub Date: 2020-12-02 | DOI: 10.1162/evco_a_00264

Andrew Lensen;Bing Xue;Mengjie Zhang

Volume 28, issue 4, pages 531-561

Citations: 12

Abstract

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.
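
The pipeline described in the abstract (a tree that selects a few features and combines them into a similarity function, which then drives a graph-based clustering) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration only: the tuple-based tree encoding, the `evaluate` and `cluster` helpers, the small function set, and the rule that links each instance to its single most similar neighbour are assumptions, not the authors' implementation. The tree is also written by hand rather than evolved, so no genetic programming search is shown.

```python
# Hypothetical tree encoding (an assumption, not taken from the paper):
#   ("f", i)          terminal: negated absolute difference on feature i
#   (op, left, right) internal node combining two subtrees
OPS = {
    "add": lambda a, b: a + b,
    "min": min,
    "max": max,
}


def evaluate(tree, x, y):
    """Score a pair of instances with a similarity tree (higher = more similar)."""
    if tree[0] == "f":
        return -abs(x[tree[1]] - y[tree[1]])
    op, left, right = tree
    return OPS[op](evaluate(left, x, y), evaluate(right, x, y))


def cluster(data, tree):
    """Link each instance to its most similar neighbour; clusters are the
    connected components of the resulting graph (tracked with union-find)."""
    n = len(data)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        best = max(
            (j for j in range(n) if j != i),
            key=lambda j: evaluate(tree, data[i], data[j]),
        )
        parent[find(i)] = find(best)  # add edge i -> best

    # Relabel component roots as consecutive cluster ids.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Toy data: features 0 and 2 separate two groups; features 1 and 3 are noise.
data = [
    (0.0, 9.0, 0.1, 1.0),
    (0.2, 1.0, 0.0, 8.0),
    (0.1, 5.0, 0.2, 4.0),
    (5.0, 8.8, 5.1, 1.1),
    (5.2, 1.2, 5.0, 7.9),
    (5.1, 5.1, 5.2, 4.1),
]

# Hand-written tree that uses only features 0 and 2 (feature selection)
# and sums their contributions (feature construction).
tree = ("add", ("f", 0), ("f", 2))

print(cluster(data, tree))  # [0, 0, 0, 1, 1, 1]
```

In this toy setting the tree ignores the two noisy features entirely, which is the kind of dataset-specific tailoring the abstract attributes to evolved similarity functions. A search procedure would generate and score many such trees instead of fixing one by hand.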

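Source journal

Evolutionary Computation (Engineering and Technology: Computer Science, Theory and Methods)

CiteScore: 6.40

Self-citation rate: 1.50%

Articles published:

Review time: 3 months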

Journal description: Evolutionary Computation is a leading journal in its field. It provides an international forum for facilitating and enhancing the exchange of information among researchers involved in both the theoretical and practical aspects of computational systems drawing their inspiration from nature, with particular emphasis on evolutionary models of computation such as genetic algorithms, evolutionary strategies, classifier systems, evolutionary programming, and genetic programming. It welcomes articles from related fields such as swarm intelligence (e.g., Ant Colony Optimization and Particle Swarm Optimization) and other nature-inspired computation paradigms (e.g., Artificial Immune Systems). As well as publishing articles describing theoretical and/or experimental work, the journal also welcomes application-focused papers describing breakthrough results in an application domain, and methodological papers where the specificities of the real-world problem led to significant algorithmic improvements that could possibly be generalized to other areas.