Hierarchical Clustering using Reversible Binary Cellular Automata for High-Dimensional Data

arXiv - CS - Formal Languages and Automata Theory | Pub Date: 2024-08-05 | DOI: arxiv-2408.02250

Baby C. J., Kamalika Bhattacharjee



Abstract

This work proposes a hierarchical clustering algorithm for high-dimensional datasets that uses the cyclic space of reversible finite cellular automata. In cellular automaton (CA) based clustering, two objects that belong to the same cycle are closely related and are treated as part of the same cluster. However, if a high-dimensional dataset is clustered using the cycles of a single CA, closely related objects may fall into different cycles. This paper identifies the relationship between objects in two different cycles based on the median of all elements in each cycle, so that they can be grouped in the next stage. Further, to minimize the number of intermediate clusters, which in turn reduces the computational cost, a rule selection strategy is used to find the best rules based on information propagation and cycle structure. The dataset is first encoded using frequency-based encoding such that consecutive data elements maintain a minimum Hamming distance in encoded form; the proposed clustering algorithm then iterates over three stages to cluster the data elements into the number of clusters specified by the user. The algorithm can be applied in various fields, including healthcare, sports, chemical research, and agriculture. When verified on standard benchmark datasets with various performance metrics, the algorithm is on par with existing algorithms, with quadratic time complexity.
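The core idea of cycle-based clustering can be illustrated with a small sketch. This is not the authors' implementation: it assumes elementary rule 150 on a 4-cell lattice with periodic boundary (reversible when the lattice size is not a multiple of 3), assumes data items are already encoded as integers, and replaces the paper's three-stage procedure and rule selection with a simple merge of the two clusters whose medians are closest. The names `step`, `cycles`, and `cluster` are hypothetical.

```python
from statistics import median


def step(state: int, n: int, rule: int = 150) -> int:
    """One synchronous update of an elementary CA on n cells (periodic boundary)."""
    nxt = 0
    for i in range(n):
        left = (state >> ((i + 1) % n)) & 1
        centre = (state >> i) & 1
        right = (state >> ((i - 1) % n)) & 1
        # Look up the new cell value in the 8-bit rule table.
        if (rule >> (left << 2 | centre << 1 | right)) & 1:
            nxt |= 1 << i
    return nxt


def cycles(n: int, rule: int = 150) -> list[list[int]]:
    """Partition all 2**n configurations into the cycles of a reversible CA."""
    size = 1 << n
    # A reversible CA maps the configuration space onto itself bijectively.
    if len({step(s, n, rule) for s in range(size)}) != size:
        raise ValueError("rule is not reversible for this lattice size")
    seen, result = set(), []
    for start in range(size):
        if start in seen:
            continue
        cycle, s = [], start
        while s not in seen:
            seen.add(s)
            cycle.append(s)
            s = step(s, n, rule)
        result.append(cycle)
    return result


def cluster(items: list[int], n: int, k: int, rule: int = 150) -> list[list[int]]:
    """Group items by cycle, then merge by closest medians until k clusters remain."""
    cycle_of = {s: idx for idx, cyc in enumerate(cycles(n, rule)) for s in cyc}
    groups: dict[int, list[int]] = {}
    for x in items:
        groups.setdefault(cycle_of[x], []).append(x)
    clusters = list(groups.values())
    # Assumption: the median is taken over the items currently in each cluster.
    while len(clusters) > k:
        clusters.sort(key=median)
        gaps = [median(clusters[i + 1]) - median(clusters[i])
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


if __name__ == "__main__":
    print(cluster([1, 11, 2, 7, 8, 13], n=4, k=2))
```

Items that share a cycle start in the same cluster; the merge loop stands in for the later stages, where clusters from different cycles are related through their medians. A real dataset would first need the frequency-based encoding described in the abstract, which this sketch does not attempt.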

