Hierarchical Clustering using Reversible Binary Cellular Automata for High-Dimensional Data
Authors: Baby C. J., Kamalika Bhattacharjee
Venue: arXiv - CS - Formal Languages and Automata Theory
Publication date: 2024-08-05
DOI: arxiv-2408.02250 (https://doi.org/arxiv-2408.02250)
Abstract
This work proposes a hierarchical clustering algorithm for high-dimensional
datasets using the cyclic space of reversible finite cellular automata. In
cellular automaton (CA) based clustering, if two objects belong to the same
cycle, they are closely related and considered as part of the same cluster.
However, if a high-dimensional dataset is clustered using the cycles of one CA,
closely related objects may belong to different cycles. This paper identifies
the relationship between objects in two different cycles based on the median of
all elements in each cycle so that they can be grouped in the next stage.
Further, to minimize the number of intermediate clusters, which in turn reduces
the computational cost, a rule selection strategy is used to find the best
rules based on information propagation and cycle structure. After encoding the
dataset using frequency-based encoding such that consecutive data elements
maintain a minimum Hamming distance in encoded form, our proposed clustering
algorithm iterates over three stages to finally cluster the data elements into
the number of clusters requested by the user. This algorithm can be applied to
various fields, including healthcare, sports, chemical research, agriculture,
etc. When evaluated on standard benchmark datasets with various performance
metrics, our algorithm is on par with the existing algorithms with quadratic
time complexity.
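
The core idea in the abstract, that objects lying on the same cycle of a reversible CA are placed in the same cluster, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the choice of elementary rule 150, the periodic boundary, the 4-cell lattice, and the function names `step`, `is_reversible`, `cycles`, and `cluster_by_cycle`. The sketch enumerates all 2^n states, so it is only feasible for very small n, and it does not implement the frequency-based encoding, the rule selection strategy, or the median-based merging of cycles.

```python
from itertools import product


def step(state, rule, periodic=True):
    """One synchronous update of a 3-neighbour binary CA (Wolfram rule number)."""
    n = len(state)
    nxt = []
    for i in range(n):
        # Cells outside the lattice read as 0 when the boundary is not periodic.
        left = state[(i - 1) % n] if periodic or i > 0 else 0
        right = state[(i + 1) % n] if periodic or i < n - 1 else 0
        idx = (left << 2) | (state[i] << 1) | right
        nxt.append((rule >> idx) & 1)
    return tuple(nxt)


def is_reversible(rule, n, periodic=True):
    """Brute-force check that the global map on n cells is a bijection."""
    states = list(product((0, 1), repeat=n))
    return len({step(s, rule, periodic) for s in states}) == len(states)


def cycles(rule, n, periodic=True):
    """Partition all 2**n states into the cycles of a reversible CA."""
    if not is_reversible(rule, n, periodic):
        raise ValueError("CA is not reversible for this lattice size")
    seen, result = set(), []
    for start in product((0, 1), repeat=n):
        if start in seen:
            continue
        cyc, cur = [], start
        # In a reversible CA every state lies on exactly one cycle,
        # so the walk returns to `start` without meeting other seen states.
        while cur not in seen:
            seen.add(cur)
            cyc.append(cur)
            cur = step(cur, rule, periodic)
        result.append(cyc)
    return result


def cluster_by_cycle(encoded, rule, periodic=True):
    """Label each encoded object with the index of the cycle that contains it."""
    n = len(encoded[0])
    label_of = {}
    for label, cyc in enumerate(cycles(rule, n, periodic)):
        for s in cyc:
            label_of[s] = label
    return [label_of[tuple(x)] for x in encoded]


# Hypothetical encoded objects (4 bits each), chosen only for illustration.
data = [(1, 0, 0, 0), (1, 1, 0, 1), (1, 0, 1, 0)]
labels = cluster_by_cycle(data, rule=150)
```

In this toy setting the first two objects lie on the same cycle and so share a label, while the third lies on a different cycle. With only one CA, related objects can still end up on different cycles, which is the limitation the abstract says the later stages address.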