CGRclust: Chaos Game Representation for Twin Contrastive Clustering of Unlabelled DNA Sequences
Fatemeh Alipour, Kathleen A. Hill, Lila Kari
arXiv - QuanBio - Genomics, 2024-07-01
DOI: arxiv-2407.02538 (https://doi.org/arxiv-2407.02538)
Citation count: 0
Abstract
This study proposes CGRclust, a novel combination of unsupervised twin
contrastive clustering of Chaos Game Representations (CGR) of DNA sequences,
with convolutional neural networks (CNNs). To the best of our knowledge,
CGRclust is the first method to use unsupervised learning for image
classification (herein applied to two-dimensional CGR images) for clustering
datasets of DNA sequences. CGRclust overcomes the limitations of traditional
sequence classification methods by leveraging unsupervised twin contrastive
learning to detect distinctive sequence patterns, without requiring DNA
sequence alignment or biological/taxonomic labels. CGRclust accurately
clustered twenty-five diverse datasets, with sequence lengths ranging from 664
bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as
well as viral whole genome assemblies and synthetic DNA sequences. Compared
with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and
MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy
across all four taxonomic levels tested for mitochondrial DNA genomes of fish.
Moreover, CGRclust consistently demonstrates superior performance across
all the viral genomic datasets. The high clustering accuracy of CGRclust on
these twenty-five datasets, which vary significantly in terms of sequence
length, number of genomes, number of clusters, and level of taxonomy,
demonstrates its robustness, scalability, and versatility.
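
For readers unfamiliar with Chaos Game Representation, the following is a minimal sketch of how a DNA sequence can be turned into a two-dimensional CGR image of the kind the abstract refers to. It is not the authors' implementation. The corner assignment (A, C, G, T), the resolution parameter `k`, the handling of ambiguous bases, and the function name `cgr_image` are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

# Assumed corner assignment on the unit square (one common convention;
# the abstract does not state which one CGRclust uses).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_image(seq, k=6):
    """Return a (2**k, 2**k) array of counts from the chaos game on `seq`.

    `k` is a hypothetical resolution parameter: each cell of the grid
    corresponds to one k-mer, so the array is a k-mer frequency image.
    """
    size = 2 ** k
    img = np.zeros((size, size), dtype=np.int64)
    x, y = 0.5, 0.5  # start at the centre of the square
    run = 0          # number of consecutive valid bases seen so far

    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Ambiguous base (e.g. N): restart the walk (an assumption).
            x, y = 0.5, 0.5
            run = 0
            continue
        # Move halfway from the current point toward the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        run += 1
        # Only count points whose cell is determined by a full k-mer.
        if run >= k:
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            img[row, col] += 1
    return img
```

An image produced this way can be normalised and passed to a CNN like any other single-channel image. Because the grid size depends only on `k`, sequences of very different lengths map to arrays of the same shape.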