Clustering DNA Sequences of Aspergillus Fumigatus Using Incremental Multiple Medoids
Authors: T. Ajayan, P. Sony, Janu R. Panicker, S. Shailesh
Venue: 2015 Fifth International Conference on Advances in Computing and Communications (ICACC)
Published: 2015-09-01
DOI: 10.1109/ICACC.2015.19 (https://doi.org/10.1109/ICACC.2015.19)
Citations: 1
Abstract
Clustering DNA sequences of Aspergillus fumigatus groups a set of sequences into clusters such that the similarity among sequences in the same cluster is high, while the similarity among sequences in different clusters is low. The main objective of this work is to obtain a more refined clustering technique in order to analyze biological data and to group DNA sequences into clusters more easily. CDHIT and DNACLUST are two existing approaches used in bioinformatics for clustering sequences. The major disadvantage of both approaches is that the longest sequence is selected as the cluster representative. Because DNA sequences are enormous in number, traditional clustering algorithms are infeasible for analysis. To handle such large collections of DNA sequences, a modified version of incremental clustering using multiple medoids is proposed. The key idea is to find multiple representative sequences (medoids) for each cluster in a chunk; the final DNA analysis is then carried out on the medoids identified from all the chunks. The main advantage of this incremental clustering is that it uses multiple medoids to represent each cluster in each chunk, which captures the pattern structure more accurately. It not only overcomes the disadvantages of existing techniques but also provides a mechanism to use the relationships among the identified medoids as side information to help the final DNA sequence clustering. The proposed incremental approach outperforms existing clustering approaches in terms of clustering accuracy.
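
The chunk-then-merge idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: the k-mer based `distance` function, the farthest-first initialisation, the plain alternating k-medoids loop, and the parameter names `k`, `chunk_size`, and `m`. The abstract does not specify the similarity measure or the clustering routine, and this sketch leaves out the side information derived from medoid relationships.

```python
def kmer_counts(seq, k=3):
    """Count the overlapping k-mers of a sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def distance(a, b, k=3):
    """Assumed dissimilarity: 1 minus the fraction of shared k-mers."""
    pa, pb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(n, pb[kmer]) for kmer, n in pa.items() if kmer in pb)
    total = max(sum(pa.values()), sum(pb.values()))
    return 1.0 - shared / total if total else 0.0


def init_medoids(seqs, k):
    """Farthest-first initialisation (deterministic; an assumption)."""
    medoids = [seqs[0]]
    while len(medoids) < min(k, len(seqs)):
        medoids.append(
            max(seqs, key=lambda s: min(distance(s, m) for m in medoids)))
    return medoids


def k_medoids(seqs, k, iters=20):
    """Plain alternating k-medoids; returns the non-empty clusters."""
    if not seqs:
        return []
    medoids = init_medoids(seqs, k)
    clusters = []
    for _ in range(iters):
        # Assignment step: each sequence joins its nearest medoid.
        clusters = [[] for _ in medoids]
        for s in seqs:
            j = min(range(len(medoids)), key=lambda i: distance(s, medoids[i]))
            clusters[j].append(s)
        # Update step: the member with the smallest total distance wins.
        updated = [
            min(c, key=lambda x: sum(distance(x, s) for s in c)) if c else medoids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == medoids:
            break
        medoids = updated
    return [c for c in clusters if c]


def top_medoids(cluster, m):
    """Keep the m most central members of a cluster as its representatives."""
    ranked = sorted(cluster, key=lambda x: sum(distance(x, s) for s in cluster))
    return ranked[:m]


def incremental_cluster(seqs, k, chunk_size, m=2):
    """Cluster chunk by chunk, then cluster the collected medoids."""
    representatives = []
    for start in range(0, len(seqs), chunk_size):
        chunk = seqs[start:start + chunk_size]
        for cluster in k_medoids(chunk, k):
            representatives.extend(top_medoids(cluster, m))
    # Final pass works only on the representatives, not on all sequences.
    final_medoids = [top_medoids(c, 1)[0] for c in k_medoids(representatives, k)]
    labels = [
        min(range(len(final_medoids)), key=lambda j: distance(s, final_medoids[j]))
        for s in seqs
    ]
    return labels, final_medoids


# Toy usage with made-up sequences (not Aspergillus fumigatus data).
toy = ["ATATATATAT", "GCGCGCGCGC", "ATATATTTAT", "GCGCGGCGCG",
       "TATATATATA", "CGCGCGCGCG", "ATATAATATA", "GCGCCGCGCG"]
labels, medoids = incremental_cluster(toy, k=2, chunk_size=4, m=2)
print(labels, medoids)
```

Because only `m` representatives per cluster survive each chunk, the final pass sees at most `k * m` sequences per chunk, which is what keeps this kind of scheme feasible when the full set is too large to cluster at once.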