A protein secondary structure-based algorithm for partitioning large protein alignments
Thu Kim Le, L. Vinh
2022 14th International Conference on Knowledge and Systems Engineering (KSE), 2022-10-19
DOI: 10.1109/KSE56063.2022.9953767 (https://doi.org/10.1109/KSE56063.2022.9953767)
Abstract
The evolutionary process of characters (e.g., nucleotides or amino acids) is heterogeneous among the sites of an alignment. Applying the same evolutionary model to all sites leads to unreliable results in evolutionary studies. Partitioning an alignment into sub-alignments (groups) such that the sites in each sub-alignment follow the same model of evolution is a promising way to handle this heterogeneity. A number of computational methods have been proposed to partition alignments; however, they are unable to handle invariant sites properly. The iterative k-means algorithm has been widely used to partition large alignments but was recently suspended because it always places all invariant sites in a single group, which might distort phylogenetic trees reconstructed from the sub-alignments. In this paper, we improve the iterative k-means algorithm for protein alignments by combining amino acids with their secondary structures to partition invariant sites properly. The protein secondary structure information helps classify invariant sites into different groups, each of which includes both variant and invariant sites. Experiments on real large protein alignments showed that the new algorithm overcomes the pitfall of grouping all invariant sites into one group and consequently produces better partitioning schemes.
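To make the core idea concrete, here is a minimal Python sketch of how secondary structure labels could separate invariant sites that would otherwise be indistinguishable from their amino acids alone. It is not the authors' algorithm: the abstract does not give implementation details, and the function names, the gap character "-", and the one-letter structure labels (H, E, C) are all assumptions made for illustration. The sketch only splits invariant sites by label; it does not perform the k-means step or merge invariant sites with variant ones.

```python
from collections import defaultdict


def invariant_sites(alignment):
    """Return indices of columns whose non-gap residues are all identical.

    `alignment` is a list of equal-length strings (hypothetical input format).
    """
    if not alignment:
        return []
    sites = []
    for i in range(len(alignment[0])):
        # Collect the distinct residues in column i, ignoring gaps.
        residues = {seq[i] for seq in alignment if seq[i] != "-"}
        if len(residues) == 1:
            sites.append(i)
    return sites


def group_invariant_by_structure(alignment, structure):
    """Group invariant site indices by their secondary structure label.

    `structure` is a string with one label per alignment column,
    e.g. "H" (helix), "E" (strand), "C" (coil). This labeling scheme
    is an assumption, not something specified in the abstract.
    """
    if any(len(seq) != len(structure) for seq in alignment):
        raise ValueError("structure must have one label per alignment column")
    groups = defaultdict(list)
    for i in invariant_sites(alignment):
        groups[structure[i]].append(i)
    return dict(groups)


if __name__ == "__main__":
    toy_alignment = ["ACDE", "ACDF", "ACEE"]  # columns 0 and 1 are invariant
    toy_structure = "HEEC"                    # hypothetical per-site labels
    print(group_invariant_by_structure(toy_alignment, toy_structure))
```

In this toy input, the two invariant columns carry different structure labels, so they end up in different groups instead of one shared group of invariant sites.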