J. Pal, Shinjini Ghosh, B. Maji, D. K. Bhattacharya
2021 International Conference on Technological Advancements and Innovations (ICTAI), published 2021-11-10. DOI: https://doi.org/10.1109/ICTAI53825.2021.9673245
A Unique Approach for Comparison of Protein Sequence Using PCA Analysis
Physicochemical properties of amino acids play a significant role in protein sequence comparison. In the literature, arbitrary combinations of these properties have been used for this purpose. In the present paper, protein sequences are compared using only five known physical properties of the amino acids. Principal component analysis (PCA) is applied to the numerical values of these physical properties for the twenty amino acids to reduce their dimension. As a result, 20 TP values corresponding to the amino acids are obtained. Protein sequences are represented using these 20 TP values. Cumulative sums of the represented sequences are then taken to obtain a non-degenerate representation of each protein sequence. A new descriptor is then obtained using a generalized form of three moment vectors consisting of first-, second-, and third-order moments. Distance matrices are computed using the Euclidean distance as the distance measure. Finally, phylogenetic trees are constructed from these distance matrices using the UPGMA algorithm. The proposed method is applied to 9 ND4, 9 ND6, 16 ND5, 12 Baculovirus, and 24 TF protein sequences. The results obtained by this new method are on par with the biological reference and comparable with results obtained earlier on the same species by other methods.
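The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the property table is filled with seeded random placeholder numbers rather than real physical property values, each amino acid is assumed to receive one TP value equal to its score on the first principal component, and the moment normalization in `descriptor` is hypothetical. The sequence names and strings are toy examples.

```python
import math
import random
from itertools import accumulate

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# PLACEHOLDER data: 20 x 5 seeded random numbers standing in for the five
# physical properties. These are NOT real property values.
rng = random.Random(0)
PROPERTIES = [[rng.uniform(0.0, 1.0) for _ in range(5)] for _ in AMINO_ACIDS]


def first_pc_scores(rows, iters=500):
    """Score each row on the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    cov = [[sum(zi[a] * zi[b] for zi in z) / n for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(zi[j] * v[j] for j in range(d)) for zi in z]


def descriptor(seq, tp):
    """Moments of order 1, 2, 3 of the cumulative sums (assumed normalization)."""
    cums = list(accumulate(tp[a] for a in seq))
    n = len(cums)
    return [sum((c / n) ** k for c in cums) / n for k in (1, 2, 3)]


def upgma(names, dist):
    """Average-linkage (UPGMA) clustering; returns the tree as nested tuples."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average of the distances to the merged clusters.
            d[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        del d[(a, b)]
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Assumption: one TP value per amino acid, taken from the first component.
tp = dict(zip(AMINO_ACIDS, first_pc_scores(PROPERTIES)))

# Toy sequences (hypothetical); seqA and seqB are deliberately identical.
sequences = {"seqA": "MKTAYIAKQR", "seqB": "MKTAYIAKQR", "seqC": "GGSWWPLLNVEE"}
names = list(sequences)
desc = [descriptor(sequences[name], tp) for name in names]
dist = [[math.dist(p, q) for q in desc] for p in desc]
tree = upgma(names, dist)
print(tree)
```

Because seqA and seqB are identical, their distance is zero and UPGMA joins them first; seqC attaches last. The sign of a principal component is arbitrary, but flipping it negates the first and third moments of every sequence together, so the Euclidean distances are unaffected.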