Nearest Neighbor CCP-Based Molecular Sequence Analysis
Sarwan Ali, Prakash Chourasia, Bipin Koirala, Murray Patterson
arXiv - QuanBio - Genomics, 2024-09-07
https://doi.org/arxiv-2409.04922

Molecular sequence analysis is crucial for comprehending several biological
processes, including protein-protein interactions, functional annotation, and
disease classification. The large number of sequences and the inherently
complicated nature of protein structures make it challenging to analyze such
data. Finding patterns and enhancing subsequent research requires the use of
dimensionality reduction and feature selection approaches. Recently, Correlated
Clustering and Projection (CCP) was proposed as an effective method for
biological sequencing data. Although CCP is effective for sequence
visualization, it remains costly to compute.
Furthermore, its utility for classifying molecular sequences is still
uncertain. To solve these two problems, we present a Nearest Neighbor
Correlated Clustering and Projection (CCP-NN)-based technique for efficiently
preprocessing molecular sequence data. To group related molecular sequences and
produce representative supersequences, CCP makes use of sequence-to-sequence
correlations. Unlike conventional methods, CCP does not rely on matrix
diagonalization, so it can be applied to a range of machine-learning
problems. We estimate the density map and compute the correlation using a
nearest-neighbor search technique. We performed molecular sequence
classification using CCP and CCP-NN representations to assess the efficacy of
our proposed approach. Our findings show that CCP-NN considerably improves
classification accuracy and significantly outperforms CCP in terms
of computational runtime.
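
The abstract does not spell out the algorithm, so the following is only a rough Python sketch of the general idea, written under stated assumptions: sequences are first turned into k-mer count vectors, features are grouped by correlation, and each group is collapsed into one "super-feature" per sequence using a density estimate computed from nearest neighbors. The function names (`kmer_counts`, `group_features`, `knn_density`, `ccp_nn_sketch`), the k-mer embedding, the greedy grouping rule, the Gaussian kernel, and the brute-force neighbor search are all illustrative choices, not the authors' implementation.

```python
import math
from itertools import product


def kmer_counts(seq, k=2, alphabet="ACGT"):
    """Count k-mers in a fixed order (assumed embedding, not from the paper)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0.0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing unknown characters
            vec[j] += 1.0
    return vec


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def group_features(X, n_groups):
    """Greedily partition feature indices into groups of correlated features."""
    cols = [list(c) for c in zip(*X)]
    n_groups = min(n_groups, len(cols))
    seeds = [0]
    while len(seeds) < n_groups:
        # Next seed: the feature least correlated with the seeds chosen so far.
        rest = [j for j in range(len(cols)) if j not in seeds]
        seeds.append(min(rest, key=lambda j: max(
            abs(pearson(cols[j], cols[s])) for s in seeds)))
    groups = [[s] for s in seeds]
    for j in range(len(cols)):
        if j not in seeds:
            best = max(range(len(seeds)),
                       key=lambda g: abs(pearson(cols[j], cols[seeds[g]])))
            groups[best].append(j)
    return groups


def knn_density(points, k):
    """Sum a Gaussian kernel over each point's k nearest neighbors (brute force)."""
    neigh = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        neigh.append(dists[:k])
    kth = [d[-1] for d in neigh if d]
    # Global kernel scale: mean distance to the k-th neighbor (assumed choice).
    eta = sum(kth) / len(kth) if kth and sum(kth) > 0 else 1.0
    return [sum(math.exp(-(d / eta) ** 2) for d in ds) for ds in neigh]


def ccp_nn_sketch(X, n_groups, k):
    """Map each sample to one density-based super-feature per feature group."""
    groups = group_features(X, n_groups)
    columns = []
    for g in groups:
        sub = [[row[j] for j in g] for row in X]  # samples restricted to group g
        columns.append(knn_density(sub, k))
    return [list(row) for row in zip(*columns)]


# Toy data: made-up sequences, used only to exercise the functions.
seqs = ["ACGTACGTAC", "ACGTACGTTT", "TTTTGGGGCC", "TTTTGGGGCA",
        "GGCCGGCCAA", "GGCCGGCCAT"]
X = [kmer_counts(s) for s in seqs]
Z = ccp_nn_sketch(X, n_groups=3, k=3)
for s, z in zip(seqs, Z):
    print(s, [round(v, 3) for v in z])
```

Each row of `Z` is a low-dimensional representation that could be passed to any standard classifier. For large datasets, the brute-force loop in `knn_density` would normally be replaced by an indexed nearest-neighbor search.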