Mingeun Ji, Yejin Kan, Dongyeon Kim, Seungmin Lee, Gangman Yi
{"title":"DeepPI: Alignment-Free Analysis of Flexible Length Proteins Based on Deep Learning and Image Generator","authors":"Mingeun Ji, Yejin Kan, Dongyeon Kim, Seungmin Lee, Gangman Yi","doi":"10.1007/s12539-024-00618-x","DOIUrl":null,"url":null,"abstract":"<p>With the rapid development of NGS technology, the number of protein sequences has increased exponentially. Computational methods have been introduced in protein functional studies because the analysis of large numbers of proteins through biological experiments is costly and time-consuming. In recent years, new approaches based on deep learning have been proposed to overcome the limitations of conventional methods. Although deep learning-based methods effectively utilize features of protein function, they are limited to sequences of fixed-length and consider information from adjacent amino acids. Therefore, new protein analysis tools that extract functional features from proteins of flexible length and train models are required. We introduce DeepPI, a deep learning-based tool for analyzing proteins in large-scale database. The proposed model that utilizes Global Average Pooling is applied to proteins of flexible length and leads to reduced information loss compared to existing algorithms that use fixed sizes. The image generator converts a one-dimensional sequence into a distinct two-dimensional structure, which can extract common parts of various shapes. Finally, filtering techniques automatically detect representative data from the entire database and ensure coverage of large protein databases. We demonstrate that DeepPI has been successfully applied to large databases such as the Pfam-A database. Comparative experiments on four types of image generators illustrated the impact of structure on feature extraction. The filtering performance was verified by varying the parameter values and proved to be applicable to large databases. Compared to existing methods, DeepPI outperforms in family classification accuracy for protein function inference.</p><h3 data-test=\"abstract-sub-heading\">Graphical Abstract</h3>\n","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":"51 1","pages":""},"PeriodicalIF":3.9000,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interdisciplinary Sciences: Computational Life Sciences","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1007/s12539-024-00618-x","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
With the rapid development of NGS technology, the number of protein sequences has increased exponentially. Computational methods have been introduced in protein functional studies because the analysis of large numbers of proteins through biological experiments is costly and time-consuming. In recent years, new approaches based on deep learning have been proposed to overcome the limitations of conventional methods. Although deep learning-based methods effectively utilize features of protein function, they are limited to sequences of fixed-length and consider information from adjacent amino acids. Therefore, new protein analysis tools that extract functional features from proteins of flexible length and train models are required. We introduce DeepPI, a deep learning-based tool for analyzing proteins in large-scale database. The proposed model that utilizes Global Average Pooling is applied to proteins of flexible length and leads to reduced information loss compared to existing algorithms that use fixed sizes. The image generator converts a one-dimensional sequence into a distinct two-dimensional structure, which can extract common parts of various shapes. Finally, filtering techniques automatically detect representative data from the entire database and ensure coverage of large protein databases. We demonstrate that DeepPI has been successfully applied to large databases such as the Pfam-A database. Comparative experiments on four types of image generators illustrated the impact of structure on feature extraction. The filtering performance was verified by varying the parameter values and proved to be applicable to large databases. Compared to existing methods, DeepPI outperforms in family classification accuracy for protein function inference.
期刊介绍:
Interdisciplinary Sciences--Computational Life Sciences aims to cover the most recent and outstanding developments in interdisciplinary areas of sciences, especially focusing on computational life sciences, an area that is enjoying rapid development at the forefront of scientific research and technology.
The journal publishes original papers of significant general interest covering recent research and developments. Articles will be published rapidly by taking full advantage of internet technology for online submission and peer-reviewing of manuscripts, and then by publishing OnlineFirstTM through SpringerLink even before the issue is built or sent to the printer.
The editorial board consists of many leading scientists with international reputation, among others, Luc Montagnier (UNESCO, France), Dennis Salahub (University of Calgary, Canada), Weitao Yang (Duke University, USA). Prof. Dongqing Wei at the Shanghai Jiatong University is appointed as the editor-in-chief; he made important contributions in bioinformatics and computational physics and is best known for his ground-breaking works on the theory of ferroelectric liquids. With the help from a team of associate editors and the editorial board, an international journal with sound reputation shall be created.