DeepPI：基于深度学习和图像生成器的灵活长度蛋白质免对齐分析

IF 3.9 2区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Interdisciplinary Sciences: Computational Life Sciences Pub Date : 2024-04-03 DOI:10.1007/s12539-024-00618-x

Mingeun Ji, Yejin Kan, Dongyeon Kim, Seungmin Lee, Gangman Yi

{"title":"DeepPI：基于深度学习和图像生成器的灵活长度蛋白质免对齐分析","authors":"Mingeun Ji, Yejin Kan, Dongyeon Kim, Seungmin Lee, Gangman Yi","doi":"10.1007/s12539-024-00618-x","DOIUrl":null,"url":null,"abstract":"<p>With the rapid development of NGS technology, the number of protein sequences has increased exponentially. Computational methods have been introduced in protein functional studies because the analysis of large numbers of proteins through biological experiments is costly and time-consuming. In recent years, new approaches based on deep learning have been proposed to overcome the limitations of conventional methods. Although deep learning-based methods effectively utilize features of protein function, they are limited to sequences of fixed-length and consider information from adjacent amino acids. Therefore, new protein analysis tools that extract functional features from proteins of flexible length and train models are required. We introduce DeepPI, a deep learning-based tool for analyzing proteins in large-scale database. The proposed model that utilizes Global Average Pooling is applied to proteins of flexible length and leads to reduced information loss compared to existing algorithms that use fixed sizes. The image generator converts a one-dimensional sequence into a distinct two-dimensional structure, which can extract common parts of various shapes. Finally, filtering techniques automatically detect representative data from the entire database and ensure coverage of large protein databases. We demonstrate that DeepPI has been successfully applied to large databases such as the Pfam-A database. Comparative experiments on four types of image generators illustrated the impact of structure on feature extraction. The filtering performance was verified by varying the parameter values and proved to be applicable to large databases. Compared to existing methods, DeepPI outperforms in family classification accuracy for protein function inference.</p><h3 data-test=\"abstract-sub-heading\">Graphical Abstract</h3>\n","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":"51 1","pages":""},"PeriodicalIF":3.9000,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DeepPI: Alignment-Free Analysis of Flexible Length Proteins Based on Deep Learning and Image Generator\",\"authors\":\"Mingeun Ji, Yejin Kan, Dongyeon Kim, Seungmin Lee, Gangman Yi\",\"doi\":\"10.1007/s12539-024-00618-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>With the rapid development of NGS technology, the number of protein sequences has increased exponentially. Computational methods have been introduced in protein functional studies because the analysis of large numbers of proteins through biological experiments is costly and time-consuming. In recent years, new approaches based on deep learning have been proposed to overcome the limitations of conventional methods. Although deep learning-based methods effectively utilize features of protein function, they are limited to sequences of fixed-length and consider information from adjacent amino acids. Therefore, new protein analysis tools that extract functional features from proteins of flexible length and train models are required. We introduce DeepPI, a deep learning-based tool for analyzing proteins in large-scale database. The proposed model that utilizes Global Average Pooling is applied to proteins of flexible length and leads to reduced information loss compared to existing algorithms that use fixed sizes. The image generator converts a one-dimensional sequence into a distinct two-dimensional structure, which can extract common parts of various shapes. Finally, filtering techniques automatically detect representative data from the entire database and ensure coverage of large protein databases. We demonstrate that DeepPI has been successfully applied to large databases such as the Pfam-A database. Comparative experiments on four types of image generators illustrated the impact of structure on feature extraction. The filtering performance was verified by varying the parameter values and proved to be applicable to large databases. Compared to existing methods, DeepPI outperforms in family classification accuracy for protein function inference.</p><h3 data-test=\\\"abstract-sub-heading\\\">Graphical Abstract</h3>\\n\",\"PeriodicalId\":13670,\"journal\":{\"name\":\"Interdisciplinary Sciences: Computational Life Sciences\",\"volume\":\"51 1\",\"pages\":\"\"},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2024-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interdisciplinary Sciences: Computational Life Sciences\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1007/s12539-024-00618-x\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interdisciplinary Sciences: Computational Life Sciences","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1007/s12539-024-00618-x","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

随着 NGS 技术的快速发展，蛋白质序列的数量呈指数级增长。由于通过生物实验分析大量蛋白质既费钱又费时，因此蛋白质功能研究引入了计算方法。近年来，人们提出了基于深度学习的新方法，以克服传统方法的局限性。虽然基于深度学习的方法能有效利用蛋白质功能的特征，但它们仅限于固定长度的序列，并考虑相邻氨基酸的信息。因此，需要能从长度灵活的蛋白质中提取功能特征并训练模型的新蛋白质分析工具。我们介绍了 DeepPI，这是一种基于深度学习的工具，用于分析大规模数据库中的蛋白质。利用全局平均池化技术提出的模型适用于长度灵活的蛋白质，与使用固定大小的现有算法相比，可减少信息损失。图像生成器可将一维序列转换为独特的二维结构，从而提取出各种形状的共同部分。最后，过滤技术可自动检测整个数据库中的代表性数据，确保覆盖大型蛋白质数据库。我们证明，DeepPI 已成功应用于 Pfam-A 数据库等大型数据库。四种图像生成器的对比实验说明了结构对特征提取的影响。通过改变参数值验证了过滤性能，并证明其适用于大型数据库。与现有方法相比，DeepPI在蛋白质功能推断方面的族分类准确性更胜一筹。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

DeepPI: Alignment-Free Analysis of Flexible Length Proteins Based on Deep Learning and Image Generator

With the rapid development of NGS technology, the number of protein sequences has increased exponentially. Computational methods have been introduced in protein functional studies because the analysis of large numbers of proteins through biological experiments is costly and time-consuming. In recent years, new approaches based on deep learning have been proposed to overcome the limitations of conventional methods. Although deep learning-based methods effectively utilize features of protein function, they are limited to sequences of fixed-length and consider information from adjacent amino acids. Therefore, new protein analysis tools that extract functional features from proteins of flexible length and train models are required. We introduce DeepPI, a deep learning-based tool for analyzing proteins in large-scale database. The proposed model that utilizes Global Average Pooling is applied to proteins of flexible length and leads to reduced information loss compared to existing algorithms that use fixed sizes. The image generator converts a one-dimensional sequence into a distinct two-dimensional structure, which can extract common parts of various shapes. Finally, filtering techniques automatically detect representative data from the entire database and ensure coverage of large protein databases. We demonstrate that DeepPI has been successfully applied to large databases such as the Pfam-A database. Comparative experiments on four types of image generators illustrated the impact of structure on feature extraction. The filtering performance was verified by varying the parameter values and proved to be applicable to large databases. Compared to existing methods, DeepPI outperforms in family classification accuracy for protein function inference.

Graphical Abstract

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Interdisciplinary Sciences: Computational Life Sciences MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

8.60

自引率

4.20%

发文量

期刊介绍： Interdisciplinary Sciences--Computational Life Sciences aims to cover the most recent and outstanding developments in interdisciplinary areas of sciences, especially focusing on computational life sciences, an area that is enjoying rapid development at the forefront of scientific research and technology. The journal publishes original papers of significant general interest covering recent research and developments. Articles will be published rapidly by taking full advantage of internet technology for online submission and peer-reviewing of manuscripts, and then by publishing OnlineFirstTM through SpringerLink even before the issue is built or sent to the printer. The editorial board consists of many leading scientists with international reputation, among others, Luc Montagnier (UNESCO, France), Dennis Salahub (University of Calgary, Canada), Weitao Yang (Duke University, USA). Prof. Dongqing Wei at the Shanghai Jiatong University is appointed as the editor-in-chief; he made important contributions in bioinformatics and computational physics and is best known for his ground-breaking works on the theory of ferroelectric liquids. With the help from a team of associate editors and the editorial board, an international journal with sound reputation shall be created.