BMC Bioinformatics: latest articles, page 9

Predicting viral proteins that evade the innate immune system: a machine learning-based immunoinformatics tool.

IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-11-09 DOI: 10.1186/s12859-024-05972-7

Jorge F Beltrán, Lisandra Herrera Belén, Alejandro J Yáñez, Luis Jimenez

Viral proteins that evade the host's innate immune response play a crucial role in pathogenesis, significantly impacting viral infections and potential therapeutic strategies. Identifying these proteins through traditional methods is challenging and time-consuming due to the complexity of virus-host interactions. Leveraging advancements in computational biology, we present VirusHound-II, a novel tool that utilizes machine learning techniques to predict viral proteins evading the innate immune response with high accuracy. We evaluated a comprehensive range of machine learning models, including ensemble methods, neural networks, and support vector machines. Using a dataset of 1337 viral proteins known to evade the innate immune response (VPEINRs) and an equal number of non-VPEINRs, we employed pseudo amino acid composition as the molecular descriptor. Our methodology involved a tenfold cross-validation strategy on 80% of the data for training, followed by testing on an independent dataset comprising the remaining 20%. The random forest model demonstrated superior performance metrics, achieving 0.9290 accuracy, 0.9283 F1 score, 0.9354 precision, and 0.9213 sensitivity in the independent testing phase. These results establish VirusHound-II as an advancement in computational virology, accessible via a user-friendly web application. We anticipate that VirusHound-II will be a crucial resource for researchers, enabling the rapid and reliable prediction of viral proteins evading the innate immune response. This tool has the potential to accelerate the identification of therapeutic targets and enhance our understanding of viral evasion mechanisms, contributing to the development of more effective antiviral strategies and advancing our knowledge of virus-host interactions.
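The reported metrics can be cross-checked against each other, since the F1 score is the harmonic mean of precision and sensitivity. The sketch below is illustrative only: the function names and the confusion-matrix counts in the usage example are hypothetical and are not taken from the paper.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def metrics_from_counts(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1_score(precision, sensitivity),
    }


if __name__ == "__main__":
    # Precision and sensitivity as reported in the abstract.
    print(round(f1_score(0.9354, 0.9213), 4))  # 0.9283
    # Hypothetical counts, only to show the call shape.
    print(metrics_from_counts(tp=9, fp=1, tn=8, fn=2))
```

The recomputed value agrees with the reported F1 score of 0.9283 to four decimal places.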


Vol. 25, no. 1, article 351. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11550529/pdf/

Citations: 0

Abstraction-based segmental simulation of reaction networks using adaptive memoization.

IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-11-08 DOI: 10.1186/s12859-024-05966-5

Martin Helfrich, Roman Andriushchenko, Milan Češka, Jan Křetínský, Štefan Martiček, David Šafránek

Background: Stochastic models are commonly employed in systems and synthetic biology to study the effects of stochastic fluctuations emanating from reactions involving species with low copy numbers. Many important models feature complex dynamics, involving state-space explosion, stiffness, and multimodality, that complicate the quantitative analysis needed to understand their stochastic behavior. Direct numerical analysis of such models is typically not feasible, and generating enough simulation runs to adequately approximate the model's dynamics may take a prohibitively long time.

Results: We propose a new memoization technique that leverages a population-based abstraction and combines previously generated parts of simulations, called segments, to generate new simulations more efficiently while preserving the original system's dynamics and its diversity. Our algorithm adapts online to identify the most important abstract states and thus utilizes the available memory efficiently.

Conclusion: We demonstrate that in combination with a novel fully automatic and adaptive hybrid simulation scheme, we can speed up the generation of trajectories significantly and correctly predict the transient behavior of complex stochastic systems.
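The memoization idea described in the Results paragraph can be illustrated on a toy birth-death process. This is a minimal sketch under stated assumptions, not the authors' algorithm: the abstraction (power-of-two population levels), the fixed per-level cache size, and all function and parameter names are hypothetical, and the adaptive selection of important abstract states and the hybrid simulation scheme are not modelled.

```python
import random


def level(n):
    """Abstract population level: 0 for n == 0, else 1 + floor(log2(n))."""
    return n.bit_length()


def simulate_segment(n, birth, death, rng, max_steps=50):
    """Run exact stochastic simulation steps from n until the abstract level changes.

    Returns the segment as a relative change (delta_n, delta_t).
    """
    start, start_level, t = n, level(n), 0.0
    for _ in range(max_steps):
        rate_birth, rate_death = birth, death * n
        total = rate_birth + rate_death
        t += rng.expovariate(total)  # waiting time to the next reaction
        n += 1 if rng.random() * total < rate_birth else -1
        if level(n) != start_level:
            break
    return n - start, t


def simulate(n0, t_end, birth, death, rng, cache, per_level=20):
    """Build a trajectory from segments, reusing cached ones where available."""
    n, t, reused = n0, 0.0, 0
    while t < t_end:
        segments = cache.setdefault(level(n), [])
        if len(segments) >= per_level:
            # Enough segments stored for this abstract state: reuse one at random.
            dn, dt = rng.choice(segments)
            reused += 1
        else:
            # Otherwise simulate a fresh segment and memoize it.
            dn, dt = simulate_segment(n, birth, death, rng)
            segments.append((dn, dt))
        n, t = max(0, n + dn), t + dt
    return n, reused


if __name__ == "__main__":
    rng, cache = random.Random(0), {}
    for run in range(5):
        final_n, reused = simulate(0, 100.0, 10.0, 0.1, rng, cache, per_level=10)
        print(run, final_n, reused)
```

Sampling among several stored segments per abstract state, rather than replaying a single one, is what keeps some of the diversity of the original dynamics in this toy version; later runs reuse more segments because the shared cache fills up.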


Vol. 25, no. 1, article 350. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549863/pdf/

Citations: 0

Graph-based machine learning model for weight prediction in protein-protein networks.

IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-11-08 DOI: 10.1186/s12859-024-05973-6

Hajer Akid, Kirsley Chennen, Gabriel Frey, Julie Thompson, Mounir Ben Ayed, Nicolas Lachiche

Proteins interact with each other in complex ways to perform significant biological functions. These interactions, known as protein-protein interactions (PPIs), can be depicted as a graph where proteins are nodes and their interactions are edges. High-throughput experimental technologies generate large volumes of data, which allows increasingly sophisticated PPI models. However, despite significant progress, current PPI networks remain incomplete. Discovering missing interactions through experimental techniques can be costly, time-consuming, and challenging. Therefore, computational approaches have emerged as valuable tools for predicting missing interactions. In such a graph, an edge between two proteins indicates a known interaction, while the absence of an edge means the interaction is unknown or has been missed. However, this binary representation overlooks the reliability of known interactions when predicting new ones. To address this challenge, we propose a novel approach for link prediction in weighted protein-protein networks, where interaction weights denote confidence scores. By leveraging data from the yeast Saccharomyces cerevisiae obtained from the STRING database, we introduce a new model that combines similarity-based algorithms and aggregated confidence score weights for accurate link prediction. Our model significantly improves prediction accuracy, surpassing traditional approaches in terms of Mean Absolute Error, Mean Relative Absolute Error, and Root Mean Square Error. Our proposed approach holds the potential for improved accuracy in predicting PPIs, which is crucial for better understanding the underlying biological processes.
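To make the idea of predicting an edge weight from neighbouring confidence scores concrete, here is a small sketch of a weighted common-neighbours predictor together with two of the error measures named above. It is an assumption-laden illustration, not the model from the paper: the aggregation rule (mean of averaged neighbour weights), the node names, and the confidence values are all hypothetical.

```python
import math


def build_adjacency(edges):
    """Turn {(u, v): weight} into a symmetric adjacency map."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def predict_weight(adj, u, v):
    """Predict the weight of (u, v) from the weights to their common neighbours."""
    common = set(adj.get(u, {})) & set(adj.get(v, {}))
    if not common:
        return 0.0
    return sum((adj[u][z] + adj[v][z]) / 2 for z in common) / len(common)


def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_square_error(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


if __name__ == "__main__":
    # Hypothetical confidence scores in [0, 1] on a four-protein toy network.
    edges = {("A", "B"): 0.9, ("A", "C"): 0.7, ("B", "C"): 0.8,
             ("C", "D"): 0.4, ("B", "D"): 0.6}
    adj = build_adjacency(edges)
    print(predict_weight(adj, "A", "D"))
```

For the missing edge A-D, the two common neighbours B and C contribute (0.9 + 0.6) / 2 and (0.7 + 0.4) / 2, so the predicted confidence is their mean, 0.65 up to floating-point rounding.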


Vol. 25, no. 1, article 349. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11546293/pdf/

Citations: 0

Rapid bacterial identification through volatile organic compound analysis and deep learning.

IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-11-06 DOI: 10.1186/s12859-024-05967-4

Bowen Yan, Lin Zeng, Yanyi Lu, Min Li, Weiping Lu, Bangfu Zhou, Qinghua He

Background: The increasing antimicrobial resistance caused by the improper use of antibiotics poses a significant challenge to humanity. Rapid and accurate identification of microbial species in clinical settings is crucial for precise medication and reducing the development of antimicrobial resistance. This study aimed to explore a method for automatic identification of bacteria using Volatile Organic Compounds (VOCs) analysis and deep learning algorithms.

Results: AlexNet with data augmentation produced the best results. The average accuracy for single bacterial culture classification reached 99.24% under cross-validation, and the accuracies for identifying the three bacteria in randomly mixed cultures were SA: 98.6%, EC: 98.58%, and PA: 98.99%.

Conclusion: This work provides a new approach to quickly identify bacterial microorganisms. The method can automatically identify bacteria in GC-IMS detection results, helping clinicians quickly detect bacterial species and prescribe medication accurately, thereby controlling epidemics and minimizing the negative impact of bacterial resistance on society.


Vol. 25, no. 1, article 347. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11539783/pdf/

Citations: 0

Prediction of antibody-antigen interaction based on backbone aware with invariant point attention.

IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-11-06 DOI: 10.1186/s12859-024-05961-w

Miao Gu, Weiyang Yang, Min Liu

Background: Antibodies play a crucial role in disease treatment, leveraging their ability to selectively interact with a specific antigen. However, screening antibody gene sequences for target antigens via biological experiments is extremely time-consuming and labor-intensive. Several computational methods have been developed to predict antibody-antigen interaction, but they do not characterize the underlying structure of the antibody.

Results: Benefiting from recent breakthroughs in deep learning for antibody structure prediction, we propose a novel neural network architecture to predict antibody-antigen interaction. We first introduce AbAgIPA: an antibody structure prediction network to obtain the antibody backbone structure, where the structural features of antibodies and antigens are encoded into representation vectors according to the amino acid physicochemical features and Invariant Point Attention (IPA) computation methods. Finally, the antibody-antigen interaction is predicted by global max pooling, feature concatenation, and a fully connected layer. We evaluated our method on antigen diversity and antigen-specific antibody-antigen interaction datasets. Additionally, our model exhibits a commendable level of interpretability, essential for understanding underlying interaction mechanisms.

Conclusions: Quantitative experimental results demonstrate that the new neural network architecture significantly outperforms the best sequence-based methods as well as the methods based on residue contact maps and graph convolution networks (GCNs). The source code is freely available on GitHub at https://github.com/gmthu66/AbAgIPA.
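The final prediction step named in the Results paragraph (global max pooling, feature concatenation, and a fully connected layer) can be sketched in plain Python. This is only a shape-level illustration under assumptions: the per-residue feature vectors stand in for the output of the structure-aware encoder, which is not reproduced here, and the weights, bias, and function names are hypothetical rather than taken from the AbAgIPA repository.

```python
import math


def global_max_pool(residue_features):
    """Collapse a list of per-residue feature vectors into one vector (column-wise max)."""
    return [max(column) for column in zip(*residue_features)]


def predict_interaction(antibody_features, antigen_features, weights, bias):
    """Pool each chain, concatenate, apply one linear layer and a sigmoid."""
    pooled = global_max_pool(antibody_features) + global_max_pool(antigen_features)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Hypothetical 2-dimensional residue embeddings for a short antibody and antigen.
    antibody = [[0.1, 0.5], [0.3, 0.2], [0.0, 0.4]]
    antigen = [[0.7, 0.1], [0.2, 0.9]]
    weights, bias = [0.5, -0.25, 1.0, 0.75], -0.5  # hypothetical parameters
    print(predict_interaction(antibody, antigen, weights, bias))
```

Max pooling makes the pooled vector independent of chain length, which is why antibodies and antigens of different sizes can share one fully connected layer.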


Vol. 25, no. 1, article 348. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11542381/pdf/

Citations: 0

REDalign: accurate RNA structural alignment using residual encoder-decoder network.

IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-11-05 DOI: 10.1186/s12859-024-05956-7

Chun-Chi Chen, Yi-Ming Chan, Hyundoo Jeong

Background: RNA secondary structural alignment serves as a foundational procedure in identifying conserved structural motifs among RNA sequences, crucially advancing our understanding of novel RNAs via comparative genomic analysis. While various computational strategies for RNA structural alignment exist, they often come with high computational complexity. Specifically, when addressing a set of RNAs with unknown structures, the task of simultaneously predicting their consensus secondary structure and determining the optimal sequence alignment requires an overwhelming computational effort of $O(L^{6})$ for each RNA pair. Such an extremely high computational complexity makes these methods impractical for large-scale analysis despite their accurate alignment capabilities.

Results: In this paper, we introduce REDalign, an innovative approach based on deep learning for RNA secondary structural alignment. By utilizing a residual encoder-decoder network, REDalign can efficiently capture consensus structures and optimize structural alignments. In this learning model, the encoder network leverages a hierarchical pyramid to assimilate high-level structural features. Concurrently, the decoder network, enhanced with residual skip connections, integrates multi-level encoded features to learn detailed feature hierarchies with fewer parameter sets. REDalign significantly reduces computational complexity compared to Sankoff-style algorithms and effectively handles non-nested structures, including pseudoknots, which are challenging for traditional alignment methods. Extensive evaluations demonstrate that REDalign provides superior accuracy and substantial computational efficiency.

Conclusion: REDalign presents a significant advancement in RNA secondary structural alignment, balancing high alignment accuracy with lower computational demands. Its ability to handle complex RNA structures, including pseudoknots, makes it an effective tool for large-scale RNA analysis, with potential implications for accelerating discoveries in RNA research and comparative genomics.
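The abstract singles out non-nested structures such as pseudoknots as difficult for traditional alignment methods. A pseudoknot is present when two base pairs cross, that is, pairs (i, j) and (k, l) with i < k < j < l. The check below is a small sketch of that definition only; it is not part of REDalign, and the function name is hypothetical.

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs (i, j) and (k, l) cross: i < k < j < l."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for index, (i, j) in enumerate(ordered):
        # Pairs later in the sorted order open at or after position i.
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return True
    return False


if __name__ == "__main__":
    print(has_pseudoknot([(0, 9), (1, 8), (2, 7)]))  # False: fully nested stem
    print(has_pseudoknot([(0, 5), (3, 8)]))          # True: the two pairs cross
```

Nested structures can be handled by dynamic programming over intervals, whereas crossing pairs break that decomposition, which is one reason pseudoknots are treated as a separate case.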


Vol. 25, no. 1, article 346. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11539752/pdf/

Citations: 0

PangeBlocks: customized construction of pangenome graphs via maximal blocks.

IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-11-04 DOI: 10.1186/s12859-024-05958-5

Jorge Avila Cartes, Paola Bonizzoni, Simone Ciccolella, Gianluca Della Vedova, Luca Denti

Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with heuristics, without an explicit optimization criterion. Thus, it is unclear how a specific optimization criterion affects the graph topology and downstream analyses such as read mapping and variant calling.

Results: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.

Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
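The exact-cover view of graph construction can be tried on a toy alignment. The sketch below replaces the ILP with brute-force enumeration, which is only feasible for a handful of blocks, and it is not the PangeBlocks implementation: the block encoding (rows, first column, last column), the candidate blocks, and both weightings are hypothetical choices made for illustration.

```python
from itertools import combinations


def cells(block):
    """MSA cells covered by a block given as (rows, first_column, last_column)."""
    rows, start, end = block
    return {(r, c) for r in rows for c in range(start, end + 1)}


def min_weight_exact_cover(blocks, weights, n_rows, n_cols):
    """Return (cost, chosen block indices) of the cheapest exact cover, or None."""
    universe = {(r, c) for r in range(n_rows) for c in range(n_cols)}
    best = None
    for size in range(1, len(blocks) + 1):
        for subset in combinations(range(len(blocks)), size):
            covered = [cell for i in subset for cell in cells(blocks[i])]
            # Exact cover: every cell covered, and no cell covered twice.
            if len(covered) == len(universe) and set(covered) == universe:
                cost = sum(weights[i] for i in subset)
                if best is None or cost < best[0]:
                    best = (cost, subset)
    return best


if __name__ == "__main__":
    # Toy MSA with rows "ACG" and "ATG"; candidate blocks listed by hand.
    blocks = [((0, 1), 0, 0), ((0,), 1, 1), ((1,), 1, 1),
              ((0, 1), 2, 2), ((0,), 0, 2), ((1,), 0, 2)]
    unit = [1] * len(blocks)                       # objective: fewest blocks
    length = [end - start + 1 for _, start, end in blocks]  # objective: least sequence
    print(min_weight_exact_cover(blocks, unit, 2, 3))    # (2, (4, 5))
    print(min_weight_exact_cover(blocks, length, 2, 3))  # (4, (0, 1, 2, 3))
```

With unit weights the cheapest cover keeps each sequence as one block, while weighting by label length selects the shared columns as single blocks plus one block per variant, a small instance of the abstract's point that the objective function shapes the resulting graph.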


PangeBlocks: customized construction of pangenome graphs via maximal blocks. Jorge Avila Cartes, Paola Bonizzoni, Simone Ciccolella, Gianluca Della Vedova, Luca Denti. BMC Bioinformatics 25(1):344. Published 2024-11-04. DOI: 10.1186/s12859-024-05958-5. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11533710/pdf/

Citations: 0

GPCR-BSD: a database of binding sites of human G-protein coupled receptors under diverse states.

IF 2.9, Zone 3 (Biology), Q2, Biochemical Research Methods

BMC Bioinformatics

Pub Date : 2024-11-04 DOI: 10.1186/s12859-024-05962-9

Fan Liu, Han Zhou, Xiaonong Li, Liangliang Zhou, Chungong Yu, Haicang Zhang, Dongbo Bu, Xinmiao Liang

G-protein coupled receptors (GPCRs), the largest family of membrane proteins in the human body, are involved in a great variety of biological processes and have thus become highly valuable drug targets. By binding ligands (e.g., drugs), GPCRs switch between active and inactive conformational states, thereby performing functions such as signal transmission. The changes in binding pockets under different states are important for a better understanding of drug-target interactions. Obtaining binding sites in human GPCR structures is therefore both critical and a practical need. We report a database, GPCR-BSD, that collects 127,990 predicted binding sites of 803 GPCRs under active and inactive states (1,606 structures in total). The binding sites were identified from the predicted GPCR structures by running three geometry-based pocket prediction methods: fpocket, CavityPlus, and GHECOM. The server provides query, visualization, and comparison of the predicted binding sites for both predicted GPCR structures and experimentally determined structures recorded in the PDB. We evaluated the identified pockets of 132 experimentally determined human GPCR structures in terms of pocket residue coverage, pocket center distance, and redocking accuracy. The evaluation showed that fpocket and CavityPlus performed better and successfully predicted orthosteric binding sites in over 60% of the 132 experimentally determined structures. The GPCR Binding Site database is freely accessible at https://gpcrbs.bigdata.jcmsc.cn. This study not only provides the first systematic evaluation of the commonly used fpocket and CavityPlus methods but also meets the need for binding site information in GPCR studies.
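
Two of the evaluation criteria named above, pocket center distance and pocket residue coverage, can be sketched as follows. This is an illustrative sketch under stated assumptions: the abstract does not give the exact definitions, so the code assumes the center distance is the Euclidean distance between the centroid of the predicted pocket points and the centroid of the bound ligand atoms, and that coverage is the fraction of true binding-site residues recovered by the prediction. The function names are hypothetical.

```python
import math


def centroid(points):
    """Arithmetic mean of a non-empty list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def pocket_center_distance(pocket_points, ligand_atoms):
    """Assumed definition: distance between pocket and ligand centroids."""
    return math.dist(centroid(pocket_points), centroid(ligand_atoms))


def residue_coverage(predicted_residues, true_residues):
    """Assumed definition: share of true site residues found in the prediction."""
    true_set = set(true_residues)
    if not true_set:
        raise ValueError("true_residues must not be empty")
    return len(true_set & set(predicted_residues)) / len(true_set)


if __name__ == "__main__":
    pocket = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    ligand = [(1.0, 3.0, 4.0)]
    print(pocket_center_distance(pocket, ligand))
    print(residue_coverage({101, 105, 109}, {105, 109, 113, 117}))
```

Any threshold for calling a prediction successful (for example a maximum center distance in ångströms) would be a further assumption and is left out here.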

BMC Bioinformatics 25(1):343. Published 2024-11-04. DOI: 10.1186/s12859-024-05962-9. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11533411/pdf/

Citations: 0

MIPPIS: protein-protein interaction site prediction network with multi-information fusion.

IF 2.9, Zone 3 (Biology), Q2, Biochemical Research Methods

BMC Bioinformatics

Pub Date : 2024-11-04 DOI: 10.1186/s12859-024-05964-7

Shuang Wang, Kaiyu Dong, Dingming Liang, Yunjing Zhang, Xue Li, Tao Song

Background: The prediction of protein-protein interaction sites plays a crucial role in biochemical processes. Investigating the interaction between viruses and receptor proteins through biological techniques aids in understanding disease mechanisms and guides the development of corresponding drugs. While various methods have been proposed in the past, they often suffer from drawbacks such as long processing times, high costs, and low accuracy.

Results: Addressing these challenges, we propose a novel protein-protein interaction site prediction network based on multi-information fusion. In our approach, the initial amino acid features are represented by the position-specific scoring matrix, hidden Markov model, dictionary of protein secondary structure, and one-hot encoding. We then adopt a multi-channel approach to extract deep-level amino acid features from different perspectives. The graph convolutional network channel extracts spatial structural information. The bidirectional long short-term memory channel treats the amino acid sequence as natural language, capturing the protein's primary structure information. The ProtT5 protein large language model channel outputs a more comprehensive amino acid embedding representation, complementing the two aforementioned channels. Finally, the obtained amino acid features are fed into the prediction layer for the final prediction.

Conclusion: Compared with six protein structure-based methods and six protein sequence-based methods, our model achieves the best performance across evaluation metrics, including accuracy, precision, F₁, Matthews correlation coefficient, and area under the precision-recall curve, which demonstrates the superiority of our model.
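
For readers less familiar with the threshold-based metrics listed above, the following sketch computes accuracy, precision, recall, F₁, and the Matthews correlation coefficient from per-residue confusion counts using their standard definitions. The function name and the example counts are made up for illustration and are not results from the paper. The area under the precision-recall curve is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal total is 0.
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts: 40 interface residues found, 20 missed,
    # 10 false alarms, 30 non-interface residues correctly rejected.
    for name, value in binary_metrics(tp=40, fp=10, fn=20, tn=30).items():
        print(f"{name}: {value:.4f}")
```

Interaction-site datasets are typically imbalanced, with far fewer interface residues than non-interface ones, which is one reason MCC and precision-recall measures are reported alongside accuracy.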

BMC Bioinformatics 25(1):345. Published 2024-11-04. DOI: 10.1186/s12859-024-05964-7. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11536593/pdf/

Citations: 0

CUDASW++4.0: ultra-fast GPU-based Smith-Waterman protein sequence database search.

IF 2.9, Zone 3 (Biology), Q2, Biochemical Research Methods

BMC Bioinformatics

Pub Date : 2024-11-02 DOI: 10.1186/s12859-024-05965-6

Bertil Schmidt, Felix Kallenborn, Alejandro Chacon, Christian Hundt

Background: The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations.

Results: CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. Our approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions. We provide efficient matrix tiling and sequence database partitioning schemes, and exploit next-generation floating-point arithmetic and novel DPX instructions. This leads to close-to-peak performance on modern GPU generations (Ampere, Ada, Hopper), with throughput rates of up to 1.94 TCUPS, 5.01 TCUPS, and 5.71 TCUPS on an A100, L40S, and H100, respectively. Evaluation on the Swiss-Prot, UniRef50, and TrEMBL databases shows that CUDASW++4.0 achieves more than an order-of-magnitude performance improvement over previous GPU-based approaches (CUDASW++3.0, ADEPT, SW#DB). In addition, our algorithm demonstrates significant speedups over top-performing CPU-based tools (BLASTP, SWIPE, SWIMM2.0), can exploit multi-GPU nodes with linear scaling, and features an energy efficiency of up to 15.7 GCUPS/Watt.

Conclusion: CUDASW++4.0 changes the standing of GPUs in protein sequence database search with Smith-Waterman alignment by providing close-to-peak performance on modern GPUs. It is freely available at https://github.com/asbschmidt/CUDASW4.
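
As a point of reference for the quadratic-time recurrence and the CUPS (cell updates per second) throughput unit mentioned above, here is a plain CPU sketch of Smith-Waterman scoring. It is not the CUDASW++4.0 kernel: it assumes a simple match/mismatch score and a linear gap penalty, whereas protein database search normally uses a substitution matrix and affine gaps. The function names and default scores are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty).

    Fills the len(a) x len(b) dynamic programming matrix row by row,
    keeping only the previous row in memory.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution, # align a[i-1] with b[j-1]
                previous[j] + gap,              # gap in b
                current[j - 1] + gap,           # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


def gcups(query_length, database_residues, seconds):
    """Throughput in giga cell updates per second for one database scan."""
    return query_length * database_residues / seconds / 1e9


if __name__ == "__main__":
    print(smith_waterman_score("GTTAC", "GTTGAC", match=3, mismatch=-3, gap=-2))
    print(gcups(query_length=1000, database_residues=1_000_000, seconds=1.0))
```

Each query-database pair costs one cell update per pair of residues, which is why throughput is quoted in cell updates per second: 1 TCUPS corresponds to 10¹² matrix cells per second.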

BMC Bioinformatics 25(1):342. Published 2024-11-02. DOI: 10.1186/s12859-024-05965-6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11531700/pdf/

Citations: 0