BMC Bioinformatics: Latest Articles

Graph-based machine learning model for weight prediction in protein-protein networks.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-08 DOI: 10.1186/s12859-024-05973-6
Hajer Akid, Kirsley Chennen, Gabriel Frey, Julie Thompson, Mounir Ben Ayed, Nicolas Lachiche

Proteins interact with each other in complex ways to perform significant biological functions. These interactions, known as protein-protein interactions (PPIs), can be depicted as a graph where proteins are nodes and their interactions are edges. The development of high-throughput experimental technologies allows for the generation of numerous data which permits increasing the sophistication of PPI models. However, despite significant progress, current PPI networks remain incomplete. Discovering missing interactions through experimental techniques can be costly, time-consuming, and challenging. Therefore, computational approaches have emerged as valuable tools for predicting missing interactions. In PPI networks, a graph is usually used to model the interactions between proteins. An edge between two proteins indicates a known interaction, while the absence of an edge means the interaction is not known or missed. However, this binary representation overlooks the reliability of known interactions when predicting new ones. To address this challenge, we propose a novel approach for link prediction in weighted protein-protein networks, where interaction weights denote confidence scores. By leveraging data from the yeast Saccharomyces cerevisiae obtained from the STRING database, we introduce a new model that combines similarity-based algorithms and aggregated confidence score weights for accurate link prediction purposes. Our model significantly improves prediction accuracy, surpassing traditional approaches in terms of Mean Absolute Error, Mean Relative Absolute Error, and Root Mean Square Error. Our proposed approach holds the potential for improved accuracy in predicting PPIs, which is crucial for better understanding the underlying biological processes.
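
The idea of predicting a missing edge weight from neighbouring confidence scores can be illustrated with a minimal sketch. This is not the authors' model: the predictor below is a simple weighted common-neighbours average chosen for illustration, and the protein names, weights, and function names are all hypothetical. It also shows how Mean Absolute Error and Root Mean Square Error would be computed on held-out edges.

```python
import math

# Hypothetical toy network: edge weights stand in for confidence scores in [0, 1].
EDGES = {
    ("A", "B"): 0.9,
    ("A", "C"): 0.8,
    ("B", "C"): 0.7,
    ("B", "D"): 0.6,
    ("C", "D"): 0.5,
}

def build_adjacency(edges):
    """Turn an undirected weighted edge list into a dict-of-dicts adjacency map."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def predict_weight(adj, u, v):
    """Illustrative predictor (an assumption, not the paper's method):
    average over common neighbours z of the mean of w(u, z) and w(v, z)."""
    common = set(adj.get(u, {})) & set(adj.get(v, {}))
    if not common:
        return 0.0
    return sum((adj[u][z] + adj[v][z]) / 2 for z in common) / len(common)

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hold out one edge, predict its weight from the remaining graph, then score it.
held_out = ("B", "C")
train = {edge: w for edge, w in EDGES.items() if edge != held_out}
adj = build_adjacency(train)
pred = predict_weight(adj, *held_out)
print(pred, mae([EDGES[held_out]], [pred]), rmse([EDGES[held_out]], [pred]))
```

In a real evaluation the hold-out step would be repeated over many edges, and the predictor would be replaced by whichever similarity-based score is under study.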

Citations: 0
Rapid bacterial identification through volatile organic compound analysis and deep learning.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-06 DOI: 10.1186/s12859-024-05967-4
Bowen Yan, Lin Zeng, Yanyi Lu, Min Li, Weiping Lu, Bangfu Zhou, Qinghua He

Background: The increasing antimicrobial resistance caused by the improper use of antibiotics poses a significant challenge to humanity. Rapid and accurate identification of microbial species in clinical settings is crucial for precise medication and reducing the development of antimicrobial resistance. This study aimed to explore a method for automatic identification of bacteria using Volatile Organic Compounds (VOCs) analysis and deep learning algorithms.

Results: AlexNet, where augmentation is applied, produces the best results. The average accuracy rate for single bacterial culture classification reached 99.24% using cross-validation, and the accuracy rates for identifying the three bacteria in randomly mixed cultures were SA:98.6%, EC:98.58% and PA:98.99%, respectively.

Conclusion: This work provides a new approach to quickly identify bacterial microorganisms. The method can automatically identify bacteria in GC-IMS detection results, helping clinicians quickly detect bacterial species and accurately prescribe medication, thereby controlling epidemics and minimizing the negative impact of bacterial resistance on society.

Citations: 0
Prediction of antibody-antigen interaction based on backbone aware with invariant point attention.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-06 DOI: 10.1186/s12859-024-05961-w
Miao Gu, Weiyang Yang, Min Liu

Background: Antibodies play a crucial role in disease treatment, leveraging their ability to selectively interact with specific antigens. However, screening antibody gene sequences for target antigens via biological experiments is extremely time-consuming and labor-intensive. Several computational methods have been developed to predict antibody-antigen interaction, but they do not characterize the underlying structure of the antibody.

Results: Benefiting from the recent breakthroughs in deep learning for antibody structure prediction, we propose a novel neural network architecture to predict antibody-antigen interaction. We first introduce AbAgIPA: an antibody structure prediction network to obtain the antibody backbone structure, where the structural features of antibodies and antigens are encoded into representation vectors according to the amino acid physicochemical features and Invariant Point Attention (IPA) computation methods. Finally, the antibody-antigen interaction is predicted by global max pooling, feature concatenation, and a fully connected layer. We evaluated our method on antigen diversity and antigen-specific antibody-antigen interaction datasets. Additionally, our model exhibits a commendable level of interpretability, essential for understanding underlying interaction mechanisms.

Conclusions: Quantitative experimental results demonstrate that the new neural network architecture significantly outperforms the best sequence-based methods as well as the methods based on residue contact maps and graph convolution networks (GCNs). The source code is freely available on GitHub at https://github.com/gmthu66/AbAgIPA .
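
The final prediction step named in the abstract (global max pooling, feature concatenation, and a fully connected layer) can be sketched in a few lines. This is only an illustration of that generic pattern, not the AbAgIPA code: the function names, the sigmoid output, and the toy feature values are assumptions, and the encoders that would produce the per-residue features are omitted.

```python
import math

def global_max_pool(features):
    """Column-wise maximum over a list of per-residue feature vectors."""
    return [max(column) for column in zip(*features)]

def predict_interaction(antibody_feats, antigen_feats, weights, bias):
    """Pool each molecule, concatenate the pooled vectors, and apply one
    fully connected unit followed by a sigmoid (output choice is an assumption)."""
    pooled = global_max_pool(antibody_feats) + global_max_pool(antigen_feats)
    if len(pooled) != len(weights):
        raise ValueError("weight vector must match the concatenated feature size")
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical inputs: 3 antibody residues and 2 antigen residues, 2 features each.
antibody = [[0.1, 0.5], [0.3, 0.2], [0.0, 0.9]]
antigen = [[0.4, 0.4], [0.7, 0.1]]
score = predict_interaction(antibody, antigen, weights=[0.5, -0.2, 0.3, 0.1], bias=0.0)
print(score)
```

Max pooling makes the pooled vector independent of sequence length, which is why molecules of different sizes can share the same fully connected layer.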

Citations: 0
REDalign: accurate RNA structural alignment using residual encoder-decoder network.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-05 DOI: 10.1186/s12859-024-05956-7
Chun-Chi Chen, Yi-Ming Chan, Hyundoo Jeong

Background: RNA secondary structural alignment serves as a foundational procedure in identifying conserved structural motifs among RNA sequences, crucially advancing our understanding of novel RNAs via comparative genomic analysis. While various computational strategies for RNA structural alignment exist, they often come with high computational complexity. Specifically, when addressing a set of RNAs with unknown structures, the task of simultaneously predicting their consensus secondary structure and determining the optimal sequence alignment requires an overwhelming computational effort of $O(L^6)$ for each RNA pair. Such an extremely high computational complexity makes these methods impractical for large-scale analysis despite their accurate alignment capabilities.

Results: In this paper, we introduce REDalign, an innovative approach based on deep learning for RNA secondary structural alignment. By utilizing a residual encoder-decoder network, REDalign can efficiently capture consensus structures and optimize structural alignments. In this learning model, the encoder network leverages a hierarchical pyramid to assimilate high-level structural features. Concurrently, the decoder network, enhanced with residual skip connections, integrates multi-level encoded features to learn detailed feature hierarchies with fewer parameter sets. REDalign significantly reduces computational complexity compared to Sankoff-style algorithms and effectively handles non-nested structures, including pseudoknots, which are challenging for traditional alignment methods. Extensive evaluations demonstrate that REDalign provides superior accuracy and substantial computational efficiency.

Conclusion: REDalign presents a significant advancement in RNA secondary structural alignment, balancing high alignment accuracy with lower computational demands. Its ability to handle complex RNA structures, including pseudoknots, makes it an effective tool for large-scale RNA analysis, with potential implications for accelerating discoveries in RNA research and comparative genomics.
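
To make the sixth-power cost quoted in the background concrete, the short sketch below computes how the work for one RNA pair scales when the sequence length grows. It only evaluates the ratio of two L^6 terms and ignores constant factors; the function name is hypothetical and nothing here reflects REDalign's own running time.

```python
def sankoff_cost_ratio(old_length, new_length):
    """Relative growth of an O(L^6) computation when L changes (constants ignored)."""
    return (new_length / old_length) ** 6

# Doubling and tripling the sequence length of a 100-nt RNA pair.
for new_length in (200, 300):
    print(new_length, sankoff_cost_ratio(100, new_length))
```

Doubling the length multiplies the work by 2^6 = 64, which is why such exact methods become impractical for large-scale analysis.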

Citations: 0
PangeBlocks: customized construction of pangenome graphs via maximal blocks.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-04 DOI: 10.1186/s12859-024-05958-5
Jorge Avila Cartes, Paola Bonizzoni, Simone Ciccolella, Gianluca Della Vedova, Luca Denti

Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling.

Results: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.

Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
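
The notion of covering an MSA exactly with a minimum-weight set of blocks can be sketched on a toy instance. The paper formulates this as an ILP; the sketch below instead enumerates subsets by brute force, which is only feasible for tiny inputs and is used here purely to show what an exact cover with weights means. The block representation, the weights, and all names are assumptions for illustration.

```python
from itertools import combinations

def block_cells(rows, start, end):
    """Cells (row, column) covered by a block spanning columns start..end inclusive."""
    return frozenset((r, c) for r in rows for c in range(start, end + 1))

def min_weight_exact_cover(universe, blocks, weights):
    """Return (cost, chosen block indices) of the cheapest selection in which
    every cell of the universe is covered exactly once, or None if none exists."""
    best = None
    for k in range(1, len(blocks) + 1):
        for combo in combinations(range(len(blocks)), k):
            # Sizes summing to |universe| plus full coverage implies no overlap.
            if sum(len(blocks[i]) for i in combo) != len(universe):
                continue
            if set().union(*(blocks[i] for i in combo)) != universe:
                continue
            cost = sum(weights[i] for i in combo)
            if best is None or cost < best[0]:
                best = (cost, combo)
    return best

# Toy MSA with 2 rows and 2 columns; weights are hypothetical.
universe = set(block_cells((0, 1), 0, 1))
blocks = [
    block_cells((0, 1), 0, 1),  # whole alignment
    block_cells((0, 1), 0, 0),  # first column, both rows
    block_cells((0, 1), 1, 1),  # second column, both rows
    block_cells((0,), 0, 1),    # first row only
]
weights = [3, 1, 1, 1]
solution = min_weight_exact_cover(universe, blocks, weights)
print(solution)
```

Changing the weights changes which cover wins, which mirrors the abstract's point that the choice of objective function shapes the resulting graph.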

Citations: 0
GPCR-BSD: a database of binding sites of human G-protein coupled receptors under diverse states.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-04 DOI: 10.1186/s12859-024-05962-9
Fan Liu, Han Zhou, Xiaonong Li, Liangliang Zhou, Chungong Yu, Haicang Zhang, Dongbo Bu, Xinmiao Liang

G-protein coupled receptors (GPCRs), the largest family of membrane proteins in human body, involve a great variety of biological processes and thus have become highly valuable drug targets. By binding with ligands (e.g., drugs), GPCRs switch between active and inactive conformational states, thereby performing functions such as signal transmission. The changes in binding pockets under different states are important for a better understanding of drug-target interactions. Therefore it is critical, as well as a practical need, to obtain binding sites in human GPCR structures. We report a database (called GPCR-BSD) that collects 127,990 predicted binding sites of 803 GPCRs under active and inactive states (thus 1,606 structures in total). The binding sites were identified from the predicted GPCR structures by executing three geometric-based pocket prediction methods, fpocket, CavityPlus and GHECOM. The server provides query, visualization, and comparison of the predicted binding sites for both GPCR predicted and experimentally determined structures recorded in PDB. We evaluated the identified pockets of 132 experimentally determined human GPCR structures in terms of pocket residue coverage, pocket center distance and redocking accuracy. The evaluation showed that fpocket and CavityPlus methods performed better and successfully predicted orthosteric binding sites in over 60% of the 132 experimentally determined structures. The GPCR Binding Site database is freely accessible at https://gpcrbs.bigdata.jcmsc.cn . This study not only provides a systematic evaluation of the commonly-used fpocket and CavityPlus methods for the first time but also meets the need for binding site information in GPCR studies.
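
Two of the evaluation criteria mentioned above, pocket center distance and pocket residue coverage, can be sketched as follows. The exact definitions used by the authors are not given in the abstract, so the ones below are assumptions: the center distance is taken as the Euclidean distance between the centroids of two point sets, and coverage as the fraction of reference pocket residues recovered by the prediction. Coordinates and residue labels are made up.

```python
import math

def centroid(points):
    """Arithmetic mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def center_distance(predicted_points, reference_points):
    """Euclidean distance between the centroids of two point sets (assumed definition)."""
    return math.dist(centroid(predicted_points), centroid(reference_points))

def residue_coverage(predicted_residues, reference_residues):
    """Fraction of reference pocket residues also present in the prediction (assumed definition)."""
    reference = set(reference_residues)
    return len(reference & set(predicted_residues)) / len(reference)

# Hypothetical pocket atoms and residue labels.
predicted_atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reference_atoms = [(1.0, 3.0, 4.0)]
print(center_distance(predicted_atoms, reference_atoms))
print(residue_coverage({"D113", "S203"}, {"D113", "S203", "F290", "N312"}))
```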

Citations: 0
MIPPIS: protein-protein interaction site prediction network with multi-information fusion.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-04 DOI: 10.1186/s12859-024-05964-7
Shuang Wang, Kaiyu Dong, Dingming Liang, Yunjing Zhang, Xue Li, Tao Song

Background: The prediction of protein-protein interaction sites plays a crucial role in biochemical processes. Investigating the interaction between viruses and receptor proteins through biological techniques aids in understanding disease mechanisms and guides the development of corresponding drugs. While various methods have been proposed in the past, they often suffer from drawbacks such as long processing times, high costs, and low accuracy.

Results: Addressing these challenges, we propose a novel protein-protein interaction site prediction network based on multi-information fusion. In our approach, the initial amino acid features are depicted by the position-specific scoring matrix, hidden Markov model, dictionary of protein secondary structure, and one-hot encoding. Simultaneously, we adopt a multi-channel approach to extract deep-level amino acids features from different perspectives. The graph convolutional network channel effectively extracts spatial structural information. The bidirectional long short-term memory channel treats the amino acid sequence as natural language, capturing the protein's primary structure information. The ProtT5 protein large language model channel outputs a more comprehensive amino acid embedding representation, providing a robust complement to the two aforementioned channels. Finally, the obtained amino acid features are fed into the prediction layer for the final prediction.

Conclusion: Compared with six protein structure-based methods and six protein sequence-based methods, our model achieves optimal performance across evaluation metrics, including accuracy, precision, F1, Matthews correlation coefficient, and area under the precision-recall curve, which demonstrates the superiority of our model.

Citations: 0
CUDASW++4.0: ultra-fast GPU-based Smith-Waterman protein sequence database search.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-02 DOI: 10.1186/s12859-024-05965-6
Bertil Schmidt, Felix Kallenborn, Alejandro Chacon, Christian Hundt

Background: The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations.
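The quadratic cost mentioned above comes from filling a dynamic-programming matrix with one cell per pair of query and subject residues. The following minimal reference sketch computes the optimal local alignment score; it is not CUDASW++4.0 code. For brevity it assumes a simple match/mismatch score and a linear gap penalty (the default values are arbitrary), whereas protein database search normally uses a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Runs in O(len(query) * len(subject)) time and keeps only two rows,
    so memory is O(len(subject)).
    """
    best = 0
    prev = [0] * (len(subject) + 1)
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap        # query residue aligned to a gap
            left = curr[j - 1] + gap  # subject residue aligned to a gap
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8
```

Every cell depends only on its left, upper, and upper-left neighbours, which is the property that parallel implementations exploit.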

Results: CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. Our approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions. We provide efficient matrix tiling and sequence database partitioning schemes, and exploit next-generation floating-point arithmetic and novel DPX instructions. This leads to close-to-peak performance on modern GPU generations (Ampere, Ada, Hopper) with throughput rates of up to 1.94 TCUPS, 5.01 TCUPS, and 5.71 TCUPS on an A100, L40S, and H100, respectively. Evaluation on the Swiss-Prot, UniRef50, and TrEMBL databases shows that CUDASW++4.0 achieves more than an order-of-magnitude performance improvement over previous GPU-based approaches (CUDASW++3.0, ADEPT, SW#DB). In addition, our algorithm demonstrates significant speedups over top-performing CPU-based tools (BLASTP, SWIPE, SWIMM2.0), can exploit multi-GPU nodes with linear scaling, and features an impressive energy efficiency of up to 15.7 GCUPS/Watt.
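Throughput here is measured in cell updates per second (CUPS): the number of dynamic-programming cells computed divided by the run time, with 1 TCUPS equal to 10^12 CUPS. The helper below shows the arithmetic. The function name is hypothetical, and the query length, database size, and run time in the usage line are made-up illustrative numbers, not measurements from the paper.

```python
def tcups(query_length, database_residues, seconds):
    """Tera cell updates per second for one query scanned against a database."""
    cells = query_length * database_residues  # one DP cell per residue pair
    return cells / seconds / 1e12


# Hypothetical example: a 500-residue query against 2e11 database residues in 20 s.
print(tcups(500, 2e11, 20.0))  # 5.0
```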

Conclusion: CUDASW++4.0 changes the standing of GPUs in protein sequence database search with Smith-Waterman alignment by providing close-to-peak performance on modern GPUs. It is freely available at https://github.com/asbschmidt/CUDASW4.

Citations: 0
Improving crop production using an agro-deep learning framework in precision agriculture.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-11-01 DOI: 10.1186/s12859-024-05970-9
J Logeshwaran, Durgesh Srivastava, K Sree Kumar, M Jenolin Rex, Amal Al-Rasheed, Masresha Getahun, Ben Othman Soufiene

Background: The study focuses on enhancing the effectiveness of precision agriculture through the application of deep learning technologies. Precision agriculture, which aims to optimize farming practices by monitoring and adjusting various factors influencing crop growth, can greatly benefit from artificial intelligence (AI) methods like deep learning. The Agro Deep Learning Framework (ADLF) was developed to tackle critical issues in crop cultivation by processing vast datasets. These datasets include variables such as soil moisture, temperature, and humidity, all of which are essential to understanding and predicting crop behavior. By leveraging deep learning models, the framework seeks to improve decision-making processes, detect potential crop problems early, and boost agricultural productivity.

Results: The study found that the Agro Deep Learning Framework (ADLF) achieved an accuracy of 85.41%, precision of 84.87%, recall of 84.24%, and an F1-Score of 88.91%, indicating strong predictive capabilities for improving crop management. The false negative rate was 91.17% and the false positive rate was 89.82%, highlighting the framework's ability to correctly detect issues while minimizing errors. These results suggest that ADLF can significantly enhance decision-making in precision agriculture, leading to improved crop yield and reduced agricultural losses.
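For reference, the metrics quoted above have standard definitions that can be checked with a few lines of arithmetic. Under the usual definition, F1 is the harmonic mean of precision and recall, so it always lies between the two. The sketch below applies that formula to the reported precision and recall; the function name is illustrative, and the abstract does not state how its F1-score, false negative rate, or false positive rate were computed or averaged.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.8487, 0.8424)
print(round(100 * f1, 2))  # 84.55
```

The result, about 84.55%, differs from the reported 88.91%. Likewise, the standard false negative rate is 1 minus recall, which would be 15.76% for the reported recall. The abstract does not say which conventions produced its figures.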

Conclusions: The ADLF can significantly improve precision agriculture by leveraging deep learning to process complex datasets and provide valuable insights into crop management. The framework allows farmers to detect issues early, optimize resource use, and improve yields. The study demonstrates that AI-driven agriculture has the potential to revolutionize farming, making it more efficient and sustainable. Future research could focus on further refining the model and exploring its applicability across different types of crops and farming environments.

Citations: 0
BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-30 DOI: 10.1186/s12859-024-05968-3
Hyojin Son, Sechan Lee, Jaeuk Kim, Haangik Park, Myeong-Ha Hwang, Gwan-Su Yi

Background: Deep learning-based drug-target affinity (DTA) prediction methods have shown impressive performance, despite a high number of training parameters relative to the available data. Previous studies have highlighted the presence of dataset bias by suggesting that models trained solely on protein or ligand structures may perform similarly to those trained on complex structures. However, these studies did not propose solutions and focused solely on analyzing complex structure-based models. Even when ligands are excluded, protein-only models trained on complex structures still incorporate some ligand information at the binding sites. Therefore, it is unclear whether binding affinity can be accurately predicted using only compound or protein features due to potential dataset bias. In this study, we expanded our analysis to comprehensive databases and investigated dataset bias through compound and protein feature-based methods using multilayer perceptron models. We assessed the impact of this bias on current prediction models and proposed the binding affinity similarity explorer (BASE) web service, which provides bias-reduced datasets.

Results: By analyzing eight binding affinity databases using multilayer perceptron models, we confirmed a bias where the compound-protein binding affinity can be accurately predicted using compound features alone. This bias arises because most compounds show consistent binding affinities due to high sequence or functional similarity among their target proteins. Our Uniform Manifold Approximation and Projection analysis based on compound fingerprints further revealed that low- and high-variation compounds do not exhibit significant structural differences. This suggests that the primary factor driving the consistent binding affinities is protein similarity rather than compound structure. We addressed this bias by creating datasets with progressively reduced protein similarity between the training and test sets, observing significant changes in model performance. We developed the BASE web service to allow researchers to download and utilize these datasets. Feature importance analysis revealed that previous models heavily relied on protein features. However, using bias-reduced datasets increased the importance of compound and interaction features, enabling a more balanced extraction of key features.
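The idea of building test sets whose proteins are progressively less similar to the training proteins can be sketched as below. This is not the BASE pipeline: the abstract does not say how similarity is measured, so the k-mer Jaccard similarity, the function names, the toy sequences, and the threshold are all stand-in assumptions.

```python
def kmer_set(sequence, k=3):
    """All overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=3):
    """Stand-in similarity: Jaccard index of the two k-mer sets (assumption)."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_test_set(train_proteins, test_proteins, max_similarity):
    """Keep test proteins whose similarity to every training protein
    is at most max_similarity."""
    return [
        t for t in test_proteins
        if all(jaccard_similarity(t, tr) <= max_similarity for tr in train_proteins)
    ]


# Toy sequences, invented for the example.
train = ["MKVLAAGIVG", "MKVLAAGIVA"]
test = ["MKVLAAGIVC", "PQRSTWYHHD"]
print(filter_test_set(train, test, max_similarity=0.5))  # ['PQRSTWYHHD']
```

Lowering `max_similarity` step by step yields a series of test sets that share less and less with the training proteins, which is one way to observe how much a model's performance depends on protein similarity.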

Conclusions: We propose the BASE web service, providing both the affinity prediction results of existing models and bias-reduced datasets. These resources contribute to the development of generalized and robust predictive models, enhancing the accuracy and reliability of DTA predictions in the drug discovery process. BASE is freely available online at https://synbi2024.kaist.ac.kr/base.

Citations: 0