
BMC Bioinformatics: Latest Articles

Drug repositioning based on residual attention network and free multiscale adversarial training.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-08-08, DOI: 10.1186/s12859-024-05893-5
Guanghui Li, Shuwen Li, Cheng Liang, Qiu Xiao, Jiawei Luo

Background: Conducting traditional wet experiments to guide drug development is an expensive, time-consuming and risky process. Analyzing drug function and repositioning plays a key role in identifying new therapeutic potential of approved drugs and discovering therapeutic approaches for untreated diseases. Exploring drug-disease associations has far-reaching implications for identifying disease pathogenesis and treatment. However, reliable detection of drug-disease relationships via traditional methods is costly and slow. Therefore, investigations into computational methods for predicting drug-disease associations are currently needed.

Results: This paper presents a novel drug-disease association prediction method, RAFGAE. First, RAFGAE integrates known associations between diseases and drugs into a bipartite network. Second, RAFGAE designs the Re_GAT framework, which includes multilayer graph attention networks (GATs) and two residual networks. The multilayer GATs are utilized for learning the node embeddings, which is achieved by aggregating information from multihop neighbors. The two residual networks are used to alleviate the deep network oversmoothing problem, and an attention mechanism is introduced to combine the node embeddings from different attention layers. Third, two graph autoencoders (GAEs) with collaborative training are constructed to simulate label propagation to predict potential associations. On this basis, free multiscale adversarial training (FMAT) is introduced. FMAT enhances node feature quality through small gradient adversarial perturbation iterations, improving the prediction performance. Finally, tenfold cross-validations on two benchmark datasets show that RAFGAE outperforms current methods. In addition, case studies have confirmed that RAFGAE can detect novel drug-disease associations.

Conclusions: The comprehensive experimental results validate the utility and accuracy of RAFGAE. We believe that this method may serve as an excellent predictor for identifying unobserved disease-drug associations.
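
The first step named in the Results, integrating known drug-disease associations into a bipartite network, can be illustrated with a minimal Python sketch. The function name, the index convention (drugs first, then diseases) and the toy associations are assumptions made for illustration; this is not the RAFGAE code, and the attention, residual and adversarial-training components are not shown.

```python
def bipartite_adjacency(n_drugs, n_diseases, associations):
    """Build a symmetric adjacency matrix for a drug-disease bipartite graph.

    Nodes 0..n_drugs-1 are drugs; nodes n_drugs..n_drugs+n_diseases-1 are
    diseases. `associations` is an iterable of (drug_index, disease_index).
    """
    n = n_drugs + n_diseases
    adj = [[0] * n for _ in range(n)]
    for drug, disease in associations:
        j = n_drugs + disease  # shift disease index past the drug block
        adj[drug][j] = 1
        adj[j][drug] = 1       # undirected edge, so mirror it
    return adj


# Hypothetical toy data: 2 drugs, 2 diseases, 3 known associations.
known = [(0, 0), (0, 1), (1, 1)]
adj = bipartite_adjacency(2, 2, known)
for row in adj:
    print(row)
```

Only the off-diagonal blocks are ever filled, because a bipartite graph has no drug-drug or disease-disease edges.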

Citations: 0
Maptcha: an efficient parallel workflow for hybrid genome scaffolding.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-08-08, DOI: 10.1186/s12859-024-05878-4
Oieswarya Bhowmik, Tazin Rahman, Ananth Kalyanaraman

Background: Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long read sequencing technologies toward more accurate long reads, coupled with the continued use of short read technologies, has created a unique need for hybrid assembly workflows. The construction of accurate genomic scaffolds in hybrid workflows is complicated due to scale, sequencing technology diversity (e.g., short vs. long reads, contigs or partial assemblies), and repetitive regions within a target genome.

Results: In this paper, we present a new parallel workflow for hybrid genome scaffolding that would allow combining pre-constructed partial assemblies with newly sequenced long reads toward an improved assembly. More specifically, the workflow, called Maptcha, is aimed at generating long scaffolds of a target genome from two sets of input sequences: an already constructed partial assembly of contigs, and a set of newly sequenced long reads. Our scaffolding approach internally uses an alignment-free mapping step to build a ⟨contig, contig⟩ graph using long reads as linking information. Subsequently, this graph is used to generate scaffolds. We present and evaluate a graph-theoretic "wiring" heuristic to perform this scaffolding step. To enable efficient workload management in a parallel setting, we use a batching technique that partitions the scaffolding tasks so that the more expensive alignment-based assembly step at the end can be efficiently parallelized. This step also allows the use of any standalone assembler for generating the final scaffolds.

Conclusions: Our experiments with Maptcha on a variety of input genomes, and comparison against two state-of-the-art hybrid scaffolders, demonstrate that Maptcha is able to generate longer and more accurate scaffolds substantially faster. In almost all cases, the scaffolds produced by Maptcha are at least an order of magnitude longer (in some cases two orders) than the scaffolds produced by state-of-the-art tools. Maptcha runs significantly faster too, reducing time-to-solution from hours to minutes for most input cases. We also performed a coverage experiment by varying the sequencing coverage depth for long reads, which demonstrated the potential of Maptcha to generate significantly longer scaffolds in low coverage settings (1× to 10×).
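
The idea of a ⟨contig, contig⟩ graph linked by long reads can be sketched in a few lines of Python. This is an illustrative simplification, not Maptcha's implementation: the input dictionary (read ID to the contigs it maps to) stands in for the output of the mapping step, and the edge-weight rule (count of supporting reads) is an assumption.

```python
from collections import Counter
from itertools import combinations


def contig_graph(read_to_contigs):
    """Return edge weights between contigs that share at least one long read.

    `read_to_contigs` maps a read ID to the list of contigs it was mapped to.
    An edge (a, b) gets weight equal to the number of reads touching both.
    """
    edges = Counter()
    for contigs in read_to_contigs.values():
        # De-duplicate and sort so each undirected pair has one canonical key.
        for a, b in combinations(sorted(set(contigs)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical mapping results for four long reads.
hits = {
    "read1": ["c1", "c2"],
    "read2": ["c2", "c1"],
    "read3": ["c2", "c3"],
    "read4": ["c4"],  # maps to a single contig, so it links nothing
}
graph = contig_graph(hits)
print(dict(graph))
```

A scaffolder would then walk or partition this weighted graph; the "wiring" heuristic and batching described in the abstract are beyond this sketch.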

Citations: 0
RankCompV3: a differential expression analysis algorithm based on relative expression orderings and applications in single-cell RNA transcriptomics.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-08-07, DOI: 10.1186/s12859-024-05889-1
Jing Yan, Qiuhong Zeng, Xianlong Wang

Background: Effective identification of differentially expressed genes (DEGs) has been challenging for single-cell RNA sequencing (scRNA-seq) profiles. Many existing algorithms have high false positive rates (FPRs) and often fail to identify weak biological signals.

Results: We present a novel method for identifying DEGs in scRNA-seq data called RankCompV3. It is based on the comparison of relative expression orderings (REOs) of gene pairs, which are determined by comparing the expression levels of a pair of genes in a set of single-cell profiles. For each gene of interest, the numbers of genes with consistently higher or lower expression levels are counted in each of the two groups being compared, and the result is tabulated in a 3 × 3 contingency table, which is tested by McCullagh's method to determine whether the gene is dysregulated. In both simulated and real scRNA-seq data, RankCompV3 tightly controlled the FPR and demonstrated high accuracy, outperforming 11 other common single-cell DEG detection algorithms. Analysis with either regular single-cell or synthetic pseudo-bulk profiles produced DEGs highly concordant with the ground truth. In addition, RankCompV3 demonstrates higher sensitivity to weak biological signals than other methods. The algorithm was implemented in Julia and can be called from R. The source code is available at https://github.com/pathint/RankCompV3.jl.

Conclusions: The REOs-based algorithm is a valuable tool for analyzing single-cell RNA profiles and identifying DEGs with high accuracy and sensitivity.
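
The 3 × 3 contingency table described in the Results can be illustrated with a small Python sketch. It is not the RankCompV3.jl code: the rule used here for a "consistent" ordering (a fixed fraction of cells, `threshold`) is an assumption chosen for simplicity, the toy expression values are made up, and McCullagh's test on the resulting table is not implemented.

```python
def reo_state(target, other, threshold=0.9):
    """Classify how `other` is ordered relative to `target` across cells.

    Returns 0 if `other` is higher in at least `threshold` of the cells,
    1 if it is lower in at least `threshold` of the cells, else 2.
    """
    n = len(target)
    higher = sum(o > t for o, t in zip(other, target))
    lower = sum(o < t for o, t in zip(other, target))
    if higher / n >= threshold:
        return 0
    if lower / n >= threshold:
        return 1
    return 2


def reo_table(group1, group2, gene, threshold=0.9):
    """Tabulate partner-gene REO states: rows = group 1, columns = group 2."""
    table = [[0] * 3 for _ in range(3)]
    for other in group1:
        if other == gene:
            continue
        row = reo_state(group1[gene], group1[other], threshold)
        col = reo_state(group2[gene], group2[other], threshold)
        table[row][col] += 1
    return table


# Hypothetical expression values: gene -> one value per cell, for two groups.
group1 = {"G": [5, 5, 5, 5], "A": [9, 8, 7, 9], "B": [1, 2, 1, 0], "C": [4, 6, 4, 6]}
group2 = {"G": [5, 5, 5, 5], "A": [9, 9, 8, 7], "B": [8, 9, 7, 9], "C": [6, 4, 6, 4]}
table = reo_table(group1, group2, "G")
for row in table:
    print(row)
```

Off-diagonal counts, such as partner gene B flipping from lower to higher, are what a symmetry test on the table would pick up as evidence that the gene of interest has shifted between groups.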

Citations: 0
Correction: Holomics ‑ a user‑friendly R shiny application for multi‑omics data integration and analysis.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-08-07, DOI: 10.1186/s12859-024-05868-6
Katharina Munk, Daria Ilina, Lisa Ziemba, Günter Brader, Eva M Molin
Citations: 0
scMaui: a widely applicable deep learning framework for single-cell multiomics integration in the presence of batch effects and missing data.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-08-06, DOI: 10.1186/s12859-024-05880-w
Yunhee Jeong, Jonathan Ronen, Wolfgang Kopp, Pavlo Lutsik, Altuna Akalin

The recent advances in high-throughput single-cell sequencing have created an urgent demand for computational models which can address the high complexity of single-cell multiomics data. Meticulous single-cell multiomics integration models are required to avoid biases towards a specific modality and overcome sparsity. Batch effects obfuscating biological signals must also be taken into account. Here, we introduce a new single-cell multiomics integration model, Single-cell Multiomics Autoencoder Integration (scMaui), based on variational product-of-experts autoencoders and adversarial learning. scMaui calculates a joint representation of multiple marginal distributions based on a product-of-experts approach, which is especially effective for missing values in the modalities. Furthermore, it overcomes limitations seen in previous VAE-based integration methods with regard to batch effect correction and restricted applicable assays. It handles multiple batch effects independently, accepting both discrete and continuous values, and provides varied reconstruction loss functions to cover all possible assays and preprocessing pipelines. We demonstrate that scMaui achieves superior performance in many tasks compared to other methods. Further downstream analyses also demonstrate its potential in identifying relations between assays and discovering hidden subpopulations.
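
Why a product-of-experts combination copes naturally with missing modalities can be seen in the one-dimensional Gaussian case, sketched below in Python. This illustrates the general technique under simplifying assumptions (scalar latent, Gaussian experts, no prior expert); it is not scMaui's implementation, and the function name and toy numbers are hypothetical.

```python
def product_of_experts(experts):
    """Combine 1-D Gaussian experts given as (mean, variance) tuples.

    A missing modality is passed as None and simply contributes no factor.
    The product of Gaussian densities is proportional to a Gaussian whose
    precision is the sum of precisions and whose mean is precision-weighted.
    """
    present = [e for e in experts if e is not None]
    if not present:
        raise ValueError("at least one modality must be observed")
    precision = sum(1.0 / var for _, var in present)
    mean = sum(mu / var for mu, var in present) / precision
    return mean, 1.0 / precision


# Two observed modalities and one missing modality (hypothetical values).
joint_mean, joint_var = product_of_experts([(0.0, 1.0), (2.0, 1.0), None])
print(joint_mean, joint_var)
```

Because each expert only adds precision, dropping a modality widens the joint distribution instead of requiring imputation of the missing input.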

Citations: 0
StackedEnC-AOP: prediction of antioxidant proteins using transform evolutionary and sequential features based multi-scale vector with stacked ensemble learning.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-08-04, DOI: 10.1186/s12859-024-05884-6
Gul Rukh, Shahid Akbar, Gauhar Rehman, Fawaz Khaled Alarfaj, Quan Zou

Background: Antioxidant proteins are involved in several biological processes and can protect DNA and cells from the damage of free radicals. These proteins regulate the body's oxidative stress and perform a significant role in many antioxidant-based drugs. The current in vitro-based medications are costly, time-consuming, and unable to efficiently screen and identify the targeted motif of antioxidant proteins.

Methods: We propose StackedEnC-AOP, an accurate prediction method for discriminating antioxidant proteins. The training sequences are encoded by incorporating a discrete wavelet transform (DWT) into the evolutionary matrix: the PSSM-based images are decomposed via two levels of DWT to form a pseudo position-specific scoring matrix (PsePSSM-DWT) based embedded vector. Additionally, the evolutionary difference formula and composite physicochemical properties methods are employed to collect the structural and sequential descriptors. The sequential features, evolutionary descriptors, and physicochemical properties are then combined into a single vector to cover the flaws of the individual encoding schemes. To reduce the computational cost of the combined feature vector, the optimal features are chosen using minimum redundancy and maximum relevance (mRMR). A stacking-based ensemble meta-model is then trained on the optimal feature vector.

Results: On the training sequences, StackedEnC-AOP achieved a prediction accuracy of 98.40% and an AUC of 0.99. On an independent validation set, the trained model achieved an accuracy of 96.92% and an AUC of 0.98.

Conclusion: StackedEnC-AOP performed significantly better than current computational models, improving accuracy by ~5% and ~3% on the training and independent sets, respectively. Its efficacy and consistency make it a valuable tool for data scientists, and it can play a key role in academic research and drug design.
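
The two-level DWT decomposition mentioned in the Methods can be illustrated on a single numeric column, for example one column of a PSSM. The sketch below uses the Haar wavelet purely as an assumption, since the abstract does not name the wavelet family, and it is not the authors' feature-extraction code.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT: (approximation, detail).

    Samples are processed in pairs; a trailing unpaired sample is dropped.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail


def dwt_two_level(signal):
    """Apply two Haar levels; return (approx2, detail2, detail1)."""
    approx1, detail1 = haar_step(signal)
    approx2, detail2 = haar_step(approx1)  # second level acts on the approximation
    return approx2, detail2, detail1


# Hypothetical score column of length 8.
column = [4, 4, 2, 2, 6, 6, 8, 8]
a2, d2, d1 = dwt_two_level(column)
print(a2, d2, d1)
```

Summary statistics of each sub-band (for example mean and standard deviation) are a common way to turn such coefficients into a fixed-length feature vector; whether the paper does exactly this is not stated in the abstract.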

Citations: 0
Random forests for the analysis of matched case–control studies
IF 3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-08-01, DOI: 10.1186/s12859-024-05877-5
Gunther Schauberger, Stefanie J. Klug, Moritz Berger
Conditional logistic regression trees have been proposed as a flexible alternative to the standard method of conditional logistic regression for the analysis of matched case–control studies. While they avoid the strict assumption of linearity and automatically incorporate interactions, conditional logistic regression trees may suffer from relatively high variability. Further machine learning methods for the analysis of matched case–control studies are missing because conventional machine learning methods cannot handle the matched structure of the data. A random forest method for the analysis of matched case–control studies based on conditional logistic regression trees is proposed, which overcomes the issue of high variability. It provides an accurate estimation of exposure effects while being more flexible in the functional form of covariate effects. The efficacy of the method is illustrated in a simulation study and in an application to real-world data from a matched case–control study on the effect of regular participation in cervical cancer screening on the development of cervical cancer. The proposed random forest method is a promising addition to the toolbox for the analysis of matched case–control studies and addresses the need for machine-learning methods in this field. It provides a more flexible approach than both the standard method of conditional logistic regression and conditional logistic regression trees. It allows for non-linearity and the automatic inclusion of interaction effects and is suitable for both exploratory and explanatory analyses.
{"title":"Random forests for the analysis of matched case–control studies","authors":"Gunther Schauberger, Stefanie J. Klug, Moritz Berger","doi":"10.1186/s12859-024-05877-5","DOIUrl":"https://doi.org/10.1186/s12859-024-05877-5","url":null,"abstract":"Conditional logistic regression trees have been proposed as a flexible alternative to the standard method of conditional logistic regression for the analysis of matched case–control studies. While they allow to avoid the strict assumption of linearity and automatically incorporate interactions, conditional logistic regression trees may suffer from a relatively high variability. Further machine learning methods for the analysis of matched case–control studies are missing because conventional machine learning methods cannot handle the matched structure of the data. A random forest method for the analysis of matched case–control studies based on conditional logistic regression trees is proposed, which overcomes the issue of high variability. It provides an accurate estimation of exposure effects while being more flexible in the functional form of covariate effects. The efficacy of the method is illustrated in a simulation study and within an application to real-world data from a matched case–control study on the effect of regular participation in cervical cancer screening on the development of cervical cancer. The proposed random forest method is a promising add-on to the toolbox for the analysis of matched case–control studies and addresses the need for machine-learning methods in this field. It provides a more flexible approach compared to the standard method of conditional logistic regression, but also compared to conditional logistic regression trees. 
It allows for non-linearity and the automatic inclusion of interaction effects and is suitable both for exploratory and explanatory analyses.","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Ant colony optimization for the identification of dysregulated gene subnetworks from expression data
IF 3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-08-01 DOI: 10.1186/s12859-024-05871-x
Eileen Marie Hanna, Ghadi El Hasbani, Danielle Azar
High-throughput experimental technologies can provide deeper insights into pathway perturbations in biomedical studies. Accordingly, their usage is central to the identification of molecular targets and the subsequent development of suitable treatments for various diseases. Classical interpretations of generated data, such as differential gene expression and pathway analyses, disregard interconnections between studied genes when looking for gene-disease associations. Given that these interconnections are central to cellular processes, there has been a recent interest in incorporating them in such studies. Doing so allows the detection of gene modules that underlie complex phenotypes in gene interaction networks. Existing methods either impose radius-based restrictions or freely grow modules at the expense of a statistical bias towards large modules. We propose a heuristic method, inspired by Ant Colony Optimization, to apply gene-level scoring and module identification with distance-based search constraints and penalties, rather than radius-based constraints. We test and compare our results to other approaches using three datasets of different neurodegenerative diseases, namely Alzheimer’s, Parkinson’s, and Huntington’s, over three independent experiments. We report the outcomes of enrichment analyses and concordance of gene-level scores for each disease. Results indicate that the proposed approach generally shows superior stability in comparison to existing methods. It produces stable and meaningful enrichment results in all three datasets, which have different case-to-control proportions and sample sizes. The presented network-based gene expression analysis approach successfully identifies dysregulated gene modules associated with a certain disease. Using a heuristic based on Ant Colony Optimization, we perform a distance-based search with no radius constraints. Experimental results support the effectiveness and stability of our method in prioritizing modules of high relevance. Our tool is publicly available at github.com/GhadiElHasbani/ACOxGS.git.
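As background on the heuristic family the abstract refers to, the sketch below shows the two textbook ingredients of an Ant System style search: choosing the next node in proportion to pheromone and a heuristic score, then evaporating and depositing pheromone. It is a generic illustration, not the ACOxGS implementation. Treating the heuristic value as a gene-level score is an assumption, the parameter names and gene labels are hypothetical, and the paper's distance-based constraints and penalties are not modeled.

```python
def transition_probabilities(pheromone, heuristic, alpha=1.0, beta=1.0):
    """Probability of moving to each candidate node (generic Ant System rule).

    pheromone, heuristic: dicts keyed by candidate node (hypothetical layout).
    alpha, beta: exponents weighting pheromone against the heuristic score.
    """
    weights = {
        node: (pheromone[node] ** alpha) * (heuristic[node] ** beta)
        for node in pheromone
    }
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


def update_pheromone(pheromone, visited, quality, evaporation=0.5):
    """Evaporate pheromone everywhere, then deposit `quality` on visited nodes."""
    return {
        node: (1.0 - evaporation) * level + (quality if node in visited else 0.0)
        for node, level in pheromone.items()
    }


# Toy example with two hypothetical genes.
probabilities = transition_probabilities(
    {"GENE_A": 1.0, "GENE_B": 1.0}, {"GENE_A": 3.0, "GENE_B": 1.0}
)
updated = update_pheromone({"GENE_A": 1.0, "GENE_B": 1.0}, {"GENE_A"}, quality=1.0)
```

In this toy setup the higher-scoring node is proportionally more likely to be chosen, and nodes outside the visited module lose pheromone through evaporation alone.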
Citations: 0
Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-08-01 DOI: 10.1186/s12859-024-05861-z
Nicholas Aksamit, Alain Tchagang, Yifeng Li, Beatrice Ombuki-Berman

Background: Drug discovery and development is the extremely costly and time-consuming process of identifying new molecules that can interact with a biomarker target to interrupt the disease pathway of interest. In addition to binding the target, a drug candidate needs to satisfy multiple properties affecting absorption, distribution, metabolism, excretion, and toxicity (ADMET). Artificial intelligence approaches provide an opportunity to improve each step of the drug discovery and development process; the first question is how a molecule can be represented informatively so that the in-silico solutions are optimized.

Results: This study introduces a novel hybrid SMILES-fragment tokenization method, coupled with two pre-training strategies, utilizing a Transformer-based model. We investigate the efficacy of hybrid tokenization in improving the performance of ADMET prediction tasks. Our approach leverages MTL-BERT, an encoder-only Transformer model that achieves state-of-the-art ADMET predictions, and contrasts the standard SMILES tokenization with our hybrid method across a spectrum of fragment library cutoffs.

Conclusion: The findings reveal that while an excess of fragments can impede performance, using hybrid tokenization with high-frequency fragments enhances results beyond the base SMILES tokenization. This advancement underscores the potential of integrating fragment- and character-level molecular features within the training of Transformer models for ADMET property prediction.
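To make the idea of mixing fragment-level and character-level tokens concrete, here is a minimal sketch that assumes fragments are matched as literal substrings of the SMILES string, longest first, with a character-level fallback. That matching rule, the function name, and the example fragment set are assumptions for illustration. The abstract does not describe how the authors derive or match fragments, so this should not be read as their tokenizer.

```python
def hybrid_tokenize(smiles, fragments):
    """Split a SMILES string into fragment tokens where possible, else characters.

    fragments: iterable of SMILES substrings treated as single tokens
               (a hypothetical stand-in for a high-frequency fragment library).
    """
    # Try longer fragments first so the most specific match wins.
    ordered = sorted(fragments, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(smiles):
        match = next((f for f in ordered if f and smiles.startswith(f, i)), None)
        if match is not None:
            tokens.append(match)
            i += len(match)
        elif smiles.startswith(("Cl", "Br"), i):
            # Keep the common two-letter element symbols together in the fallback.
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


aspirin_like = hybrid_tokenize("CC(=O)Oc1ccccc1", {"c1ccccc1", "C(=O)O"})
fallback_only = hybrid_tokenize("CCl", set())
```

Shrinking or growing the fragment set changes how much of each string is covered by fragment tokens, which is the kind of cutoff the abstract says was varied.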

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11295479/pdf/
Citations: 0
DGCPPISP: a PPI site prediction model based on dynamic graph convolutional network and two-stage transfer learning.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-07-31 DOI: 10.1186/s12859-024-05864-w
Zijian Feng, Weihong Huang, Haohao Li, Hancan Zhu, Yanlei Kang, Zhong Li

Background: Proteins play a pivotal role in the diverse array of biological processes, making the precise prediction of protein-protein interaction (PPI) sites critical to numerous disciplines including biology, medicine and pharmacy. While deep learning methods have progressively been implemented for the prediction of PPI sites within proteins, the task of enhancing their predictive performance remains an arduous challenge.

Results: In this paper, we propose a novel PPI site prediction model (DGCPPISP) based on a dynamic graph convolutional neural network and a two-stage transfer learning strategy. Initially, we implement the transfer learning from dual perspectives, namely feature input and model training that serve to supply efficacious prior knowledge for our model. Subsequently, we construct a network designed for the second stage of training, which is built on the foundation of dynamic graph convolution.

Conclusions: To evaluate its effectiveness, the performance of the DGCPPISP model is scrutinized using two benchmark datasets. The ensuing results demonstrate that DGCPPISP outshines competing methods in terms of performance. Specifically, DGCPPISP surpasses the second-best method, EGRET, by margins of 5.9%, 10.1%, and 13.3% for the F1-measure, AUPRC, and MCC metrics, respectively, on Dset_186_72_PDB164. Similarly, on Dset_331, it eclipses the performance of the runner-up method, HN-PPISP, by 14.5%, 19.8%, and 29.9%, respectively.
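Two of the metrics quoted above, the F1-measure and the Matthews correlation coefficient (MCC), follow directly from the counts in a binary confusion matrix. The sketch below uses their standard definitions; the function name is illustrative, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the abstract. AUPRC is omitted because it needs ranked prediction scores rather than counts.

```python
import math


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) from binary confusion-matrix counts."""
    # F1 is the harmonic mean of precision and recall.
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # MCC uses all four cells, so it stays informative when classes are imbalanced.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return f1, mcc


perfect = f1_and_mcc(tp=5, fp=0, fn=0, tn=5)
chance = f1_and_mcc(tp=2, fp=2, fn=2, tn=2)
```

A perfect classifier scores 1 on both metrics, while a classifier whose predictions are unrelated to the labels has an MCC of 0 even though its F1 can be well above zero.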

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11293074/pdf/
Citations: 0