Briefings in Bioinformatics: latest articles, page 2

Deep learning in template-free de novo biosynthetic pathway design of natural products.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae495

Xueying Xie, Lin Gui, Baixue Qiao, Guohua Wang, Shan Huang, Yuming Zhao, Shanwen Sun

Natural products (NPs) are indispensable in drug development, particularly in combating infections, cancer, and neurodegenerative diseases. However, their limited availability poses significant challenges. Template-free de novo biosynthetic pathway design provides a strategic solution for NP production, with deep learning standing out as a powerful tool in this domain. This review delves into state-of-the-art deep learning algorithms in NP biosynthesis pathway design. It provides an in-depth discussion of databases like Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and UniProt, which are essential for model training, along with chemical databases such as Reaxys, SciFinder, and PubChem for transfer learning to expand models' understanding of the broader chemical space. It evaluates the potential and challenges of sequence-to-sequence and graph-to-graph translation models for accurate single-step prediction. Additionally, it discusses search algorithms for multistep prediction and deep learning algorithms for predicting enzyme function. The review also highlights the pivotal role of deep learning in improving catalytic efficiency through enzyme engineering, which is essential for enhancing NP production. Moreover, it examines the application of large language models in pathway design, enzyme discovery, and enzyme engineering. Finally, it addresses the challenges and prospects associated with template-free approaches, offering insights into potential advancements in NP biosynthesis pathway design.

Volume 25, issue 6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11456888/pdf/

Citations: 0

A robust statistical approach for finding informative spatially associated pathways.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae543

Leqi Tian, Jiashun Xiao, Tianwei Yu

Spatial transcriptomics offers deep insights into cellular functional localization and communication by mapping gene expression to spatial locations. Traditional approaches that focus on selecting spatially variable genes often overlook the complexity of biological pathways and the interactions among genes. Here, we introduce a novel framework that shifts the focus towards directly identifying functional pathways associated with spatial variability by adapting the Brownian distance covariance test in an innovative manner to explore the heterogeneity of biological functions over space. Unlike most other methods, this statistical testing approach is free of gene selection and parameter selection and allows nonlinear and complex dependencies. It allows for a deeper understanding of how cells coordinate their activities across different spatial domains through biological pathways. By analyzing real human and mouse datasets, the method found significant pathways that were associated with spatial variation, as well as different pathway patterns among inner- and edge-cancer regions. This innovative framework offers a new perspective on analyzing spatial transcriptomic data, contributing to our understanding of tissue architecture and disease pathology. The implementation is publicly available at https://github.com/tianlq-prog/STpathway.
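As a rough illustration of the statistic named in this abstract, the sketch below computes a plain sample distance covariance between spot coordinates and per-spot pathway scores, with a permutation p-value. It is not the authors' adaptation (see the STpathway repository for that): the function names, the use of a permutation test, and the toy inputs are all assumptions made for this example.

```python
import math
import random


def centered_distances(points):
    """Pairwise Euclidean distances with row, column and grand means removed."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row_mean = [sum(r) / n for r in d]
    grand_mean = sum(row_mean) / n
    # d is symmetric, so column means equal row means.
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def dcov_sq(a, b, order=None):
    """Squared sample distance covariance from two centred distance matrices.

    `order` optionally relabels the samples of `b` (used for permutations).
    """
    n = len(a)
    idx = order if order is not None else range(n)
    return sum(a[i][j] * b[idx[i]][idx[j]]
               for i in range(n) for j in range(n)) / (n * n)


def permutation_test(coords, scores, n_perm=199, seed=0):
    """Return (observed statistic, permutation p-value).

    coords: one (x, y) tuple per spot; scores: one tuple of values per spot.
    The permutation scheme is an assumption of this sketch.
    """
    rng = random.Random(seed)
    a = centered_distances(coords)
    b = centered_distances(scores)
    observed = dcov_sq(a, b)
    order = list(range(len(coords)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # break the pairing between location and scores
        if dcov_sq(a, b, order) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Because the statistic is built from pairwise distances only, it can pick up nonlinear dependence between location and expression without selecting individual genes, which is the property the abstract emphasizes.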

Volume 25, issue 6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11503753/pdf/

Citations: 0

Prototype-based contrastive substructure identification for molecular property prediction.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae565

Gaoqi He, Shun Liu, Zhuoran Liu, Changbo Wang, Kai Zhang, Honglin Li

Substructure-based representation learning has emerged as a powerful approach to featurize complex attributed graphs, with promising results in molecular property prediction (MPP). However, existing MPP methods mainly rely on manually defined rules to extract substructures. It remains an open challenge to adaptively identify meaningful substructures from numerous molecular graphs to accommodate MPP tasks. To this end, this paper proposes Prototype-based cOntrastive Substructure IdentificaTion (POSIT), a self-supervised framework to autonomously discover substructural prototypes across graphs so as to guide end-to-end molecular fragmentation. During pre-training, POSIT emphasizes two key aspects of substructure identification: firstly, it imposes a soft connectivity constraint to encourage the generation of topologically meaningful substructures; secondly, it aligns resultant substructures with derived prototypes through a prototype-substructure contrastive clustering objective, ensuring attribute-based similarity within clusters. In the fine-tuning stage, a cross-scale attention mechanism is designed to integrate substructure-level information to enhance molecular representations. The effectiveness of the POSIT framework is demonstrated by experimental results from diverse real-world datasets, covering both classification and regression tasks. Moreover, visualization analysis validates the consistency of chemical priors with identified substructures. The source code is publicly available at https://github.com/VRPharmer/POSIT.

Volume 25, issue 6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11533112/pdf/

Citations: 0

COFFEE: consensus single cell-type specific inference for gene regulatory networks.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae457

Musaddiq K Lodi, Anna Chernikov, Preetam Ghosh

The inference of gene regulatory networks (GRNs) is crucial to understanding the regulatory mechanisms that govern biological processes. GRNs may be represented as edges in a graph, and hence they have been inferred computationally from scRNA-seq data. A wisdom-of-crowds approach that integrates edges from several GRNs to create one composite GRN has demonstrated improved performance when compared with individual algorithm implementations on bulk RNA-seq and microarray data. In an effort to extend this approach to scRNA-seq data, we present COFFEE (COnsensus single cell-type speciFic inFerence for gEnE regulatory networks), a Borda voting-based consensus algorithm that integrates information from 10 established GRN inference methods. We conclude that COFFEE has improved performance across synthetic, curated, and experimental datasets when compared with baseline methods. Additionally, we show that a modified version of COFFEE can be leveraged to improve performance on newer cell-type specific GRN inference methods. Overall, our results demonstrate that consensus-based methods with pertinent modifications continue to be valuable for GRN inference at the single-cell level. While COFFEE is benchmarked on 10 algorithms, it is a flexible strategy that can incorporate any set of GRN inference algorithms according to user preference. A Python implementation of COFFEE may be found on GitHub: https://github.com/lodimk2/coffee.
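To make the Borda voting idea concrete, here is a minimal sketch of a textbook Borda count over ranked edge lists. It is not taken from the COFFEE repository: the function name, the scoring convention (an edge at position p among n distinct edges earns n - p points, and an unranked edge earns none), the alphabetical tie-break, and the toy edges are all assumptions for illustration.

```python
from collections import defaultdict


def borda_consensus(rankings):
    """Merge ranked edge lists (best first) into one consensus ranking.

    Each list is assumed to contain no duplicate edges.
    """
    edges = {edge for ranking in rankings for edge in ranking}
    n = len(edges)
    points = defaultdict(int)
    for ranking in rankings:
        for position, edge in enumerate(ranking):
            points[edge] += n - position  # edges missing from a list get 0
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(edges, key=lambda e: (-points[e], e))


# Three hypothetical inference methods ranking (regulator, target) edges.
rankings = [
    [("TF1", "G1"), ("TF2", "G1"), ("TF1", "G2")],
    [("TF2", "G1"), ("TF1", "G1")],
    [("TF1", "G1"), ("TF1", "G2")],
]
consensus = borda_consensus(rankings)
```

Here ("TF1", "G1") collects 8 points, ("TF2", "G1") 5 and ("TF1", "G2") 3, so an edge ranked well by most methods rises above one that a single method ranks highly.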

Volume 25, issue 6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11418232/pdf/

Citations: 0

Diagnostics of viral infections using high-throughput genome sequencing data.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae501

Haochen Ning, Ian Boyes, Ibrahim Numanagić, Michael Rott, Li Xing, Xuekui Zhang

Plant viral infections cause significant economic losses, totalling $350 billion USD in 2021. With no treatment for virus-infected plants, accurate and efficient diagnosis is crucial to preventing and controlling these diseases. High-throughput sequencing (HTS) enables cost-efficient identification of known and unknown viruses. However, existing diagnostic pipelines face challenges. First, many methods depend on subjectively chosen parameter values, undermining their robustness across various data sources. Second, artifacts (e.g. false peaks) in the mapped sequence data can lead to incorrect diagnostic results. While some methods require manual or subjective verification to address these artifacts, others overlook them entirely, affecting the overall method performance and leading to imprecise or labour-intensive outcomes. To address these challenges, we introduce IIMI, a new automated analysis pipeline using machine learning to diagnose infections from 1583 plant viruses with HTS data. It adopts a data-driven approach for parameter selection, reducing subjectivity, and automatically filters out regions affected by artifacts, thus improving accuracy. Testing with in-house and published data shows IIMI's superiority over existing methods. Besides a prediction model, IIMI also provides resources on plant virus genomes, including annotations of regions prone to artifacts. The method is available as an R package (iimi) on CRAN and will integrate with the web application www.virtool.ca, enhancing accessibility and user convenience.

Volume 25, issue 6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11483527/pdf/

Citations: 0

Deep learning model for protein multi-label subcellular localization and function prediction based on multi-task collaborative training.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae568

Peihao Bai, Guanghui Li, Jiawei Luo, Cheng Liang

The functional study of proteins is a critical task in modern biology, playing a pivotal role in understanding the mechanisms of pathogenesis, developing new drugs, and discovering novel drug targets. However, existing computational models for subcellular localization face significant challenges, such as reliance on known Gene Ontology (GO) annotation databases or overlooking the relationship between GO annotations and subcellular localization. To address these issues, we propose DeepMTC, an end-to-end deep learning-based multi-task collaborative training model. DeepMTC integrates the interrelationship between subcellular localization and the functional annotation of proteins, leveraging multi-task collaborative training to eliminate dependence on known GO databases. This strategy gives DeepMTC a distinct advantage in predicting newly discovered proteins without prior functional annotations. First, DeepMTC leverages a high-accuracy pre-trained language model to obtain the 3D structure and sequence features of proteins. Additionally, it employs a graph transformer module to encode protein sequence features, addressing the problem of long-range dependencies in graph neural networks. Finally, DeepMTC uses a functional cross-attention mechanism to efficiently combine upstream learned functional features to perform the subcellular localization task. The experimental results demonstrate that DeepMTC outperforms state-of-the-art models in both protein function prediction and subcellular localization. Moreover, interpretability experiments revealed that DeepMTC can accurately identify the key residues and functional domains of proteins, confirming its superior performance. The code and dataset of DeepMTC are freely available at https://github.com/ghli16/DeepMTC.

Volume 25, issue 6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11531862/pdf/

Citations: 0

Model ensembling as a tool to form interpretable multi-omic predictors of cancer pharmacosensitivity.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae567

Sébastien De Landtsheer, Apurva Badkas, Dagmar Kulms, Thomas Sauter

Stratification of patients diagnosed with cancer has become a major goal in personalized oncology. One important aspect is the accurate prediction of the response to various drugs. It is expected that the molecular characteristics of the cancer cells contain enough information to retrieve specific signatures, allowing for accurate predictions based solely on these multi-omic data. Ideally, these predictions should be explainable to clinicians, so that they can be integrated into patient care. We propose a machine-learning framework based on ensemble learning to integrate multi-omic data and predict sensitivity to an array of commonly used and experimental compounds, including chemotoxic compounds and targeted kinase inhibitors. We trained a set of classifiers on the different parts of our dataset to produce omic-specific signatures, then trained a random forest classifier on these signatures to predict drug responsiveness. We used the Cancer Cell Line Encyclopedia dataset, comprising multi-omic and drug sensitivity measurements for hundreds of cell lines, to build the predictive models, and validated the results using nested cross-validation. Our results show good performance for several compounds (area under the receiver operating characteristic curve >79%) across the most frequent cancer types. Furthermore, the simplicity of our approach allows us to examine which omic layers have greater importance in the models and to identify new putative markers of drug responsiveness. We propose several models based on small subsets of transcriptional markers with the potential to become useful tools in personalized oncology, paving the way for clinicians to use the molecular characteristics of the tumors to predict sensitivity to therapeutic compounds.
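The nested cross-validation mentioned in this abstract can be pictured with the following index-only sketch. It only shows how the splits nest: the fold counts, function names, and the absence of shuffling or stratification are assumptions for this example, and the abstract does not state the authors' actual settings.

```python
def k_folds(indices, k):
    """Yield (train, held_out) index lists for k interleaved folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out


def nested_cv(n_samples, outer_k=5, inner_k=3):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    Inner splits are drawn from outer_train only, so model selection
    never sees the samples reserved for the outer performance estimate.
    """
    for outer_train, outer_test in k_folds(list(range(n_samples)), outer_k):
        yield outer_train, outer_test, list(k_folds(outer_train, inner_k))
```

In a stacked setup like the one described, the omic-specific classifiers and the second-level classifier would both be tuned on the inner splits, and each outer test fold would be scored once with the selected configuration.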

Briefings in bioinformatics, volume 25, issue 6. DOI: 10.1093/bib/bbae567. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11532660/pdf/

Citations: 0

siRNADiscovery: a graph neural network for siRNA efficacy prediction via deep RNA sequence analysis.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae563

Rongzhuo Long, Ziyu Guo, Da Han, Boxiang Liu, Xudong Yuan, Guangyong Chen, Pheng-Ann Heng, Liang Zhang

The clinical adoption of small interfering RNAs (siRNAs) has prompted the development of various computational strategies for siRNA design, from traditional data analysis to advanced machine learning techniques. However, previous studies have inadequately considered the full complexity of the siRNA silencing mechanism, neglecting critical elements such as siRNA positioning on mRNA, RNA base-pairing probabilities, and RNA-AGO2 interactions, thereby limiting the insight and accuracy of existing models. Here, we introduce siRNADiscovery, a Graph Neural Network (GNN) framework that leverages both non-empirical and empirical rule-based features of siRNA and mRNA to effectively capture the complex dynamics of gene silencing. On multiple internal datasets, siRNADiscovery achieves state-of-the-art performance. Significantly, siRNADiscovery also outperforms existing methodologies in in vitro studies and on an externally validated dataset. Additionally, we develop a new data-splitting methodology that addresses the data leakage issue, a frequently overlooked problem in previous studies, ensuring the robustness and stability of our model under various experimental settings. Through rigorous testing, siRNADiscovery has demonstrated remarkable predictive accuracy and robustness, making significant contributions to the field of gene silencing. Furthermore, our approach to redefining data-splitting standards aims to set new benchmarks for future research in the domain of predictive biological modeling for siRNA.

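The abstract mentions a data-splitting methodology that addresses data leakage but does not describe it. The sketch below is therefore not the authors' method; it shows one common, generic way to limit leakage, a group-wise split in which every record sharing a group key (here, a hypothetical target mRNA identifier) lands entirely in either the training or the test set. The function name `group_split`, the record layout, and the identifiers are all assumptions for illustration.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both train and test."""
    groups = defaultdict(list)
    for record in records:
        groups[group_key(record)].append(record)

    group_ids = sorted(groups)               # deterministic order before shuffling
    random.Random(seed).shuffle(group_ids)

    n_test_groups = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test_groups])

    train = [r for g in group_ids if g not in test_ids for r in groups[g]]
    test = [r for g in group_ids if g in test_ids for r in groups[g]]
    return train, test


if __name__ == "__main__":
    # Toy data: 20 hypothetical siRNAs targeting 5 hypothetical mRNAs.
    toy = [(f"siRNA_{i}", f"mRNA_{i % 5}") for i in range(20)]
    train, test = group_split(toy, group_key=lambda r: r[1])
    print(len(train), len(test))
```

A purely random split over siRNA records could place siRNAs against the same mRNA on both sides of the split, which tends to inflate test performance; grouping by target is one way to avoid that.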

Briefings in bioinformatics, volume 25, issue 6. DOI: 10.1093/bib/bbae563. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11539000/pdf/

Citations: 0

TIPS: a novel pathway-guided joint model for transcriptome-wide association studies.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae587

Neng Wang, Zhenyao Ye, Tianzhou Ma

In the past two decades, genome-wide association studies (GWAS) have pinpointed numerous SNPs linked to human diseases and traits, yet many of these SNPs are in non-coding regions and hard to interpret. Transcriptome-wide association studies (TWAS) integrate GWAS and expression reference panels to identify associations at the gene level with tissue specificity, potentially improving interpretability. However, the list of individual genes identified by univariate TWAS contains little unifying biological theme, leaving the underlying mechanisms largely elusive. In this paper, we propose a novel multivariate TWAS method that Incorporates Pathway or gene Set information, namely TIPS, to identify the genes and pathways most associated with complex polygenic traits. We jointly modeled the imputation and association steps in TWAS, incorporated a sparse group lasso penalty in the model to induce selection at both the gene and pathway levels, and developed an expectation-maximization algorithm to estimate the parameters of the penalized likelihood. We applied our method to three complex traits in UK Biobank: systolic blood pressure, diastolic blood pressure, and white matter brain age gap (a brain aging biomarker), and identified critical, biologically relevant pathways and genes associated with these traits. These pathways cannot be detected by the traditional approach of univariate TWAS followed by pathway enrichment analysis, showing the power of our model. We also conducted comprehensive simulations with varying heritability levels and genetic architectures and showed that our method outperformed other established TWAS methods in feature selection, statistical power, and prediction. The R package that implements TIPS is available at https://github.com/nwang123/TIPS.

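To make the sparse group lasso penalty mentioned in the abstract more concrete, the sketch below evaluates the textbook form of that penalty, lam * [(1 - alpha) * sum over groups of sqrt(p_g) * ||beta_g||_2 + alpha * ||beta||_1], for a coefficient vector whose entries stand for genes grouped into pathways. This is the standard formulation, assumed here for illustration; the exact weighting used in TIPS, and how it handles genes that belong to several pathways, is not given in the abstract. The names `sparse_group_lasso_penalty`, `pathway_A`, and `pathway_B` are hypothetical.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Evaluate lam * [(1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2 + alpha * ||beta||_1].

    beta   : list of coefficients (one per gene)
    groups : dict mapping a group name to the list of coefficient indices in it
             (assumed non-overlapping in this sketch)
    lam    : overall penalty strength
    alpha  : mixing weight; 1 gives the plain lasso, 0 the plain group lasso
    """
    l1_term = sum(abs(b) for b in beta)
    group_term = 0.0
    for members in groups.values():
        group_norm = math.sqrt(sum(beta[j] ** 2 for j in members))
        group_term += math.sqrt(len(members)) * group_norm
    return lam * ((1.0 - alpha) * group_term + alpha * l1_term)


if __name__ == "__main__":
    beta = [3.0, 4.0, 0.0, 0.0]
    groups = {"pathway_A": [0, 1], "pathway_B": [2, 3]}
    print(sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5))
```

The group term can shrink a whole pathway to zero at once, while the L1 term can zero out individual genes inside a retained pathway, which is what allows selection at both levels.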

Briefings in bioinformatics, volume 25, issue 6. DOI: 10.1093/bib/bbae587. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11568880/pdf/

Citations: 0

AESurv: autoencoder survival analysis for accurate early prediction of coronary heart disease.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae479

Yike Shen, Arce Domingo-Relloso, Allison Kupsco, Marianthi-Anna Kioumourtzoglou, Maria Tellez-Plaza, Jason G Umans, Amanda M Fretts, Ying Zhang, Peter F Schnatz, Ramon Casanova, Lisa Warsinger Martin, Steve Horvath, JoAnn E Manson, Shelley A Cole, Haotian Wu, Eric A Whitsel, Andrea A Baccarelli, Ana Navas-Acien, Feng Gao

Coronary heart disease (CHD) is one of the leading causes of mortality and morbidity in the United States. Accurate time-to-event CHD prediction models with high-dimensional DNA methylation and clinical features may assist with early prediction and intervention strategies. We developed a state-of-the-art deep learning autoencoder survival analysis model (AESurv) to effectively analyze high-dimensional blood DNA methylation features and traditional clinical risk factors by learning a low-dimensional representation of participants for time-to-event CHD prediction. We demonstrated the utility of our model in two cohort studies: the Strong Heart Study (SHS), a prospective cohort studying cardiovascular disease and its risk factors among American Indian adults, and the Women's Health Initiative (WHI), a prospective cohort study that includes randomized clinical trials and an observational study to improve postmenopausal women's health, with cardiovascular disease as one of its main focuses. Our AESurv model effectively learned participant representations in a low-dimensional latent space and achieved better model performance (concordance index [C-index] of 0.864 ± 0.009 and time-to-event mean area under the receiver operating characteristic curve [AUROC] of 0.905 ± 0.009) than other survival analysis models (Cox proportional hazard, Cox proportional hazard deep neural network survival analysis, random survival forest, and gradient boosting survival analysis models) in the SHS. We further validated the AESurv model in the WHI, where it also achieved the best model performance. The AESurv model can be used for accurate CHD prediction and can help health care professionals and patients implement early intervention strategies. We suggest using the AESurv model for future time-to-event CHD prediction based on DNA methylation features.

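The abstract reports model quality as a concordance index (C-index). For readers unfamiliar with the metric, the sketch below computes Harrell's C-index on right-censored data in plain Python. It is a generic textbook implementation, not code from AESurv; the function name `concordance_index`, the handling of tied risk scores (counted as half-concordant), and the toy data are assumptions for this example.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times       : observed follow-up time per subject
    events      : 1 if the event was observed, 0 if the subject was censored
    risk_scores : model output where a higher score means higher predicted risk

    A pair (i, j) is comparable when subject i had an observed event strictly
    before subject j's observed time. The pair is concordant when the model
    assigns subject i the higher risk; tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risk = [0.9, 0.7, 0.8, 0.1]
    print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.864 means that in roughly 86% of comparable pairs the participant with the earlier event received the higher predicted risk.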

Briefings in bioinformatics, volume 25, issue 6. DOI: 10.1093/bib/bbae479. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11424508/pdf/

Citations: 0