Briefings in bioinformatics: latest articles, page 7

Multi-view learning framework for predicting unknown types of cancer markers via directed graph neural networks fitting regulatory networks.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae546

Xin-Fei Wang, Lan Huang, Yan Wang, Ren-Chu Guan, Zhu-Hong You, Nan Sheng, Xu-Ping Xie, Wen-Ju Hou

The discovery of diagnostic and therapeutic biomarkers for complex diseases, especially cancer, has always been a central and long-term challenge in molecular association prediction research, offering promising avenues for advancing the understanding of complex diseases. To this end, researchers have developed various network-based prediction techniques targeting specific molecular associations. However, limitations imposed by reductionism and network representation learning have led existing studies to focus narrowly on high prediction efficiency within a single association type, thereby glossing over the discovery of unknown types of associations. Additionally, effectively utilizing network structure to fit the interaction properties of regulatory networks and combining specific case biomarker validations remains an unresolved issue in cancer biomarker prediction methods. To overcome these limitations, we propose a multi-view learning framework, CeRVE, based on directed graph neural networks (DGNN) for predicting cancer biomarkers of unknown types. CeRVE effectively extracts and integrates subgraph information through multi-view feature learning. Subsequently, CeRVE utilizes DGNN to simulate the entire regulatory network, propagating node attribute features and extracting various interaction relationships between molecules. Furthermore, CeRVE constructed a comparative analysis matrix of three cancers and adjacent normal tissues through The Cancer Genome Atlas and identified multiple types of potential cancer biomarkers through differential expression analysis of mRNA, microRNA, and long noncoding RNA. Computational testing of multiple types of biomarkers for 72 cancers demonstrates that CeRVE exhibits superior performance in cancer biomarker prediction, providing a powerful tool and insightful approach for AI-assisted disease biomarker discovery.
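
To make the role of edge direction concrete, the following is a minimal sketch of one round of message passing on a directed graph, assuming mean aggregation over incoming edges and a simple additive update. It is not CeRVE's architecture, which the abstract does not detail; the function name `propagate_directed`, the toy node names, and the feature values are all hypothetical.

```python
def propagate_directed(features, edges):
    """One round of mean aggregation along edge direction (source -> target).

    features: dict mapping node name to a list of floats (all same length).
    edges: list of (source, target) pairs; a message flows only from source to target.
    """
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])

    updated = {}
    for node, own in features.items():
        msgs = incoming[node]
        if msgs:
            # Column-wise mean of the feature vectors of all regulators of this node.
            mean_msg = [sum(col) / len(msgs) for col in zip(*msgs)]
        else:
            # A node with no incoming edges receives no message.
            mean_msg = [0.0] * len(own)
        updated[node] = [a + b for a, b in zip(own, mean_msg)]
    return updated


# Toy regulatory chain (hypothetical): lncRNA -> miRNA -> mRNA.
toy_features = {"miRNA": [1.0, 0.0], "mRNA": [0.0, 1.0], "lncRNA": [0.5, 0.5]}
toy_edges = [("miRNA", "mRNA"), ("lncRNA", "miRNA")]
toy_updated = propagate_directed(toy_features, toy_edges)
```

Because messages follow edge direction, the mRNA node receives the miRNA's features but not the reverse, which is the property an undirected network would lose.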

Volume 25, Issue 6. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11514060/pdf/

Citations: 0

A consensus-based classification workflow to determine genetically inferred ancestry from comprehensive genomic profiling of patients with solid tumors.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae557

Zachary D Wallen, Mary K Nesline, Sarabjot Pabla, Shuang Gao, Erik Vanroey, Stephanie B Hastings, Heidi Ko, Kyle C Strickland, Rebecca A Previs, Shengle Zhang, Jeffrey M Conroy, Taylor J Jensen, Elizabeth George, Marcia Eisenberg, Brian Caveney, Pratheesh Sathyan, Shakti Ramkissoon, Eric A Severson

Disparities in cancer diagnosis, treatment, and outcomes based on self-identified race and ethnicity (SIRE) are well documented, yet these variables have historically been excluded from clinical research. Without SIRE, genetic ancestry can be inferred using single-nucleotide polymorphisms (SNPs) detected from tumor DNA using comprehensive genomic profiling (CGP). However, factors inherent to CGP of tumor DNA increase the difficulty of identifying ancestry-informative SNPs, and current workflows for inferring genetic ancestry from CGP need improvements in key areas of the ancestry inference process. This study used genomic data from 4274 diverse reference subjects and CGP data from 491 patients with solid tumors and SIRE to develop and validate a workflow to obtain accurate genetically inferred ancestry (GIA) from CGP sequencing results. We use consensus-based classification to derive confident ancestral inferences from an expanded reference dataset covering eight world populations (African, Admixed American, Central Asian/Siberian, European, East Asian, Middle Eastern, Oceania, South Asian). Our GIA calls were highly concordant with SIRE (95%) and aligned well with reference populations of inferred ancestries. Further, our workflow could expand on SIRE by (i) detecting the ancestry of patients that usually lack appropriate racial categories, (ii) determining what patients have mixed ancestry, and (iii) resolving ancestries of patients in heterogeneous racial categories and who had missing SIRE. Accurate GIA provides needed information to enable ancestry-aware biomarker research, ensure the inclusion of underrepresented groups in clinical research, and increase the diverse representation of patient populations eligible for precision medicine therapies and trials.
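
The idea of consensus-based classification can be sketched as follows, assuming that several independent classifiers each emit one population label per patient and that a call is made only when a minimum fraction of them agree. The abstract does not specify the classifiers or the agreement rule, so the function `consensus_call`, the `min_agreement` threshold, and the toy votes below are hypothetical.

```python
from collections import Counter


def consensus_call(labels, min_agreement=0.75):
    """Return the most common label if enough classifiers agree, else 'ambiguous'."""
    if not labels:
        raise ValueError("need at least one classifier label")
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count / len(labels) >= min_agreement:
        return top_label
    return "ambiguous"


# Toy votes from four hypothetical classifiers per patient.
votes_by_patient = {
    "patient_1": ["European", "European", "European", "European"],
    "patient_2": ["East Asian", "East Asian", "South Asian", "East Asian"],
    "patient_3": ["African", "Admixed American", "European", "African"],
}
calls = {pid: consensus_call(votes) for pid, votes in votes_by_patient.items()}
```

Requiring agreement trades coverage for confidence: patients whose classifiers disagree are left uncalled here rather than forced into a single category.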

Volume 25, Issue 6. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11521331/pdf/

Citations: 0

Graph contrastive learning as a versatile foundation for advanced scRNA-seq data analysis.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae558

Zhenhao Zhang, Yuxi Liu, Meichen Xiao, Kun Wang, Yu Huang, Jiang Bian, Ruolin Yang, Fuyi Li

Single-cell RNA sequencing (scRNA-seq) offers unprecedented insights into transcriptome-wide gene expression at the single-cell level. Cell clustering has been long established in the analysis of scRNA-seq data to identify the groups of cells with similar expression profiles. However, cell clustering is technically challenging, as raw scRNA-seq data have various analytical issues, including high dimensionality and dropout values. Existing research has developed deep learning models, such as graph machine learning models and contrastive learning-based models, for cell clustering using scRNA-seq data and has summarized the unsupervised learning of cell clustering into a human-interpretable format. While advances in cell clustering have been profound, we are no closer to finding a simple yet effective framework for learning high-quality representations necessary for robust clustering. In this study, we propose scSimGCL, a novel framework based on the graph contrastive learning paradigm for self-supervised pretraining of graph neural networks. This framework facilitates the generation of high-quality representations crucial for cell clustering. Our scSimGCL incorporates cell-cell graph structure and contrastive learning to enhance the performance of cell clustering. Extensive experimental results on simulated and real scRNA-seq datasets suggest the superiority of the proposed scSimGCL. Moreover, clustering assignment analysis confirms the general applicability of scSimGCL, including state-of-the-art clustering algorithms. Further, ablation study and hyperparameter analysis suggest the efficacy of our network architecture with the robustness of decisions in the self-supervised learning setting. The proposed scSimGCL can serve as a robust framework for practitioners developing tools for cell clustering. The source code of scSimGCL is publicly available at https://github.com/zhangzh1328/scSimGCL.
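
As a rough illustration of the contrastive learning paradigm mentioned above, the sketch below computes a generic InfoNCE-style loss between two views of the same cells, assuming cosine similarity and a fixed temperature. It is not the scSimGCL objective (see the linked repository for the actual implementation); the function names, the temperature, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss: cell i in view_a should match cell i in view_b."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching cell.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings for two cells under two augmentations.
cells_view_a = [[1.0, 0.0], [0.0, 1.0]]
cells_view_b = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(cells_view_a, cells_view_b)
shuffled_loss = info_nce(cells_view_a, list(reversed(cells_view_b)))
```

The loss is lower when each cell's two views are paired correctly than when the pairing is shuffled, which is the signal that pulls matching representations together during pretraining.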

Volume 25, Issue 6. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11530284/pdf/

Citations: 0

BFAST: joint dimension reduction and spatial clustering with Bayesian factor analysis for zero-inflated spatial transcriptomics data.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae594

Yang Xu, Dian Lv, Xuanxuan Zou, Liang Wu, Xun Xu, Xin Zhao

The development of spatially resolved transcriptomics (ST) technologies has made it possible to measure gene expression profiles coupled with cellular spatial context and assist biologists in comprehensively characterizing cellular phenotype heterogeneity and tissue microenvironment. Spatial clustering is vital for biological downstream analysis. However, high noise and dropout events make clustering spatial transcriptomics data challenging, and effective algorithms are lacking. Here we develop a novel method, Bayesian Factor Analysis for zero-inflated Spatial Transcriptomics data (BFAST), which jointly performs dimension reduction and spatial clustering. BFAST has showcased exceptional performance on simulation data and real spatial transcriptomics datasets, as shown by benchmarking against currently available methods. It effectively extracts more biologically informative low-dimensional features compared to traditional dimensionality reduction approaches, thereby enhancing the accuracy and precision of clustering.

Volume 25, Issue 6. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11570543/pdf/

Citations: 0

scMGATGRN: a multiview graph attention network-based method for inferring gene regulatory networks from single-cell transcriptomic data.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae526

Lin Yuan, Ling Zhao, Yufeng Jiang, Zhen Shen, Qinhu Zhang, Ming Zhang, Chun-Hou Zheng, De-Shuang Huang

The gene regulatory network (GRN) plays a vital role in understanding the structure and dynamics of cellular systems, revealing complex regulatory relationships, and exploring disease mechanisms. Recently, deep learning (DL)-based methods have been proposed to infer GRNs from single-cell transcriptomic data and achieved impressive performance. However, these methods do not fully utilize graph topological information and high-order neighbor information from multiple receptive fields. To overcome those limitations, we propose a novel model based on multiview graph attention network, namely, scMGATGRN, to infer GRNs. scMGATGRN mainly consists of GAT, multiview, and view-level attention mechanism. GAT can extract essential features of the gene regulatory network. The multiview model can simultaneously utilize local feature information and high-order neighbor feature information of nodes in the gene regulatory network. The view-level attention mechanism dynamically adjusts the relative importance of node embedding representations and efficiently aggregates node embedding representations from two views. To verify the effectiveness of scMGATGRN, we compared its performance with 10 methods (five shallow learning algorithms and five state-of-the-art DL-based methods) on seven benchmark single-cell RNA sequencing (scRNA-seq) datasets from five cell lines (two in human and three in mouse) with four different kinds of ground-truth networks. The experimental results not only show that scMGATGRN outperforms competing methods but also demonstrate the potential of this model in inferring GRNs. The code and data of scMGATGRN are made freely available on GitHub (https://github.com/nathanyl/scMGATGRN).
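
The view-level attention step described above can be sketched as a softmax over per-view scores followed by a weighted sum of the two embeddings. This is a generic formulation under stated assumptions, not the scMGATGRN code (available at the GitHub link above): the scoring rule (a query vector dotted with a tanh-transformed embedding), the function name `view_attention`, and the toy vectors are hypothetical, and in a trained model the query would be learned rather than fixed.

```python
import math


def view_attention(local_emb, high_order_emb, query):
    """Fuse two view embeddings of one node with softmax attention weights."""

    def score(embedding):
        # Scalar relevance of a view: query dotted with tanh of the embedding.
        return sum(q * math.tanh(x) for q, x in zip(query, embedding))

    s_local, s_high = score(local_emb), score(high_order_emb)
    # Two-way softmax, shifted by the maximum for numerical stability.
    peak = max(s_local, s_high)
    e_local, e_high = math.exp(s_local - peak), math.exp(s_high - peak)
    w_local = e_local / (e_local + e_high)
    w_high = 1.0 - w_local
    fused = [w_local * a + w_high * b for a, b in zip(local_emb, high_order_emb)]
    return fused, (w_local, w_high)


# Toy case: the query favours the first dimension, which only the local view expresses.
fused_emb, view_weights = view_attention([1.0, 0.0], [0.0, 1.0], query=[1.0, 0.0])
```

The weights always sum to one, so the fused embedding stays on the same scale as the inputs while shifting toward whichever view the query scores higher.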

Volume 25, Issue 6. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484520/pdf/

Citations: 0

Statistical analysis of multiple regions-of-interest in multiplexed spatial proteomics data.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae522

Sarah Samorodnitsky, Michael C Wu

Multiplexed spatial proteomics reveals the spatial organization of cells in tumors, which is associated with important clinical outcomes such as survival and treatment response. This spatial organization is often summarized using spatial summary statistics, including Ripley's K and Besag's L. However, if multiple regions of the same tumor are imaged, it is unclear how to synthesize the relationship with a single patient-level endpoint. We evaluate extant approaches for accommodating multiple images within the context of associating summary statistics with outcomes. First, we consider averaging-based approaches wherein multiple summaries for a single sample are combined in a weighted mean. We then propose a novel class of ensemble testing approaches in which we simulate random weights used to aggregate summaries, test for an association with outcomes, and combine the $P$-values. We systematically evaluate the performance of these approaches via simulation and application to data from non-small cell lung cancer, colorectal cancer, and triple negative breast cancer. We find that the optimal strategy varies, but a simple weighted average of the summary statistics based on the number of cells in each image often offers the highest power and controls type I error effectively. When the size of the imaged regions varies, incorporating this variation into the weighted aggregation may yield additional power in cases where the varying size is informative. Ensemble testing (but not resampling) offered high power and type I error control across conditions in our simulated data sets.
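
The cell-count-weighted average described above reduces several per-image summaries to one patient-level value. A minimal sketch, assuming each image contributes one scalar summary (for example, Ripley's K evaluated at a fixed radius) and its cell count; the function name `patient_level_summary` and the toy numbers are hypothetical.

```python
def patient_level_summary(roi_summaries, roi_cell_counts):
    """Weighted mean of per-image summaries, weighted by cells per image."""
    if len(roi_summaries) != len(roi_cell_counts):
        raise ValueError("one cell count is required per summary")
    total_cells = sum(roi_cell_counts)
    if total_cells == 0:
        raise ValueError("at least one image must contain cells")
    weighted_sum = sum(s * n for s, n in zip(roi_summaries, roi_cell_counts))
    return weighted_sum / total_cells


# Three images from one tumor (toy values).
toy_summaries = [0.2, 0.5, 0.8]
toy_cell_counts = [100, 300, 600]
toy_patient_value = patient_level_summary(toy_summaries, toy_cell_counts)
```

In this toy case the densest image dominates, giving 0.65 rather than the unweighted mean of 0.5; the resulting value can then be tested for association with the patient-level outcome.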

Volume 25, Issue 6. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11491162/pdf/

Citations: 0

Forecasting dominance of SARS-CoV-2 lineages by anomaly detection using deep AutoEncoders.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae535

Simone Rancati, Giovanna Nicora, Mattia Prosperi, Riccardo Bellazzi, Marco Salemi, Simone Marini

The COVID-19 pandemic is marked by the successive emergence of new SARS-CoV-2 variants, lineages, and sublineages that outcompete earlier strains, largely due to factors like increased transmissibility and immune escape. We propose DeepAutoCoV, an unsupervised deep learning anomaly detection system, to predict future dominant lineages (FDLs). We define FDLs as viral (sub)lineages that will constitute >10% of all the viral sequences added to the GISAID, a public database supporting viral genetic sequence sharing, in a given week. DeepAutoCoV is trained and validated by assembling global and country-specific data sets from over 16 million Spike protein sequences sampled over a period of ~4 years. DeepAutoCoV successfully flags FDLs at very low frequencies (0.01%-3%), with median lead times of 4-17 weeks, and predicts FDLs between ~5 and ~25 times better than a baseline approach. For example, the B.1.617.2 vaccine reference strain was flagged as FDL when its frequency was only 0.01%, more than a year before it was considered for an updated COVID-19 vaccine. Furthermore, DeepAutoCoV outputs interpretable results by pinpointing specific mutations potentially linked to increased fitness and may provide significant insights for the optimization of public health 'pre-emptive' intervention strategies.
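
The FDL definition above (a lineage exceeding 10% of the sequences added in a given week) can be written down directly. The sketch below covers only that labelling rule, not the DeepAutoCoV autoencoder; it assumes the input is a list of (week, lineage) pairs, and the function name `dominant_lineages`, the week labels, and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_lineages(records, threshold=0.10):
    """Map each week to the set of lineages whose weekly share exceeds threshold."""
    counts_by_week = defaultdict(Counter)
    for week, lineage in records:
        counts_by_week[week][lineage] += 1

    dominant = {}
    for week, counts in counts_by_week.items():
        total = sum(counts.values())
        # Strict inequality: a share of exactly 10% does not qualify.
        dominant[week] = {name for name, n in counts.items() if n / total > threshold}
    return dominant


# Toy submissions: lineage B grows from 10% to 45% between two weeks.
toy_records = (
    [("week_1", "A")] * 85 + [("week_1", "B")] * 10 + [("week_1", "C")] * 5
    + [("week_2", "A")] * 50 + [("week_2", "B")] * 45 + [("week_2", "C")] * 5
)
toy_dominant = dominant_lineages(toy_records)
```

In the toy data lineage B is not yet dominant in week 1 but is in week 2; a forecasting system aims to flag such a lineage while its share is still far below the threshold.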

Citations: 0

m6ATM: a deep learning framework for demystifying the m6A epitranscriptome with Nanopore long-read RNA-seq data.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae529

Boyi Yu, Genta Nagae, Yutaka Midorikawa, Kenji Tatsuno, Bhaskar Dasgupta, Hiroyuki Aburatani, Hiroki Ueda

N6-methyladenosine (m6A), discovered in the 1970s, is one of the most abundant and best-known modifications in messenger RNAs. Recent studies have demonstrated that m6A is involved in various biological processes, such as alternative splicing and RNA degradation, and plays an important role in a variety of diseases. To better understand the role of m6A, transcriptome-wide m6A profiling data are indispensable. In recent years, the Oxford Nanopore Technologies Direct RNA Sequencing (DRS) platform has shown promise for RNA modification detection based on current disruptions measured in transcripts. However, decoding current intensity data into modification profiles remains a challenging task. Here, we introduce the m6A Transcriptome-wide Mapper (m6ATM), a novel Python-based computational pipeline that applies deep neural networks to predict m6A sites at single-base resolution using DRS data. The m6ATM model architecture incorporates a WaveNet encoder and a dual-stream multiple-instance learning model to extract features from specific target sites and characterize the m6A epitranscriptome. In validation, m6ATM achieved an accuracy of 80% to 98% across in vitro transcription datasets containing varying m6A modification ratios and outperformed other tools in benchmarking with human cell line data. Moreover, we demonstrated the versatility of m6ATM in providing reliable stoichiometric information and used it to pinpoint PEG10 as a potential m6A target transcript in liver cancer cells. In conclusion, m6ATM is a high-performance m6A detection tool, and our results pave the way for future advancements in epitranscriptomic research.
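
To illustrate the multiple-instance idea mentioned above, the following is a simplified sketch of dual-stream pooling, in which each candidate site is a "bag" of per-read feature vectors. It is not the m6ATM implementation: the linear scorer `w`, the dot-product attention, the averaging of the two streams, and the toy read features are all assumptions chosen to keep the example short, and a real model would learn these components.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dual_stream_pool(reads, w):
    """Pool per-read feature vectors for one candidate site.

    Stream 1: score every read with a linear scorer and keep the maximum.
    Stream 2: weight reads by similarity to the top-scoring read, then
    score the attention-weighted average embedding.
    Returns (site_probability, attention_weights).
    """
    scores = [dot(w, r) for r in reads]
    critical = max(range(len(reads)), key=scores.__getitem__)
    attn = softmax([dot(r, reads[critical]) for r in reads])
    bag = [sum(a * r[i] for a, r in zip(attn, reads)) for i in range(len(w))]
    site_logit = 0.5 * (scores[critical] + dot(w, bag))
    return sigmoid(site_logit), attn


# Hypothetical two-dimensional read features; the weights are hand-picked.
reads = [[2.0, 0.0], [0.0, 1.0], [1.8, 0.1]]
w = [1.0, -1.0]
prob, attn = dual_stream_pool(reads, w)
```

The attention weights sum to one and concentrate on reads that resemble the highest-scoring read, which is one way a site-level call can be accompanied by a per-read view of which molecules look modified.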

Citations: 0

RiceSNP-BST: a deep learning framework for predicting biotic stress-associated SNPs in rice.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae599

Jiajun Xu, Yujia Gao, Quan Lu, Renyi Zhang, Jianfeng Gui, Xiaoshuang Liu, Zhenyu Yue

Rice consistently faces significant threats from biotic stresses, such as fungi, bacteria, pests, and viruses. Consequently, accurately and rapidly identifying previously unknown single-nucleotide polymorphisms (SNPs) in the rice genome is a critical challenge for rice research and the development of resistant varieties. However, the limited availability of high-quality rice genotype data has hindered this research. Deep learning has transformed biological research by facilitating the prediction and analysis of SNPs in biological sequence data. Convolutional neural networks are especially effective in extracting structural and local features from DNA sequences, leading to significant advancements in genomics. Nevertheless, the expanding catalog of genome-wide association studies provides valuable biological insights for rice research. Expanding on this idea, we introduce RiceSNP-BST, an automatic architecture search framework designed to predict SNPs associated with rice biotic stress traits (BST-associated SNPs) by integrating multidimensional features. Notably, the model successfully innovates the datasets, offering more precision than state-of-the-art methods while demonstrating good performance on an independent test set and cross-species datasets. Additionally, we extracted features from the original DNA sequences and employed causal inference to enhance the biological interpretability of the model. This study highlights the potential of RiceSNP-BST in advancing genome prediction in rice. Furthermore, a user-friendly web server for RiceSNP-BST (http://rice-snp-bst.aielab.cc) has been developed to support broader genome research.
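
Convolutional models of the kind described above typically consume DNA as a numeric matrix. As a minimal sketch of that preprocessing step, assuming a one-hot encoding over A, C, G, T and a fixed window around each SNP, the code below extracts and encodes a sequence window. The abstract does not specify the encoding or window size used by RiceSNP-BST, so the helper names, the flank length, and the toy sequence are illustrative assumptions.

```python
BASES = "ACGT"


def snp_window(genome, pos, flank):
    """Return the subsequence covering `flank` bases on each side of `pos`.

    `pos` is a 0-based index; the window is truncated at the sequence ends.
    """
    start = max(0, pos - flank)
    return genome[start:pos + flank + 1]


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Any other symbol (for example N) becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


# Toy sequence invented for illustration.
window = snp_window("AACCGGTT", pos=3, flank=2)
encoded = one_hot(window)
```

Each row of `encoded` corresponds to one position in the window, so a window of length L becomes an L x 4 matrix that a one-dimensional convolution can scan for local sequence patterns.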

Citations: 0

A review of feature selection strategies utilizing graph data structures and Knowledge Graphs.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae521

Sisi Shao, Pedro Henrique Ribeiro, Christina M Ramirez, Jason H Moore

Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems. This paper delves into the methodologies for feature selection (FS) within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hypothesis generation, and interpretability. Through this comprehensive review, we aim to catalyze further innovation in FS for KGs, paving the way for more insightful, efficient, and interpretable analytical models across various domains. Our exploration reveals the critical importance of scalability, accuracy, and interpretability in FS techniques, advocating for the integration of domain knowledge to refine the selection process. We highlight the burgeoning potential of multi-objective optimization and interdisciplinary collaboration in advancing KG FS, underscoring the transformative impact of such methodologies on precision medicine, among other fields. The paper concludes by charting future directions, including the development of scalable, dynamic FS algorithms and the integration of explainable AI principles to foster transparency and trust in KG-driven models.
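
One simple way domain knowledge in a graph can guide feature selection is to keep only candidate features that lie close to the prediction target in the KG. The sketch below does this with a breadth-first search over an adjacency-list graph. It is a generic illustration and is not taken from the review: the node names, the hop cutoff, and the helper functions are all hypothetical.

```python
from collections import deque


def hops_from(graph, source):
    """Breadth-first search: number of edges from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def select_features(graph, target, candidates, max_hops):
    """Keep candidates within `max_hops` edges of `target`, preserving input order."""
    dist = hops_from(graph, target)
    return [f for f in candidates if dist.get(f, float("inf")) <= max_hops]


# Toy undirected knowledge graph, invented for illustration.
kg = {
    "disease": ["geneA", "pathway1"],
    "geneA": ["disease", "geneB"],
    "pathway1": ["disease", "geneC"],
    "geneB": ["geneA"],
    "geneC": ["pathway1", "geneD"],
    "geneD": ["geneC"],
    "geneE": [],
}
candidates = ["geneA", "geneB", "geneC", "geneD", "geneE"]
selected = select_features(kg, "disease", candidates, max_hops=2)
```

Because every retained feature comes with an explicit path to the target, a filter like this is easy to interpret, although it ignores edge types and data-driven relevance, which more elaborate KG-based methods take into account.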

Citations: 0