Many complex diseases exhibit pronounced sex differences that can affect the initial risk of developing the disease as well as clinical symptoms, molecular manifestations, disease progression, and the risk of developing comorbidities. Despite this, computational studies of molecular data for complex diseases often treat sex as a confounding variable, aiming to filter out sex-specific effects rather than to interpret them. A more systematic, in-depth exploration of sex-specific disease mechanisms could significantly improve our understanding of pathological and protective processes with sex-dependent profiles. This survey discusses dedicated bioinformatics approaches for the study of molecular sex differences in complex diseases. It highlights that, beyond classical statistical methods, approaches are needed that integrate prior knowledge of relevant hormone signaling interactions, gene regulatory networks, and sex linkage of genes to provide a mechanistic interpretation of sex-dependent alterations in disease. The review examines and compares the advantages, pitfalls, and limitations of various conventional statistical and systems-level mechanistic analyses for this purpose, including tailored pathway and network analysis techniques. Overall, this survey highlights the potential of specialized bioinformatics techniques to systematically investigate molecular sex differences in complex diseases, to inform biomarker signature modeling, and to guide more personalized treatment approaches.
{"title":"Bioinformatics approaches for studying molecular sex differences in complex diseases.","authors":"Rebecca Ting Jiin Loo, Mohamed Soudy, Francesco Nasta, Mirco Macchi, Enrico Glaab","doi":"10.1093/bib/bbae499","DOIUrl":"https://doi.org/10.1093/bib/bbae499","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471957/pdf/","platform":"Semanticscholar","paperid":"142458365","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Lonneke Scheffer, Eric Emanuel Reber, Brij Bhushan Mehta, Milena Pavlović, Maria Chernigovskaya, Eve Richardson, Rahmad Akbar, Fridtjof Lund-Johansen, Victor Greiff, Ingrid Hobæk Haff, Geir Kjetil Sandve
Adaptive immune receptors, such as antibodies and T-cell receptors, recognize foreign threats with exquisite specificity. A major challenge in adaptive immunology is discovering the rules governing immune receptor-antigen binding in order to predict the antigen binding status of previously unseen immune receptors. Many studies assume that the antigen binding status of an immune receptor may be determined by the presence of a short motif in the complementarity determining region 3 (CDR3), disregarding other amino acids. To test this assumption, we present a method to discover short motifs that show high precision in predicting antigen binding and generalize well to unseen simulated and experimental data. Our analysis of a mutagenesis-based antibody dataset reveals 11 336 position-specific, mostly gapped motifs of 3-5 amino acids that retain high precision on independently generated experimental data. Using a subset of only 178 motifs, we built a simple classifier that, on the independently generated dataset, outperformed a deep learning model proposed specifically for such datasets. In conclusion, our findings support the notion that for some antibodies, antigen binding may be largely determined by a short CDR3 motif. As more experimental data emerge, our methodology could serve as a foundation for in-depth investigations into antigen binding signals.
{"title":"Predictability of antigen binding based on short motifs in the antibody CDRH3.","authors":"Lonneke Scheffer, Eric Emanuel Reber, Brij Bhushan Mehta, Milena Pavlović, Maria Chernigovskaya, Eve Richardson, Rahmad Akbar, Fridtjof Lund-Johansen, Victor Greiff, Ingrid Hobæk Haff, Geir Kjetil Sandve","doi":"10.1093/bib/bbae537","DOIUrl":"https://doi.org/10.1093/bib/bbae537","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11495870/pdf/","platform":"Semanticscholar","paperid":"142495342","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA","Total":0}
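The idea of a position-specific, gapped motif with high precision, as described in the abstract above, can be illustrated with a minimal sketch. The motif representation (a mapping from CDR3 position to required amino acid), the function names, and the toy sequences are all assumptions made for illustration; they are not the authors' implementation or data.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: position index -> required amino acid.
# Positions not listed are "gaps" and may hold any residue.
Motif = Dict[int, str]


def matches(motif: Motif, seq: str) -> bool:
    """Return True if every fixed position of the motif is present in seq."""
    return all(pos < len(seq) and seq[pos] == aa for pos, aa in motif.items())


def precision(motif: Motif, data: List[Tuple[str, bool]]) -> float:
    """Fraction of motif-carrying sequences that are labelled as binders."""
    labels = [is_binder for seq, is_binder in data if matches(motif, seq)]
    return sum(labels) / len(labels) if labels else 0.0


# Toy labelled sequences (invented for this example only).
toy_data = [
    ("ARDYGSW", True),
    ("ARDFGSW", True),
    ("GRDAGSW", True),
    ("TRDYGSW", False),  # carries the motif but does not bind
    ("AKDYGTW", False),
    ("ARNYGSW", False),
]

# Gapped motif of 3 amino acids: R at position 1, D at 2, S at 5.
toy_motif = {1: "R", 2: "D", 5: "S"}
toy_precision = precision(toy_motif, toy_data)
```

In this toy set the motif occurs in four sequences, three of which are binders, so its precision is 0.75. A classifier in the spirit of the abstract would call a sequence a binder if it carries at least one motif from a selected high-precision set.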
Transcription factors (TFs) in bacteria play a crucial role in gene regulation by binding to specific DNA sequences, thereby assisting in the activation or repression of genes. Despite their central role, deciphering shape recognition in bacterial TF-DNA interactions remains an intricate challenge. A deeper understanding of DNA secondary structures could greatly enhance our knowledge of how TFs recognize and interact with DNA, thereby elucidating their biological function. In this study, we employed machine learning algorithms to predict transcription factor binding sites (TFBS) and classify them as directed-repeat (DR) or inverted-repeat (IR). To accomplish this, we divided the set of TFBS nucleotide sequences by size, ranging from 8 to 20 base pairs, and converted them into thermodynamic data known as DNA duplex stability (DDS). Our results demonstrate that the Random Forest algorithm predicts TFBS with an average accuracy of over 82% and distinguishes between IR and DR with an accuracy of 89%. Interestingly, upon converting the base pairs of several TFBS-IR into DDS values, we observed a symmetric profile typical of the palindromic structure associated with these architectures. This study presents a novel TFBS prediction model based on a DDS characteristic that may indicate how the respective proteins interact with base pairs, thus providing insights into the molecular mechanisms underlying bacterial TF-DNA interaction.
{"title":"Predicting bacterial transcription factor binding sites through machine learning and structural characterization based on DNA duplex stability.","authors":"André Borges Farias, Gustavo Sganzerla Martinez, Edgardo Galán-Vásquez, Marisa Fabiana Nicolás, Ernesto Pérez-Rueda","doi":"10.1093/bib/bbae581","DOIUrl":"10.1093/bib/bbae581","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562833/pdf/","platform":"Semanticscholar","paperid":"142614939","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA","Total":0}
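The symmetric DDS profile reported for inverted repeats in the abstract above follows from a simple property, sketched here under stated assumptions: if each dinucleotide step is assigned a stability value that is the same for a step and its reverse complement, then any reverse-complement palindrome yields a mirror-symmetric profile. The per-step values below are illustrative placeholders in the style of nearest-neighbor free energies (kcal/mol); the abstract does not say which thermodynamic table the authors used, and the function names are hypothetical.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Illustrative stability values for the 10 unique dinucleotide steps
# (assumption: placeholder numbers, not taken from the paper).
_UNIQUE_STEPS: Dict[str, float] = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
    "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84,
}

# Expand to all 16 steps: a step and its reverse complement describe the
# same duplex, so they share one value.
STEP_STABILITY: Dict[str, float] = {}
for step, value in _UNIQUE_STEPS.items():
    STEP_STABILITY[step] = value
    STEP_STABILITY[revcomp(step)] = value


def dds_profile(seq: str) -> List[float]:
    """Stability value of each overlapping dinucleotide step along seq."""
    return [STEP_STABILITY[seq[i:i + 2]] for i in range(len(seq) - 1)]


def is_symmetric(profile: List[float]) -> bool:
    """True if the profile reads the same forwards and backwards."""
    return profile == profile[::-1]


palindrome_profile = dds_profile("GAATTC")      # reverse-complement palindrome
non_palindrome_profile = dds_profile("GAATCC")  # not a palindrome
```

Here `is_symmetric(palindrome_profile)` holds while `is_symmetric(non_palindrome_profile)` does not. A profile of this kind, computed for each fixed-length window, is one plausible way to obtain numeric features for a classifier such as a random forest.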
Xin-Fei Wang, Lan Huang, Yan Wang, Ren-Chu Guan, Zhu-Hong You, Nan Sheng, Xu-Ping Xie, Wen-Ju Hou
The discovery of diagnostic and therapeutic biomarkers for complex diseases, especially cancer, has always been a central and long-term challenge in molecular association prediction research, offering promising avenues for advancing the understanding of complex diseases. To this end, researchers have developed various network-based prediction techniques targeting specific molecular associations. However, limitations imposed by reductionism and network representation learning have led existing studies to focus narrowly on high prediction efficiency within a single association type, thereby glossing over the discovery of unknown types of associations. Additionally, effectively utilizing network structure to fit the interaction properties of regulatory networks, and combining this with case-specific biomarker validation, remains an unresolved issue in cancer biomarker prediction methods. To overcome these limitations, we propose a multi-view learning framework, CeRVE, based on directed graph neural networks (DGNN) for predicting cancer biomarkers of unknown types. CeRVE effectively extracts and integrates subgraph information through multi-view feature learning. Subsequently, CeRVE utilizes DGNN to simulate the entire regulatory network, propagating node attribute features and extracting various interaction relationships between molecules. Furthermore, CeRVE constructed a comparative analysis matrix of three cancers and adjacent normal tissues through The Cancer Genome Atlas and identified multiple types of potential cancer biomarkers through differential expression analysis of mRNA, microRNA, and long noncoding RNA. Computational testing of multiple types of biomarkers for 72 cancers demonstrates that CeRVE exhibits superior performance in cancer biomarker prediction, providing a powerful tool and insightful approach for AI-assisted disease biomarker discovery.
{"title":"Multi-view learning framework for predicting unknown types of cancer markers via directed graph neural networks fitting regulatory networks.","authors":"Xin-Fei Wang, Lan Huang, Yan Wang, Ren-Chu Guan, Zhu-Hong You, Nan Sheng, Xu-Ping Xie, Wen-Ju Hou","doi":"10.1093/bib/bbae546","DOIUrl":"10.1093/bib/bbae546","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11514060/pdf/","platform":"Semanticscholar","paperid":"142520983","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA","Total":0}
Zachary D Wallen, Mary K Nesline, Sarabjot Pabla, Shuang Gao, Erik Vanroey, Stephanie B Hastings, Heidi Ko, Kyle C Strickland, Rebecca A Previs, Shengle Zhang, Jeffrey M Conroy, Taylor J Jensen, Elizabeth George, Marcia Eisenberg, Brian Caveney, Pratheesh Sathyan, Shakti Ramkissoon, Eric A Severson
Disparities in cancer diagnosis, treatment, and outcomes based on self-identified race and ethnicity (SIRE) are well documented, yet these variables have historically been excluded from clinical research. Without SIRE, genetic ancestry can be inferred using single-nucleotide polymorphisms (SNPs) detected from tumor DNA using comprehensive genomic profiling (CGP). However, factors inherent to CGP of tumor DNA increase the difficulty of identifying ancestry-informative SNPs, and current workflows for inferring genetic ancestry from CGP need improvements in key areas of the ancestry inference process. This study used genomic data from 4274 diverse reference subjects and CGP data from 491 patients with solid tumors and SIRE to develop and validate a workflow to obtain accurate genetically inferred ancestry (GIA) from CGP sequencing results. We use consensus-based classification to derive confident ancestral inferences from an expanded reference dataset covering eight world populations (African, Admixed American, Central Asian/Siberian, European, East Asian, Middle Eastern, Oceania, South Asian). Our GIA calls were highly concordant with SIRE (95%) and aligned well with reference populations of inferred ancestries. Further, our workflow could expand on SIRE by (i) detecting the ancestry of patients who usually lack appropriate racial categories, (ii) determining which patients have mixed ancestry, and (iii) resolving the ancestries of patients in heterogeneous racial categories and of those with missing SIRE. Accurate GIA provides needed information to enable ancestry-aware biomarker research, ensure the inclusion of underrepresented groups in clinical research, and increase the diverse representation of patient populations eligible for precision medicine therapies and trials.
{"title":"A consensus-based classification workflow to determine genetically inferred ancestry from comprehensive genomic profiling of patients with solid tumors.","authors":"Zachary D Wallen, Mary K Nesline, Sarabjot Pabla, Shuang Gao, Erik Vanroey, Stephanie B Hastings, Heidi Ko, Kyle C Strickland, Rebecca A Previs, Shengle Zhang, Jeffrey M Conroy, Taylor J Jensen, Elizabeth George, Marcia Eisenberg, Brian Caveney, Pratheesh Sathyan, Shakti Ramkissoon, Eric A Severson","doi":"10.1093/bib/bbae557","DOIUrl":"10.1093/bib/bbae557","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11521331/pdf/","platform":"Semanticscholar","paperid":"142543787","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA","Total":0}
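The abstract above does not spell out how its consensus-based classification works, so the following is only one plausible reading, sketched under assumptions: several independent classifiers each propose an ancestry label for a patient, and a call is accepted only when a sufficiently large share of them agree. The function names, the 0.75 agreement threshold, the population codes, and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def consensus_call(calls: List[str], min_agreement: float = 0.75) -> str:
    """Return the majority label if enough classifiers agree, else 'unresolved'.

    `min_agreement` is an assumed threshold, not a value from the paper.
    """
    if not calls:
        return "unresolved"
    label, count = Counter(calls).most_common(1)[0]
    return label if count / len(calls) >= min_agreement else "unresolved"


def concordance(inferred: Dict[str, str], reported: Dict[str, str]) -> float:
    """Share of patients with both labels for whom the two labels agree."""
    shared = [pid for pid in inferred if pid in reported]
    if not shared:
        return 0.0
    return sum(inferred[pid] == reported[pid] for pid in shared) / len(shared)


# Toy per-patient calls from four hypothetical classifiers.
toy_calls = {
    "patient_1": ["EUR", "EUR", "EUR", "EAS"],  # 3 of 4 agree
    "patient_2": ["AFR", "AFR", "AFR", "AFR"],  # unanimous
    "patient_3": ["EUR", "AFR", "EAS", "EUR"],  # no confident majority
}
toy_inferred = {pid: consensus_call(c) for pid, c in toy_calls.items()}

# Toy self-reported labels; patient_3 has none on record.
toy_reported = {"patient_1": "EUR", "patient_2": "EUR"}
toy_concordance = concordance(toy_inferred, toy_reported)
```

With these toy inputs, patients 1 and 2 receive confident calls and patient 3 is left unresolved. Keeping an explicit unresolved outcome, rather than forcing a label, is the design choice that makes the accepted calls "confident" in this sketch.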
Zhenhao Zhang, Yuxi Liu, Meichen Xiao, Kun Wang, Yu Huang, Jiang Bian, Ruolin Yang, Fuyi Li
Single-cell RNA sequencing (scRNA-seq) offers unprecedented insights into transcriptome-wide gene expression at the single-cell level. Cell clustering has long been established in the analysis of scRNA-seq data to identify groups of cells with similar expression profiles. However, cell clustering is technically challenging, as raw scRNA-seq data have various analytical issues, including high dimensionality and dropout values. Existing research has developed deep learning models, such as graph machine learning models and contrastive learning-based models, for cell clustering using scRNA-seq data and has summarized the unsupervised learning of cell clustering into a human-interpretable format. While advances in cell clustering have been profound, we are no closer to finding a simple yet effective framework for learning the high-quality representations necessary for robust clustering. In this study, we propose scSimGCL, a novel framework based on the graph contrastive learning paradigm for self-supervised pretraining of graph neural networks. This framework facilitates the generation of high-quality representations crucial for cell clustering. Our scSimGCL incorporates cell-cell graph structure and contrastive learning to enhance the performance of cell clustering. Extensive experimental results on simulated and real scRNA-seq datasets suggest the superiority of the proposed scSimGCL. Moreover, clustering assignment analysis confirms the general applicability of scSimGCL, including in combination with state-of-the-art clustering algorithms. Further, ablation study and hyperparameter analysis suggest the efficacy of our network architecture and the robustness of its decisions in the self-supervised learning setting. The proposed scSimGCL can serve as a robust framework for practitioners developing tools for cell clustering. The source code of scSimGCL is publicly available at https://github.com/zhangzh1328/scSimGCL.
{"title":"Graph contrastive learning as a versatile foundation for advanced scRNA-seq data analysis.","authors":"Zhenhao Zhang, Yuxi Liu, Meichen Xiao, Kun Wang, Yu Huang, Jiang Bian, Ruolin Yang, Fuyi Li","doi":"10.1093/bib/bbae558","DOIUrl":"10.1093/bib/bbae558","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11530284/pdf/","platform":"Semanticscholar","paperid":"142563897","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA","Total":0}
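To make the contrastive learning ingredient of the abstract above more concrete, here is a minimal, dependency-free sketch of a one-directional InfoNCE-style loss between two views of the same cells. It is not taken from the scSimGCL code base: the function names, the temperature of 0.5, and the toy embeddings are assumptions, and the published method may use a different loss or augmentation scheme.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a: List[Vector], view_b: List[Vector],
             temperature: float = 0.5) -> float:
    """Mean InfoNCE loss, treating (view_a[i], view_b[i]) as the positive pair.

    Every other embedding in view_b acts as a negative for cell i.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings of two cells under two augmented views.
cells_view_a = [[1.0, 0.0], [0.0, 1.0]]
cells_view_b = [[0.9, 0.1], [0.1, 0.9]]              # each cell stays near itself
cells_view_b_swapped = [[0.1, 0.9], [0.9, 0.1]]      # pairs deliberately mismatched

loss_aligned = info_nce(cells_view_a, cells_view_b)
loss_mismatched = info_nce(cells_view_a, cells_view_b_swapped)
```

The loss is lower when the two views of each cell agree (`loss_aligned`) than when they are mismatched (`loss_mismatched`), which is the signal a graph encoder would be trained to minimize during self-supervised pretraining.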
The development of spatially resolved transcriptomics (ST) technologies has made it possible to measure gene expression profiles coupled with cellular spatial context, helping biologists comprehensively characterize cellular phenotype heterogeneity and the tissue microenvironment. Spatial clustering is vital for downstream biological analysis. However, because of high noise and dropout events, clustering spatial transcriptomics data remains challenging, and effective algorithms are lacking. Here we develop a novel method that jointly performs dimension reduction and spatial clustering with Bayesian Factor Analysis for zero-inflated Spatial Transcriptomics data (BFAST). BFAST shows strong performance on simulated data and real spatial transcriptomics datasets, as shown by benchmarking against currently available methods. It extracts more biologically informative low-dimensional features than traditional dimensionality reduction approaches, thereby enhancing the accuracy and precision of clustering.
{"title":"BFAST: joint dimension reduction and spatial clustering with Bayesian factor analysis for zero-inflated spatial transcriptomics data.","authors":"Yang Xu, Dian Lv, Xuanxuan Zou, Liang Wu, Xun Xu, Xin Zhao","doi":"10.1093/bib/bbae594","DOIUrl":"10.1093/bib/bbae594","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11570543/pdf/","platform":"Semanticscholar","paperid":"142647077","RegionNum":2,"RegionCategory":"Biology","PMCID":"OA","Total":0}
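The term "zero-inflated" in the abstract above refers to count data with more zeros than a standard count distribution predicts, for example because of dropout events. As a generic illustration only, the sketch below uses a zero-inflated Poisson distribution; the abstract does not state which likelihood BFAST uses, so the choice of Poisson, the function name, and the parameter values are assumptions.

```python
import math


def zip_pmf(k: int, rate: float, dropout: float) -> float:
    """Probability of observing count k under a zero-inflated Poisson model.

    With probability `dropout` the count is forced to zero; otherwise it is
    drawn from a Poisson distribution with mean `rate`.
    """
    poisson = math.exp(-rate) * rate ** k / math.factorial(k)
    if k == 0:
        return dropout + (1.0 - dropout) * poisson
    return (1.0 - dropout) * poisson


# Assumed example parameters: mean expression 2.0, 30% dropout.
p_zero_plain = math.exp(-2.0)            # zero probability of a plain Poisson
p_zero_inflated = zip_pmf(0, 2.0, 0.3)   # zero probability with dropout
total_mass = sum(zip_pmf(k, 2.0, 0.3) for k in range(60))
```

Under these assumed parameters the zero-inflated model assigns a noticeably higher probability to a zero count than the plain Poisson does, while the probabilities still sum to one. A model that ignores this excess of zeros tends to underestimate expression, which is one reason zero inflation is modeled explicitly before dimension reduction and clustering.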
Rice consistently faces significant threats from biotic stresses, such as fungi, bacteria, pests, and viruses. Consequently, accurately and rapidly identifying previously unknown single-nucleotide polymorphisms (SNPs) in the rice genome is a critical challenge for rice research and the development of resistant varieties. However, the limited availability of high-quality rice genotype data has hindered this research. Deep learning has transformed biological research by facilitating the prediction and analysis of SNPs in biological sequence data. Convolutional neural networks are especially effective in extracting structural and local features from DNA sequences, leading to significant advancements in genomics. Meanwhile, the expanding catalog of genome-wide association studies provides valuable biological insights for rice research. Building on this, we introduce RiceSNP-BST, an automatic architecture search framework designed to predict SNPs associated with rice biotic stress traits (BST-associated SNPs) by integrating multidimensional features. Notably, the model successfully innovates the datasets, offering more precision than state-of-the-art methods while demonstrating good performance on an independent test set and cross-species datasets. Additionally, we extracted features from the original DNA sequences and employed causal inference to enhance the biological interpretability of the model. This study highlights the potential of RiceSNP-BST in advancing genome prediction in rice. Furthermore, a user-friendly web server for RiceSNP-BST (http://rice-snp-bst.aielab.cc) has been developed to support broader genome research.
{"title":"RiceSNP-BST: a deep learning framework for predicting biotic stress-associated SNPs in rice.","authors":"Jiajun Xu, Yujia Gao, Quan Lu, Renyi Zhang, Jianfeng Gui, Xiaoshuang Liu, Zhenyu Yue","doi":"10.1093/bib/bbae599","DOIUrl":"https://doi.org/10.1093/bib/bbae599","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142675005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lin Yuan, Ling Zhao, Yufeng Jiang, Zhen Shen, Qinhu Zhang, Ming Zhang, Chun-Hou Zheng, De-Shuang Huang
Gene regulatory networks (GRNs) play a vital role in understanding the structure and dynamics of cellular systems, revealing complex regulatory relationships, and exploring disease mechanisms. Recently, deep learning (DL)-based methods have been proposed to infer GRNs from single-cell transcriptomic data and have achieved impressive performance. However, these methods do not fully utilize graph topological information or high-order neighbor information from multiple receptive fields. To overcome these limitations, we propose scMGATGRN, a novel model based on a multiview graph attention network, to infer GRNs. scMGATGRN consists of three main components: a graph attention network (GAT), a multiview model, and a view-level attention mechanism. The GAT extracts essential features of the gene regulatory network. The multiview model simultaneously utilizes local feature information and high-order neighbor feature information of nodes in the network. The view-level attention mechanism dynamically adjusts the relative importance of the node embeddings from the two views and aggregates them efficiently. To verify the effectiveness of scMGATGRN, we compared its performance with 10 methods (five shallow learning algorithms and five state-of-the-art DL-based methods) on seven benchmark single-cell RNA sequencing (scRNA-seq) datasets from five cell lines (two human and three mouse) with four different kinds of ground-truth networks. The experimental results show that scMGATGRN outperforms the competing methods and demonstrate the potential of this model for inferring GRNs. The code and data of scMGATGRN are freely available on GitHub (https://github.com/nathanyl/scMGATGRN).
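The view-level attention step described above can be sketched generically as a softmax-weighted sum of the per-view embeddings of one node. This is a minimal sketch under stated assumptions, not the scMGATGRN implementation: the scoring rule (dot product of a query vector with the tanh of each embedding) and the names `softmax`, `fuse_views`, and `query` are hypothetical, and in a trained model the query would be a learned parameter.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_embeddings: Sequence[Sequence[float]],
               query: Sequence[float]) -> List[float]:
    """Aggregate one node's embeddings from several views.

    Each view gets a scalar score (assumed form: query . tanh(embedding)),
    the scores are normalized with softmax, and the embeddings are
    combined as a weighted sum.
    """
    scores = [
        sum(q * math.tanh(h) for q, h in zip(query, emb))
        for emb in view_embeddings
    ]
    weights = softmax(scores)
    dim = len(query)
    return [
        sum(w * emb[i] for w, emb in zip(weights, view_embeddings))
        for i in range(dim)
    ]


# Hypothetical 2-dimensional embeddings of one gene from a local view
# and a high-order neighbor view; a zero query weights both views equally.
fused = fuse_views([[1.0, 0.0], [0.0, 1.0]], query=[0.0, 0.0])
```

With a non-zero query the view whose embedding aligns better with the query receives the larger weight, which is the sense in which the relative importance of the views is adjusted.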
{"title":"scMGATGRN: a multiview graph attention network-based method for inferring gene regulatory networks from single-cell transcriptomic data.","authors":"Lin Yuan, Ling Zhao, Yufeng Jiang, Zhen Shen, Qinhu Zhang, Ming Zhang, Chun-Hou Zheng, De-Shuang Huang","doi":"10.1093/bib/bbae526","DOIUrl":"https://doi.org/10.1093/bib/bbae526","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484520/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiplexed spatial proteomics reveals the spatial organization of cells in tumors, which is associated with important clinical outcomes such as survival and treatment response. This spatial organization is often summarized using spatial summary statistics, including Ripley's K and Besag's L. However, if multiple regions of the same tumor are imaged, it is unclear how to relate the resulting per-image summaries to a single patient-level endpoint. We evaluate existing approaches for accommodating multiple images when associating summary statistics with outcomes. First, we consider averaging-based approaches, in which multiple summaries for a single sample are combined in a weighted mean. We then propose a novel class of ensemble testing approaches in which we simulate random weights used to aggregate summaries, test for an association with outcomes, and combine the P-values. We systematically evaluate the performance of these approaches via simulation and application to data from non-small cell lung cancer, colorectal cancer, and triple-negative breast cancer. We find that the optimal strategy varies, but a simple weighted average of the summary statistics based on the number of cells in each image often offers the highest power and controls type I error effectively. When the size of the imaged regions varies, incorporating this variation into the weighted aggregation may yield additional power in cases where the varying size is informative. Ensemble testing (but not resampling) offered high power and type I error control across conditions in our simulated data sets.
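The two aggregation ideas in this abstract can be sketched in a few lines. The first function is the cell-count-weighted mean of per-image summaries. The second combines P-values with the Cauchy combination rule, which is an assumption chosen for illustration, since the abstract does not say which combination rule the authors use. All names and example values are hypothetical.

```python
import math
from typing import Sequence


def weighted_patient_summary(summaries: Sequence[float],
                             cell_counts: Sequence[int]) -> float:
    """Combine per-image spatial summaries (e.g. Ripley's K at a fixed
    radius) into one patient-level value, weighting by cells per image."""
    if not summaries or len(summaries) != len(cell_counts):
        raise ValueError("need one cell count per summary")
    total = sum(cell_counts)
    if total <= 0:
        raise ValueError("total cell count must be positive")
    return sum(s * n for s, n in zip(summaries, cell_counts)) / total


def cauchy_combine(p_values: Sequence[float]) -> float:
    """Combine P-values with equal weights using the Cauchy combination
    rule (assumed here; other combination rules are possible)."""
    if not p_values or any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("P-values must lie strictly between 0 and 1")
    # Map each P-value to a standard Cauchy quantile and average them.
    t = sum(math.tan((0.5 - p) * math.pi) for p in p_values) / len(p_values)
    # Upper-tail probability of the averaged statistic.
    return 0.5 - math.atan(t) / math.pi


# Hypothetical patient with two imaged regions of 100 and 300 cells.
patient_value = weighted_patient_summary([1.0, 3.0], [100, 300])

# Hypothetical P-values from tests run under different random weightings.
combined_p = cauchy_combine([0.5, 0.5])
```

In an ensemble test along these lines, each random weighting of the images would yield one patient-level summary per patient and one association P-value, and the combination step would reduce those P-values to a single result.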
{"title":"Statistical analysis of multiple regions-of-interest in multiplexed spatial proteomics data.","authors":"Sarah Samorodnitsky, Michael C Wu","doi":"10.1093/bib/bbae522","DOIUrl":"10.1093/bib/bbae522","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11491162/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}