Pub Date: 2024-10-15 DOI: 10.1186/s12859-024-05898-0
Lucas Schneider, Peter Minary
Base editing is an enhanced gene editing approach that enables the precise transformation of single nucleotides and has the potential to cure rare diseases. The design process of base editors is labour-intensive and outcomes are not easily predictable. For any clinical use, base editing has to be accurate and efficient, so bystander mutations have to be minimized. In recent years, computational models to predict base editing outcomes have been developed. However, the overall robustness and performance of those models are limited. One way to improve performance is to train models on a diverse, feature-rich, and large dataset, which does not exist for the base editing field. Hence, we developed BE-dataHIVE, a MySQL database that covers over 460,000 gRNA target combinations. The current version of BE-dataHIVE consists of data from five studies and is enriched with melting temperatures and energy terms. Furthermore, multiple data structures for machine learning were computed and are directly available. The database can be accessed via our website https://be-datahive.com/ or API and is therefore suitable for practitioners and machine learning researchers.
Title: Be-dataHIVE: a base editing database. Journal: BMC Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476525/pdf/
Background: Long non-coding RNAs (lncRNAs) can aid in the prevention, diagnosis, and treatment of a variety of complex human diseases, so it is crucial to establish a method that efficiently predicts lncRNA-disease associations.
Results: In this paper, we propose a prediction method for the lncRNA-disease association relationship, named LDAGM, which is based on the Graph Convolutional Autoencoder and Multilayer Perceptron model. The method first extracts the functional similarity and Gaussian interaction profile kernel similarity of lncRNAs and miRNAs, as well as the semantic similarity and Gaussian interaction profile kernel similarity of diseases. It then constructs six homogeneous networks and deeply fuses them using a deep topology feature extraction method. The fused networks facilitate feature complementation and deep mining of the original association relationships, capturing the deep connections between nodes. Next, by combining the obtained deep topological features with the similarity network of lncRNA, disease, and miRNA interactions, we construct a multi-view heterogeneous network model. The Graph Convolutional Autoencoder is employed for nonlinear feature extraction. Finally, the extracted nonlinear features are combined with the deep topological features of the multi-view heterogeneous network to obtain the final feature representation of the lncRNA-disease pair. Prediction of the lncRNA-disease association relationship is performed using the Multilayer Perceptron model. To enhance the performance and stability of the Multilayer Perceptron model, we introduce a hidden layer called the aggregation layer in the Multilayer Perceptron model. Through a gate mechanism, it controls the flow of information between each hidden layer in the Multilayer Perceptron model, aiming to achieve optimal feature extraction from each hidden layer.
Conclusions: Parameter analysis, ablation studies, and comparison experiments verified the effectiveness of this method, and case studies verified the accuracy of this method in predicting lncRNA-disease association relationships.
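The Gaussian interaction profile (GIP) kernel similarity mentioned in the Results can be illustrated with a short sketch. This follows the commonly used definition of the GIP kernel, not code from the LDAGM authors; the function name `gip_kernel_similarity`, the `gamma_prime` parameter, and the toy association matrix are assumptions made for illustration.

```python
import numpy as np

def gip_kernel_similarity(assoc, gamma_prime=1.0):
    """GIP kernel similarity between the rows of a binary association matrix.

    Each row is the "interaction profile" of one entity (e.g. one lncRNA
    against all diseases). The bandwidth is normalised by the mean squared
    norm of the profiles, as in the usual GIP definition.
    """
    profiles = np.asarray(assoc, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("association matrix contains no known associations")
    gamma = gamma_prime / mean_sq_norm
    # Pairwise squared Euclidean distances between profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)

# Hypothetical toy data: 3 lncRNAs x 3 diseases.
toy_assoc = np.array([[1, 0, 1],
                      [1, 0, 1],
                      [0, 1, 0]])
toy_similarity = gip_kernel_similarity(toy_assoc)
```

Entities with identical interaction profiles receive similarity 1, and the similarity decays as the profiles diverge. The same function can be applied to the transposed matrix to obtain the similarity on the other side of the association.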
Title: LDAGM: prediction lncRNA-disease asociations by graph convolutional auto-encoder and multilayer perceptron based on multi-view heterogeneous networks. Authors: Bing Zhang, Haoyu Wang, Chao Ma, Hai Huang, Zhou Fang, Jiaxing Qu. Journal: BMC Bioinformatics. Pub Date: 2024-10-15. DOI: 10.1186/s12859-024-05950-z. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11481433/pdf/
Background: The rapid advancements in deep neural network models have significantly enhanced the ability to extract features from microbial sequence data, which is critical for addressing biological challenges. However, the scarcity and complexity of labeled microbial data pose substantial difficulties for supervised learning approaches. To address these issues, we propose DNASimCLR, an unsupervised framework designed for efficient gene sequence data feature extraction.
Results: DNASimCLR leverages convolutional neural networks and the SimCLR framework, based on contrastive learning, to extract intricate features from diverse microbial gene sequences. Pre-training was conducted on two classic large-scale unlabelled datasets encompassing metagenomes and viral gene sequences. Subsequent classification tasks were performed by fine-tuning the pretrained model. Our experiments demonstrate that DNASimCLR is at least comparable to state-of-the-art techniques for gene sequence classification. Among convolutional neural network-based approaches, DNASimCLR surpasses the latest existing methods, including state-of-the-art CNN-based feature extraction techniques. Furthermore, the model exhibits superior performance across diverse tasks in analyzing biological sequence data, showcasing its robust adaptability.
Conclusions: DNASimCLR represents a robust and database-agnostic solution for gene sequence classification. Its versatility allows it to perform well in scenarios involving novel or previously unseen gene sequences, making it a valuable tool for diverse applications in genomics.
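For readers unfamiliar with SimCLR, the contrastive objective it is usually trained with (the normalised temperature-scaled cross-entropy, NT-Xent) can be sketched in a few lines. This is a generic NumPy illustration of that loss, not the DNASimCLR implementation; the function name, the default temperature, and the toy embeddings are assumptions, and all embeddings are assumed to be non-zero vectors.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for a batch of paired embeddings.

    z_a[i] and z_b[i] are embeddings of two augmented views of the same
    sequence; every other embedding in the batch acts as a negative.
    """
    z = np.concatenate([z_a, z_b], axis=0).astype(float)
    z /= np.linalg.norm(z, axis=1, keepdims=True)      # cosine similarity via unit vectors
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # a view is never its own negative
    n = z_a.shape[0]
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-sum-exp over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_positive = sim[np.arange(2 * n), positives] - log_denominator
    return float(-log_prob_positive.mean())

# Hypothetical toy embeddings: matching views give a lower loss than mismatched ones.
views_a = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_matched = nt_xent_loss(views_a, views_a)
loss_mismatched = nt_xent_loss(views_a, views_a[::-1])
```

Minimising this loss pulls the two views of each sequence together and pushes them away from the rest of the batch, which is what allows pre-training without labels.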
Title: DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification. Authors: Minghao Yang, Zehua Wang, Zizhuo Yan, Wenxiang Wang, Qian Zhu, Changlong Jin. Journal: BMC Bioinformatics. Pub Date: 2024-10-14. DOI: 10.1186/s12859-024-05955-8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476100/pdf/
Pub Date: 2024-10-10 DOI: 10.1186/s12859-024-05925-0
Samar Monem, Aboul Ella Hassanien, Alaa H Abdel-Hamid
Background: Drug combination treatments have proven to be a realistic technique for treating challenging diseases such as cancer by enhancing efficacy and mitigating side effects. To achieve the therapeutic goals of these combinations, it is essential to employ multi-targeted drug combinations, which maximize effectiveness and synergistic effects.
Results: This paper proposes 'MultiComb', a multi-task deep learning (MTDL) model designed to simultaneously predict the synergy and sensitivity of drug combinations. The model utilizes a graph convolution network to represent the Simplified Molecular-Input Line-Entry System (SMILES) strings of two drugs, generating their respective features. In addition, three fully connected subnetworks extract features of the cancer cell line. These drug and cell line features are then concatenated and processed through an attention mechanism, which outputs two optimized feature representations for the target tasks. The cross-stitch model learns the relationship between these tasks. Finally, each learned task feature is fed into fully connected subnetworks to predict the synergy and sensitivity scores. The proposed model is validated using the O'Neil benchmark dataset, which includes 38 unique drugs combined to form 17,901 drug combination pairs and tested across 37 unique cancer cell lines. The model's performance is measured using mean square error (MSE), mean absolute error (MAE), coefficient of determination (R²), Spearman, and Pearson scores. For synergy prediction, the mean values of these metrics are 232.37, 9.59, 0.57, 0.76, and 0.73, respectively. For sensitivity prediction, the mean values are 15.59, 2.74, 0.90, 0.95, and 0.95, respectively.
Conclusion: This paper proposes an MTDL model to predict synergy and sensitivity scores for drug combinations targeting specific cancer cell lines. The MTDL model outperforms existing approaches.
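The five evaluation metrics listed in the Results have standard definitions, sketched below with NumPy. This is an illustration only and is not taken from the MultiComb code; the function name and the toy values are assumptions, and the Spearman computation here assumes there are no tied values.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, R^2, Spearman and Pearson for two 1-D score vectors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    pearson = float(np.corrcoef(y_true, y_pred)[0, 1])
    # Spearman = Pearson correlation of the ranks (valid when there are no ties).
    rank_true = np.argsort(np.argsort(y_true))
    rank_pred = np.argsort(np.argsort(y_pred))
    spearman = float(np.corrcoef(rank_true, rank_pred)[0, 1])
    return {"MSE": mse, "MAE": mae, "R2": r2, "Spearman": spearman, "Pearson": pearson}

# Hypothetical toy scores, not values from the paper.
toy_metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Lower MSE and MAE are better, while R², Spearman, and Pearson are better when closer to 1, which is why the reported synergy and sensitivity figures have to be read metric by metric.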
Title: A multi-task graph deep learning model to predict drugs combination of synergy and sensitivity scores. Journal: BMC Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11468365/pdf/
Pub Date: 2024-10-09 DOI: 10.1186/s12859-024-05896-2
Fei-Man Hsu, Paul Horton
Background: Some transcription factors, MYC for example, bind sites of potentially methylated DNA. This may increase binding specificity as such sites are (1) highly under-represented in the genome, and (2) offer additional, tissue specific information in the form of hypo- or hyper-methylation. Fortunately, bisulfite sequencing data can be used to investigate this phenomenon.
Method: We developed MethylSeqLogo, an extension of sequence logos which includes new elements to indicate DNA methylation and under-represented dimers in each position of a set of binding sites. Our method displays information from both DNA strands, and takes into account the sequence context (CpG or other) and genome region (promoter versus whole genome) needed to properly assess the expected background dimer frequency and level of methylation. MethylSeqLogo preserves sequence logo semantics: the relative height of nucleotides within a column represents their proportion in the binding sites, while the absolute height of each column represents information (relative entropy) and the height of all columns added together represents total information.
Results: We present figures illustrating the utility of using MethylSeqLogo to summarize data from several CpG binding transcription factors. The logos show that unmethylated CpG binding sites are a feature of transcription factors such as MYC and ZBTB33, while some other CpG binding transcription factors, such as CEBPB, appear methylation neutral.
Conclusions: Our software enables users to explore bisulfite and ChIP sequencing data sets and, in the process, obtain publication-quality figures.
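The sequence logo semantics described in the Method can be made concrete with a small sketch: each column's height is the relative entropy of its nucleotide frequencies against a background, and each letter's height is its share of that column. This covers only the classic logo calculation with a per-nucleotide background; it does not reproduce MethylSeqLogo's methylation or dimer elements, and the function name and toy sites are assumptions.

```python
import math
from collections import Counter

def column_information(sites, background=None):
    """Per-column information (bits) and letter heights for aligned DNA sites."""
    alphabet = "ACGT"
    if background is None:
        background = {base: 0.25 for base in alphabet}  # assumed uniform background
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    columns = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        total = sum(counts[base] for base in alphabet)
        freqs = {base: counts[base] / total for base in alphabet}
        # Relative entropy of the column against the background.
        info = sum(p * math.log2(p / background[base])
                   for base, p in freqs.items() if p > 0)
        # Letter height = proportion of the nucleotide times column information.
        heights = {base: p * info for base, p in freqs.items()}
        columns.append((info, heights))
    return columns

# Hypothetical toy binding sites.
toy_columns = column_information(["ACGT", "ACGA", "ACGT", "ACGC"])
total_information = sum(info for info, _ in toy_columns)
```

With a uniform background, a fully conserved column carries 2 bits, and the sum over columns gives the total information of the motif.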
Title: MethylSeqLogo: DNA methylation smart sequence logos. Journal: BMC Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11462690/pdf/
Pub Date: 2024-10-08 DOI: 10.1186/s12859-024-05936-x
Xavier Bledsoe, Eric R Gamazon
Background: We present the NeuroimaGene resource as an R package designed to assist researchers in identifying genes and neurologic features relevant to psychiatric and neurological health. While recent studies have identified hundreds of genes as potential components of pathophysiology in neurologic and psychiatric disease, interpreting the physiological consequences of this variation is challenging. The integration of neuroimaging data with molecular findings is a step toward addressing this challenge. In addition to sharing associations with both molecular variation and clinical phenotypes, neuroimaging features are intrinsically informative of cognitive processes. NeuroimaGene provides a tool to understand how disease-associated genes relate to the intermediate structure of the brain.
Results: We created NeuroimaGene, a user-friendly, open access R package now available for public use. Its primary function is to identify neuroimaging derived brain features that are impacted by genetically regulated expression of user-provided genes or gene sets. This resource can be used to (1) characterize individual genes or gene sets as relevant to the structure and function of the brain, (2) identify the region(s) of the brain or body in which expression of target gene(s) is neurologically relevant, (3) impute the brain features most impacted by user-defined gene sets such as those produced by cohort level gene association studies, and (4) generate publication level, modifiable visual plots of significant findings. We demonstrate the utility of the resource by identifying neurologic correlates of stroke-associated genes derived from pre-existing analyses.
Conclusions: Integrating neurologic data as an intermediate phenotype in the pathway from genes to brain-based diagnostic phenotypes increases the interpretability of molecular studies and enriches our understanding of disease pathophysiology. The NeuroimaGene R package is designed to assist in this process and is publicly available for use.
Title: NeuroimaGene: an R package for assessing the neurological correlates of genetically regulated gene expression. Journal: BMC Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11463069/pdf/
Pub Date: 2024-10-08 DOI: 10.1186/s12859-024-05915-2
Bin Baek, Hyunju Lee
Background: Safe drug treatment requires an understanding of the potential side effects. Identifying the frequency of drug side effects can reduce the risks associated with drug use. However, existing computational methods for predicting drug side effect frequencies heavily depend on known drug side effect frequency information. Consequently, these methods face challenges when predicting the side effect frequencies of new drugs. Although a few methods can predict the side effect frequencies of new drugs, they exhibit unreliable performance owing to the exclusion of drug-side effect relationships.
Results: This study proposed CrossFeat, a model based on convolutional neural network-transformer architecture with cross-feature learning that can predict the occurrence and frequency of drug side effects for new drugs, even in the absence of information regarding drug-side effect relationships. CrossFeat facilitates the concurrent learning of drugs and side effect information within its transformer architecture. This simultaneous exchange of information enables drugs to learn about their associated side effects, while side effects concurrently acquire information about the respective drugs. Such bidirectional learning allows for the comprehensive integration of drug and side effect knowledge. Our five-fold cross-validation experiments demonstrated that CrossFeat outperforms existing studies in predicting side effect frequencies for new drugs without prior knowledge.
Conclusions: Our model offers a promising approach for predicting the drug side effect frequencies, particularly for new drugs where prior information is limited. CrossFeat's superior performance in cross-validation experiments, along with evidence from case studies and ablation experiments, highlights its effectiveness.
Title: Crossfeat: a transformer-based cross-feature learning model for predicting drug side effect frequency. Journal: BMC Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11459996/pdf/
Pub Date : 2024-10-05 DOI: 10.1186/s12859-024-05886-4
Daniel Chafamo, Vignesh Shanmugam, Neriman Tokcan
In the past two decades, genomics has advanced significantly, with single-cell RNA-sequencing (scRNA-seq) marking a pivotal milestone. ScRNA-seq provides unparalleled insights into cellular diversity and has spurred diverse studies across multiple conditions and samples, resulting in an influx of complex multidimensional genomics data. This highlights the need for robust methodologies capable of handling the complexity and multidimensionality of such genomics data. Furthermore, single-cell data grapples with sparsity due to issues like low capture efficiency and dropout effects. Tensor factorizations (TF) have emerged as powerful tools to unravel the complex patterns from multi-dimensional genomics data. Classic TF methods, based on maximum likelihood estimation, struggle with zero-inflated count data, while the inherent stochasticity in TFs further complicates result interpretation and reproducibility. Our paper introduces Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel method for high-dimensional zero-inflated count data factorization. We also present Consensus-ZIPTF (C-ZIPTF), merging ZIPTF with a consensus-based approach to address stochasticity. We evaluate our proposed methods on synthetic zero-inflated count data, simulated scRNA-seq data, and real multi-sample multi-condition scRNA-seq datasets. ZIPTF consistently outperforms baseline matrix and tensor factorization methods, displaying enhanced reconstruction accuracy for zero-inflated data. When dealing with high probabilities of excess zeros, ZIPTF achieves up to 2.4× better accuracy. Moreover, C-ZIPTF notably enhances the factorization's consistency. When tested on synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently uncover known and biologically meaningful gene expression programs. Access our data and code at: https://github.com/klarman-cell-observatory/scBTF and https://github.com/klarman-cell-observatory/scbtf_experiments .
{"title":"C-ZIPTF: stable tensor factorization for zero-inflated multi-dimensional genomics data.","authors":"Daniel Chafamo, Vignesh Shanmugam, Neriman Tokcan","doi":"10.1186/s12859-024-05886-4","DOIUrl":"10.1186/s12859-024-05886-4","url":null,"PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11456250/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142378997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
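The C-ZIPTF abstract above relies on a zero-inflated Poisson (ZIP) likelihood for count data. As a minimal sketch of the standard ZIP distribution only (not the authors' factorization code), the function below evaluates the log-probability of a count, assuming a Poisson rate `lam > 0` and an excess-zero probability `0 <= pi < 1`. The names `zip_log_pmf`, `zip_neg_log_likelihood`, `lam`, and `pi` are hypothetical, and the paper's exact parameterization is not given in the abstract.

```python
import math


def zip_log_pmf(k, lam, pi):
    """Log-probability of count k under a zero-inflated Poisson.

    lam: Poisson rate (> 0); pi: probability of an excess zero (0 <= pi < 1).
    """
    if k < 0:
        return float("-inf")
    if k == 0:
        # A zero can come from the inflation component or from the Poisson itself.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    # Non-zero counts can only come from the Poisson component.
    log_poisson = -lam + k * math.log(lam) - math.lgamma(k + 1)
    return math.log(1.0 - pi) + log_poisson


def zip_neg_log_likelihood(counts, lam, pi):
    """Negative log-likelihood of independent counts sharing one (lam, pi)."""
    return -sum(zip_log_pmf(k, lam, pi) for k in counts)
```

With `pi = 0` the function reduces to the ordinary Poisson log-probability. Raising `pi` moves probability mass onto zero, which is the behaviour a plain Poisson model cannot capture in sparse scRNA-seq counts.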
Pub Date : 2024-10-04 DOI: 10.1186/s12859-024-05940-1
Yuhua Fan, Patrik Waldmann
Purpose: More accurate prediction of phenotype traits can increase the success of genomic selection in both plant and animal breeding studies and provide more reliable disease risk prediction in humans. Traditional approaches typically use regression models based on linear assumptions between the genetic markers and the traits of interest. Non-linear models have been considered as an alternative tool for modeling genomic interactions (i.e. non-additive effects) and other subtle non-linear patterns between markers and phenotype. Deep learning has become a state-of-the-art non-linear prediction method for sound, image and language data. However, genomic data is better represented in a tabular format. The existing literature on deep learning for tabular data proposes a wide range of novel architectures and reports successful results on various datasets. Tabular deep learning applications in genome-wide prediction (GWP) are still rare. In this work, we survey the main families of recent deep learning architectures for tabular data and apply them to multi-trait regression and multi-class classification for GWP on real gene datasets.
Methods: The study involves an extensive overview of recent deep learning architectures for tabular data learning: NODE, TabNet, TabR, TabTransformer, FT-Transformer, AutoInt, GANDALF, SAINT and LassoNet. These architectures are applied to multi-trait GWP. Comprehensive benchmarks of various tabular deep learning methods are conducted to identify best practices and determine their effectiveness compared to traditional methods.
Results: Extensive experimental results on several genomic datasets (three for multi-trait regression and two for multi-class classification) highlight LassoNet as a standout performer, surpassing both other tabular deep learning models and the highly efficient tree-based LightGBM method in terms of both prediction accuracy and computing efficiency.
Conclusion: Through a series of evaluations on real-world genomic datasets, the study identifies LassoNet as a standout performer, surpassing decision tree methods like LightGBM and other tabular deep learning architectures in terms of both predictive accuracy and computing efficiency. Moreover, the inherent variable selection property of LassoNet provides a systematic way to find important genetic markers that contribute to phenotype expression.
{"title":"Tabular deep learning: a comparative study applied to multi-task genome-wide prediction.","authors":"Yuhua Fan, Patrik Waldmann","doi":"10.1186/s12859-024-05940-1","DOIUrl":"10.1186/s12859-024-05940-1","url":null,"PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11452967/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142375044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
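The variable selection property credited to LassoNet in the abstract above can be made concrete with the objective from the original LassoNet formulation, which this abstract does not restate. As a sketch under that assumption, let \theta be the coefficients of a linear skip connection over the d input markers, W^{(1)}_j the first-hidden-layer weights attached to marker j, L the training loss, and \lambda and M hyperparameters:

```latex
\min_{\theta,\, W} \; L(\theta, W) + \lambda \lVert \theta \rVert_1
\quad \text{subject to} \quad
\lVert W^{(1)}_j \rVert_\infty \le M \,\lvert \theta_j \rvert,
\qquad j = 1, \dots, d
```

When the L1 penalty drives \theta_j to zero, the constraint forces every first-layer weight for marker j to zero as well, so that marker is removed from the network entirely. The hyperparameter values used in the GWP benchmarks are not given here.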
Pub Date : 2024-10-03 DOI: 10.1186/s12859-024-05945-w
Pratima Chatterjee, Prasun Ghosal, Sahadeb Shit, Arindam Biswas, Saurav Mallik, Sarah Allabun, Manal Othman, Almubarak Hassan Ali, E Elshiekh, Ben Othman Soufiene
Background: Several computational and mathematical models of protein synthesis have been explored to accomplish the quantitative analysis of protein synthesis components and polysome structure. The effects of the gene sequence (coding and non-coding regions) on protein synthesis, mutations in the gene sequence, and a functional model of the ribosome need to be explored to further investigate the relationships among protein synthesis components. Ribosomal computing is implemented by imitating the functional property of protein synthesis.
Result: In the proposed work, a general framework of ribosomal computing is demonstrated by developing a computational model to present the relationship between biological details of protein synthesis and computing principles. Here, mathematical abstractions are chosen carefully without probing into intricate chemical details of the micro-operations of protein synthesis for ease of understanding. This model demonstrates the cause and effect of ribosome stalling during protein synthesis and the relationship between functional protein and gene sequence. Moreover, it also reveals the computing nature of ribosome molecules and other protein synthesis components. The effect of gene mutation on protein synthesis is also explored in this model.
Conclusion: The computational model for ribosomal computing is implemented in this work. The proposed model demonstrates the relationship among gene sequences and protein synthesis components. This model also helps to implement a simulation environment (a simulator) for generating protein chains from gene sequences and can spot problems during protein synthesis. Thus, the simulator can identify diseases that may arise from protein synthesis problems and suggest precautions.
{"title":"Ribosomal computing: implementation of the computational method.","authors":"Pratima Chatterjee, Prasun Ghosal, Sahadeb Shit, Arindam Biswas, Saurav Mallik, Sarah Allabun, Manal Othman, Almubarak Hassan Ali, E Elshiekh, Ben Othman Soufiene","doi":"10.1186/s12859-024-05945-w","DOIUrl":"10.1186/s12859-024-05945-w","url":null,"PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11448306/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142364277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
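The ribosomal computing abstract above describes a simulator that generates protein chains from gene sequences and reports problems during synthesis. The toy sketch below is not the authors' model: it reads an mRNA string codon by codon using the standard genetic code and returns a status flag. The names `translate` and `CODON_TABLE` and the status strings are hypothetical, and the paper's treatment of ribosome stalling is not specified in the abstract.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order with the third base varying fastest.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string (alphabet A, C, G, U) into one-letter amino acids.

    Returns (protein, status). Translation starts at the first AUG and ends
    at the first in-frame stop codon. Any other character raises KeyError.
    """
    start = mrna.find("AUG")
    if start == -1:
        return "", "no start codon"
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            return "".join(protein), "complete"
        protein.append(amino_acid)
    return "".join(protein), "no stop codon"
```

For example, `translate("AUGGAGUAA")` yields the chain `ME`, while the single-base substitution in `"AUGGUGUAA"` yields `MV`. A mutation that creates an early stop codon shortens the chain, and a transcript without any in-frame stop codon is flagged with `"no stop codon"`.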