BMC Bioinformatics: Latest Articles

Be-dataHIVE: a base editing database.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-10-15, DOI: 10.1186/s12859-024-05898-0
Lucas Schneider, Peter Minary

Base editing is an enhanced gene editing approach that enables the precise transformation of single nucleotides and has the potential to cure rare diseases. The design process of base editors is labour-intensive and outcomes are not easily predictable. For any clinical use, base editing has to be accurate and efficient, so bystander mutations have to be minimized. In recent years, computational models to predict base editing outcomes have been developed. However, the overall robustness and performance of those models are limited. One way to improve performance is to train models on a large, diverse, and feature-rich dataset, which does not exist for the base editing field. Hence, we developed BE-dataHIVE, a MySQL database that covers over 460,000 gRNA-target combinations. The current version of BE-dataHIVE consists of data from five studies and is enriched with melting temperatures and energy terms. Furthermore, multiple data structures for machine learning were computed and are directly available. The database can be accessed via our website https://be-datahive.com/ or via the API and is therefore suitable for both practitioners and machine learning researchers.
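
The abstract does not say how the melting temperatures or the machine learning data structures are computed. As a rough illustration only, the sketch below shows two common choices: a one-hot encoding of a nucleotide sequence and the Wallace rule (Tm = 2(A+T) + 4(G+C), a coarse estimate intended for short oligonucleotides). Both function names are hypothetical and are not part of the BE-dataHIVE website or API.

```python
import numpy as np

BASES = "ACGT"  # column order of the one-hot encoding (an assumption of this sketch)


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (length, 4) matrix with one 1 per row."""
    # str.index raises ValueError for any character outside A, C, G, T.
    idx = np.array([BASES.index(b) for b in seq.upper()], dtype=int)
    out = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    out[np.arange(len(seq)), idx] = 1.0
    return out


def wallace_tm(seq: str) -> int:
    """Wallace-rule melting temperature in degrees Celsius (coarse estimate)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    protospacer = "GACCGGTTAACC"  # made-up example sequence
    print(one_hot(protospacer).shape)
    print(wallace_tm(protospacer))
```

More accurate melting temperatures are usually obtained from nearest-neighbour thermodynamic models; the Wallace rule is used here only because it fits in one line.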

Citations: 0
LDAGM: prediction lncRNA-disease asociations by graph convolutional auto-encoder and multilayer perceptron based on multi-view heterogeneous networks.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-10-15, DOI: 10.1186/s12859-024-05950-z
Bing Zhang, Haoyu Wang, Chao Ma, Hai Huang, Zhou Fang, Jiaxing Qu

Background: Long non-coding RNAs (lncRNAs) can prevent, diagnose, and treat a variety of complex human diseases, and it is crucial to establish a method to efficiently predict lncRNA-disease associations.

Results: In this paper, we propose a prediction method for the lncRNA-disease association relationship, named LDAGM, which is based on the Graph Convolutional Autoencoder and Multilayer Perceptron model. The method first extracts the functional similarity and Gaussian interaction profile kernel similarity of lncRNAs and miRNAs, as well as the semantic similarity and Gaussian interaction profile kernel similarity of diseases. It then constructs six homogeneous networks and deeply fuses them using a deep topology feature extraction method. The fused networks facilitate feature complementation and deep mining of the original association relationships, capturing the deep connections between nodes. Next, by combining the obtained deep topological features with the similarity network of lncRNA, disease, and miRNA interactions, we construct a multi-view heterogeneous network model. The Graph Convolutional Autoencoder is employed for nonlinear feature extraction. Finally, the extracted nonlinear features are combined with the deep topological features of the multi-view heterogeneous network to obtain the final feature representation of the lncRNA-disease pair. Prediction of the lncRNA-disease association relationship is performed using the Multilayer Perceptron model. To enhance the performance and stability of the Multilayer Perceptron model, we introduce a hidden layer called the aggregation layer in the Multilayer Perceptron model. Through a gate mechanism, it controls the flow of information between each hidden layer in the Multilayer Perceptron model, aiming to achieve optimal feature extraction from each hidden layer.

Conclusions: Parameter analysis, ablation studies, and comparison experiments verified the effectiveness of this method, and case studies verified the accuracy of this method in predicting lncRNA-disease association relationships.
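
The Gaussian interaction profile (GIP) kernel similarity mentioned in the Results is commonly defined as K(i, j) = exp(-γ ‖IP(i) - IP(j)‖²), where IP(i) is row i of a binary association matrix and γ is a bandwidth normalized by the mean squared norm of the profiles. The sketch below follows that common definition; the function name and the `gamma_prime` default are assumptions of this example, and LDAGM's exact settings are not given in the abstract.

```python
import numpy as np


def gip_kernel(assoc, gamma_prime: float = 1.0) -> np.ndarray:
    """Gaussian interaction profile kernel between the rows of `assoc`.

    assoc: binary matrix, rows = entities (e.g. lncRNAs),
           columns = partners (e.g. diseases).
    """
    assoc = np.asarray(assoc, dtype=float)
    sq_norms = (assoc ** 2).sum(axis=1)
    mean_sq = sq_norms.mean()
    if mean_sq == 0:
        raise ValueError("association matrix has no known associations")
    gamma = gamma_prime / mean_sq  # bandwidth normalized by average profile norm
    # Squared Euclidean distance between every pair of interaction profiles.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * assoc @ assoc.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-gamma * d2)


if __name__ == "__main__":
    toy = np.array([[1, 0, 1],
                    [1, 0, 1],
                    [0, 1, 0]])  # made-up lncRNA-disease associations
    print(np.round(gip_kernel(toy), 3))
```

Transposing the association matrix gives the corresponding similarity for the column entities (for example diseases).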

Citations: 0
DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-10-14, DOI: 10.1186/s12859-024-05955-8
Minghao Yang, Zehua Wang, Zizhuo Yan, Wenxiang Wang, Qian Zhu, Changlong Jin

Background: The rapid advancements in deep neural network models have significantly enhanced the ability to extract features from microbial sequence data, which is critical for addressing biological challenges. However, the scarcity and complexity of labeled microbial data pose substantial difficulties for supervised learning approaches. To address these issues, we propose DNASimCLR, an unsupervised framework designed for efficient gene sequence data feature extraction.

Results: DNASimCLR leverages convolutional neural networks and the SimCLR framework, based on contrastive learning, to extract intricate features from diverse microbial gene sequences. Pre-training was conducted on two classic large-scale unlabelled datasets encompassing metagenomes and viral gene sequences. Subsequent classification tasks were performed by fine-tuning the pretrained model. Our experiments demonstrate that DNASimCLR is at least comparable to state-of-the-art techniques for gene sequence classification. Among convolutional neural network-based approaches, DNASimCLR surpasses the latest existing methods, establishing its superiority over state-of-the-art CNN-based feature extraction techniques. Furthermore, the model exhibits superior performance across diverse tasks in analyzing biological sequence data, showcasing its robust adaptability.

Conclusions: DNASimCLR represents a robust and database-agnostic solution for gene sequence classification. Its versatility allows it to perform well in scenarios involving novel or previously unseen gene sequences, making it a valuable tool for diverse applications in genomics.
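
SimCLR, the framework named in the Results, trains an encoder with the normalized temperature-scaled cross-entropy (NT-Xent) loss: two augmented views of the same sample should be more similar to each other than to any other sample in the batch. The NumPy sketch below computes that loss for a batch of embeddings. The temperature value and the function name are assumptions of this example; the abstract does not state DNASimCLR's augmentations or hyperparameters.

```python
import numpy as np


def nt_xent_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent loss for two views; z1[i] and z2[i] embed the same sequence."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0).astype(float)
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity via unit vectors
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own candidate
    # Index of the positive partner for each of the 2n rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view_a = rng.normal(size=(8, 16))
    noisy_copy = view_a + 0.05 * rng.normal(size=(8, 16))
    unrelated = rng.normal(size=(8, 16))
    print(nt_xent_loss(view_a, noisy_copy))  # well-aligned views
    print(nt_xent_loss(view_a, unrelated))   # unrelated views
```

In pre-training, the embeddings would come from the convolutional encoder applied to two augmented versions of each unlabelled sequence; here random vectors stand in for them.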

Citations: 0
A multi-task graph deep learning model to predict drugs combination of synergy and sensitivity scores.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-10-10, DOI: 10.1186/s12859-024-05925-0
Samar Monem, Aboul Ella Hassanien, Alaa H Abdel-Hamid

Background: Drug combination treatments have proven to be a realistic technique for treating challenging diseases such as cancer by enhancing efficacy and mitigating side effects. To achieve the therapeutic goals of these combinations, it is essential to employ multi-targeted drug combinations, which maximize effectiveness and synergistic effects.

Results: This paper proposes 'MultiComb', a multi-task deep learning (MTDL) model designed to simultaneously predict the synergy and sensitivity of drug combinations. The model uses a graph convolution network to represent the Simplified Molecular-Input Line-Entry System (SMILES) strings of two drugs, generating their respective features. In addition, three fully connected subnetworks extract features of the cancer cell line. These drug and cell line features are then concatenated and processed through an attention mechanism, which outputs two optimized feature representations for the target tasks. The cross-stitch model learns the relationship between these tasks. Finally, each learned task feature is fed into fully connected subnetworks to predict the synergy and sensitivity scores. The proposed model is validated using the O'Neil benchmark dataset, which includes 38 unique drugs combined to form 17,901 drug combination pairs tested across 37 unique cancer cell lines. The model's performance is measured using mean square error (MSE), mean absolute error (MAE), coefficient of determination (R²), Spearman, and Pearson scores. For synergy prediction, the mean MSE, MAE, R², Spearman, and Pearson values are 232.37, 9.59, 0.57, 0.76, and 0.73, respectively. For sensitivity prediction, the corresponding values are 15.59, 2.74, 0.90, 0.95, and 0.95.
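
A cross-stitch unit, in its usual formulation, mixes the feature vectors of two tasks through a small learnable matrix: each task's output is a weighted sum of its own features and the other task's features. The sketch below shows only that linear mixing step with fixed weights. The variable names and the example weights are assumptions; the abstract does not describe how MultiComb parameterizes or initializes its cross-stitch model.

```python
import numpy as np


def cross_stitch(x_synergy, x_sensitivity, alpha):
    """Mix two task feature vectors with a 2x2 cross-stitch matrix `alpha`.

    alpha[0, 0] and alpha[1, 1] weight each task's own features;
    alpha[0, 1] and alpha[1, 0] weight the features shared from the other task.
    """
    alpha = np.asarray(alpha, dtype=float)
    if alpha.shape != (2, 2):
        raise ValueError("alpha must be a 2x2 matrix")
    x_synergy = np.asarray(x_synergy, dtype=float)
    x_sensitivity = np.asarray(x_sensitivity, dtype=float)
    out_synergy = alpha[0, 0] * x_synergy + alpha[0, 1] * x_sensitivity
    out_sensitivity = alpha[1, 0] * x_synergy + alpha[1, 1] * x_sensitivity
    return out_synergy, out_sensitivity


if __name__ == "__main__":
    syn = np.array([1.0, 0.0])
    sen = np.array([0.0, 1.0])
    # Mostly task-specific, with a little sharing (illustrative values).
    print(cross_stitch(syn, sen, [[0.9, 0.1], [0.1, 0.9]]))
```

In a trained network the entries of `alpha` are parameters learned by gradient descent, so the model itself decides how much the two tasks share.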

Conclusion: This paper proposes an MTDL model to predict synergy and sensitivity scores for drug combinations targeting specific cancer cell lines. The MTDL model outperforms existing approaches.

Citations: 0
MethylSeqLogo: DNA methylation smart sequence logos.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-10-09, DOI: 10.1186/s12859-024-05896-2
Fei-Man Hsu, Paul Horton

Background: Some transcription factors, MYC for example, bind sites of potentially methylated DNA. This may increase binding specificity as such sites are (1) highly under-represented in the genome, and (2) offer additional, tissue specific information in the form of hypo- or hyper-methylation. Fortunately, bisulfite sequencing data can be used to investigate this phenomenon.

Method: We developed MethylSeqLogo, an extension of sequence logos which includes new elements to indicate DNA methylation and under-represented dimers in each position of a set of binding sites. Our method displays information from both DNA strands, and takes into account the sequence context (CpG or other) and genome region (promoter versus whole genome) needed to properly assess the expected background dimer frequency and level of methylation. MethylSeqLogo preserves sequence logo semantics: the relative height of nucleotides within a column represents their proportion in the binding sites, the absolute height of each column represents information (relative entropy), and the height of all columns added together represents total information.

Results: We present figures illustrating the utility of MethylSeqLogo for summarizing data from several CpG binding transcription factors. The logos show that unmethylated CpG binding sites are a feature of transcription factors such as MYC and ZBTB33, while some other CpG binding transcription factors, such as CEBPB, appear methylation neutral.

Conclusions: Our software enables users to explore bisulfite and ChIP sequencing data sets and, in the process, obtain publication-quality figures.
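
The sequence logo semantics described in the Method can be made concrete: the height of a column is the relative entropy (in bits) between the observed nucleotide proportions and a background distribution, and each letter takes a share of that height equal to its proportion. The sketch below uses a plain single-nucleotide background, which is a simplifying assumption; MethylSeqLogo itself accounts for dimer context, genome region, and methylation, none of which is modelled here.

```python
import numpy as np


def logo_heights(counts, background):
    """Return (letter_heights, column_information) for a count matrix.

    counts:     array of shape (positions, symbols) with observed counts
    background: array of shape (symbols,) with background probabilities
    """
    counts = np.asarray(counts, dtype=float)
    background = np.asarray(background, dtype=float)
    p = counts / counts.sum(axis=1, keepdims=True)
    # Relative entropy per column; symbols with p = 0 contribute nothing.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / background), 0.0)
    info = terms.sum(axis=1)
    # Each letter's height is its proportion of the column's information.
    return p * info[:, None], info


if __name__ == "__main__":
    counts = [[10, 0, 0, 0],   # position 1: always A
              [5, 5, 0, 0]]    # position 2: A or C equally often
    heights, info = logo_heights(counts, [0.25, 0.25, 0.25, 0.25])
    print(info)        # information per column, in bits
    print(info.sum())  # total information of the logo
```

With a uniform background this reduces to the classic sequence logo; a non-uniform background (for example one in which CpG is rare) raises the information credited to under-represented symbols.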

Citations: 0
NeuroimaGene: an R package for assessing the neurological correlates of genetically regulated gene expression.
IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-10-08, DOI: 10.1186/s12859-024-05936-x
Xavier Bledsoe, Eric R Gamazon

Background: We present the NeuroimaGene resource as an R package designed to assist researchers in identifying genes and neurologic features relevant to psychiatric and neurological health. While recent studies have identified hundreds of genes as potential components of pathophysiology in neurologic and psychiatric disease, interpreting the physiological consequences of this variation is challenging. The integration of neuroimaging data with molecular findings is a step toward addressing this challenge. In addition to sharing associations with both molecular variation and clinical phenotypes, neuroimaging features are intrinsically informative of cognitive processes. NeuroimaGene provides a tool to understand how disease-associated genes relate to the intermediate structure of the brain.

Results: We created NeuroimaGene, a user-friendly, open access R package now available for public use. Its primary function is to identify neuroimaging derived brain features that are impacted by genetically regulated expression of user-provided genes or gene sets. This resource can be used to (1) characterize individual genes or gene sets as relevant to the structure and function of the brain, (2) identify the region(s) of the brain or body in which expression of target gene(s) is neurologically relevant, (3) impute the brain features most impacted by user-defined gene sets such as those produced by cohort level gene association studies, and (4) generate publication level, modifiable visual plots of significant findings. We demonstrate the utility of the resource by identifying neurologic correlates of stroke-associated genes derived from pre-existing analyses.

Conclusions: Integrating neurologic data as an intermediate phenotype in the pathway from genes to brain-based diagnostic phenotypes increases the interpretability of molecular studies and enriches our understanding of disease pathophysiology. The NeuroimaGene R package is designed to assist in this process and is publicly available for use.

Citations: 0
CrossFeat: a transformer-based cross-feature learning model for predicting drug side effect frequency.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-08 DOI: 10.1186/s12859-024-05915-2
Bin Baek, Hyunju Lee

Background: Safe drug treatment requires an understanding of the potential side effects. Identifying the frequency of drug side effects can reduce the risks associated with drug use. However, existing computational methods for predicting drug side effect frequencies heavily depend on known drug side effect frequency information. Consequently, these methods face challenges when predicting the side effect frequencies of new drugs. Although a few methods can predict the side effect frequencies of new drugs, they exhibit unreliable performance owing to the exclusion of drug-side effect relationships.

Results: This study proposed CrossFeat, a model based on convolutional neural network-transformer architecture with cross-feature learning that can predict the occurrence and frequency of drug side effects for new drugs, even in the absence of information regarding drug-side effect relationships. CrossFeat facilitates the concurrent learning of drugs and side effect information within its transformer architecture. This simultaneous exchange of information enables drugs to learn about their associated side effects, while side effects concurrently acquire information about the respective drugs. Such bidirectional learning allows for the comprehensive integration of drug and side effect knowledge. Our five-fold cross-validation experiments demonstrated that CrossFeat outperforms existing studies in predicting side effect frequencies for new drugs without prior knowledge.

Conclusions: Our model offers a promising approach for predicting the drug side effect frequencies, particularly for new drugs where prior information is limited. CrossFeat's superior performance in cross-validation experiments, along with evidence from case studies and ablation experiments, highlights its effectiveness.

Citations: 0
C-ZIPTF: stable tensor factorization for zero-inflated multi-dimensional genomics data.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-05 DOI: 10.1186/s12859-024-05886-4
Daniel Chafamo, Vignesh Shanmugam, Neriman Tokcan

In the past two decades, genomics has advanced significantly, with single-cell RNA-sequencing (scRNA-seq) marking a pivotal milestone. ScRNA-seq provides unparalleled insights into cellular diversity and has spurred diverse studies across multiple conditions and samples, resulting in an influx of complex multidimensional genomics data. This highlights the need for robust methodologies capable of handling the complexity and multidimensionality of such genomics data. Furthermore, single-cell data grapples with sparsity due to issues like low capture efficiency and dropout effects. Tensor factorizations (TF) have emerged as powerful tools to unravel the complex patterns from multi-dimensional genomics data. Classic TF methods, based on maximum likelihood estimation, struggle with zero-inflated count data, while the inherent stochasticity in TFs further complicates result interpretation and reproducibility. Our paper introduces Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel method for high-dimensional zero-inflated count data factorization. We also present Consensus-ZIPTF (C-ZIPTF), merging ZIPTF with a consensus-based approach to address stochasticity. We evaluate our proposed methods on synthetic zero-inflated count data, simulated scRNA-seq data, and real multi-sample multi-condition scRNA-seq datasets. ZIPTF consistently outperforms baseline matrix and tensor factorization methods, displaying enhanced reconstruction accuracy for zero-inflated data. When dealing with high probabilities of excess zeros, ZIPTF achieves up to 2.4× better accuracy. Moreover, C-ZIPTF notably enhances the factorization's consistency. When tested on synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently uncover known and biologically meaningful gene expression programs. Access our data and code at: https://github.com/klarman-cell-observatory/scBTF and https://github.com/klarman-cell-observatory/scbtf_experiments.
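
For readers unfamiliar with the zero-inflated Poisson distribution that ZIPTF is named after, the following is a minimal sketch of its probability mass function. It is not taken from the scBTF repository and says nothing about how the authors fit their factorization; the function name `zip_pmf` and the parameter names `pi` (probability of an excess zero) and `lam` (Poisson rate) are illustrative assumptions.

```python
import math


def zip_pmf(k: int, pi: float, lam: float) -> float:
    """Probability of count k under a zero-inflated Poisson (illustrative only).

    With probability `pi` the observation is a structural (excess) zero;
    otherwise it is drawn from an ordinary Poisson distribution with rate `lam`.
    """
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros arise from both the excess-zero component and the Poisson itself.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson
```

The zero count receives extra mass compared with a plain Poisson, which is the property that makes ordinary Poisson-based factorizations a poor fit for sparse single-cell counts.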

Citations: 0
Tabular deep learning: a comparative study applied to multi-task genome-wide prediction.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-04 DOI: 10.1186/s12859-024-05940-1
Yuhua Fan, Patrik Waldmann

Purpose: More accurate prediction of phenotype traits can increase the success of genomic selection in both plant and animal breeding studies and provide more reliable disease risk prediction in humans. Traditional approaches typically use regression models based on linear assumptions between the genetic markers and the traits of interest. Non-linear models have been considered as an alternative tool for modeling genomic interactions (i.e. non-additive effects) and other subtle non-linear patterns between markers and phenotype. Deep learning has become a state-of-the-art non-linear prediction method for sound, image and language data. However, genomic data is better represented in a tabular format. The existing literature on deep learning for tabular data proposes a wide range of novel architectures and reports successful results on various datasets. Tabular deep learning applications in genome-wide prediction (GWP) are still rare. In this work, we perform an overview of the main families of recent deep learning architectures for tabular data and apply them to multi-trait regression and multi-class classification for GWP on real gene datasets.

Methods: The study involves an extensive overview of recent deep learning architectures for tabular data learning: NODE, TabNet, TabR, TabTransformer, FT-Transformer, AutoInt, GANDALF, SAINT and LassoNet. These architectures are applied to multi-trait GWP. Comprehensive benchmarks of various tabular deep learning methods are conducted to identify best practices and determine their effectiveness compared to traditional methods.

Results: Extensive experimental results on several genomic datasets (three for multi-trait regression and two for multi-class classification) highlight LassoNet as a standout performer, surpassing both other tabular deep learning models and the highly efficient tree-based LightGBM method in terms of both prediction accuracy and computing efficiency.

Conclusion: Through a series of evaluations on real-world genomic datasets, the study identifies LassoNet as a standout performer, surpassing decision tree methods like LightGBM and other tabular deep learning architectures in terms of both predictive accuracy and computing efficiency. Moreover, the inherent variable selection property of LassoNet provides a systematic way to find important genetic markers that contribute to phenotype expression.

Citations: 0
Ribosomal computing: implementation of the computational method.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-10-03 DOI: 10.1186/s12859-024-05945-w
Pratima Chatterjee, Prasun Ghosal, Sahadeb Shit, Arindam Biswas, Saurav Mallik, Sarah Allabun, Manal Othman, Almubarak Hassan Ali, E Elshiekh, Ben Othman Soufiene

Background: Several computational and mathematical models of protein synthesis have been explored to accomplish the quantitative analysis of protein synthesis components and polysome structure. The effect of gene sequence (coding and non-coding regions) on protein synthesis, mutation in gene sequence, and the functional model of the ribosome need to be explored to further investigate the relationship among protein synthesis components. Ribosomal computing is implemented by imitating the functional property of protein synthesis.

Result: In the proposed work, a general framework of ribosomal computing is demonstrated by developing a computational model to present the relationship between biological details of protein synthesis and computing principles. Here, mathematical abstractions are chosen carefully without probing into intricate chemical details of the micro-operations of protein synthesis for ease of understanding. This model demonstrates the cause and effect of ribosome stalling during protein synthesis and the relationship between functional protein and gene sequence. Moreover, it also reveals the computing nature of ribosome molecules and other protein synthesis components. The effect of gene mutation on protein synthesis is also explored in this model.

Conclusion: The computational model for ribosomal computing is implemented in this work. The proposed model demonstrates the relationship among gene sequences and protein synthesis components. This model also helps to implement a simulation environment (a simulator) for generating protein chains from gene sequences and can spot the problem during protein synthesis. Thus, this simulator can identify a disease that can happen due to a protein synthesis problem and suggest precautions for it.
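
As a rough illustration of what generating a protein chain from a gene sequence and spotting a synthesis problem can look like, the toy sketch below reads an mRNA string codon by codon and stops at the first stop codon. It is not the authors' model or simulator: the function `translate`, the reduced `CODON_TABLE` (a small subset of the standard genetic code), and the two example sequences are all assumptions made for this example.

```python
# Toy codon table: a small subset of the standard genetic code (assumption:
# only the codons used in the example sequences below are included).
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "UAC": "Tyr", "AAA": "Lys",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(mrna: str) -> list[str]:
    """Read codons from the first AUG until a stop codon; return the residues."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, so nothing is synthesized
    chain = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        residue = CODON_TABLE.get(codon)
        if residue is None:
            raise ValueError(f"codon {codon!r} is not in the toy table")
        if residue == "Stop":
            break
        chain.append(residue)
    return chain


# Hypothetical sequences: the mutant differs by one base (UAC -> UAA),
# which introduces a premature stop codon (a nonsense mutation).
wild_type = "AUGUUUGGCUACAAAUAA"
mutant = "AUGUUUGGCUAAAAAUAA"

if len(translate(mutant)) < len(translate(wild_type)):
    print("mutant chain is truncated:", translate(mutant))
```

Comparing the two chains flags the truncated product, which is the simplest form of the gene-sequence-to-protein relationship the abstract describes.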

Citations: 0