Current Bioinformatics: Latest Articles

A Comparative Review and Analysis of Computational Predictors for Identification of Enhancer and their Strength
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-06-04, DOI: 10.2174/0115748936285942240513064919
Mehwish Gill, Muhammad Kabir, Saeed Ahmed, Muhammad Asif Subhani, Maqsood Hayat
Enhancers are short functional regions (50–1500 bp) in the genome that play an effective role in activating gene transcription in the presence of transcription factors (TFs). Many human diseases, such as cancer and inflammatory bowel disease, are correlated with genetic variations in enhancers. The precise recognition of enhancers provides useful insights for understanding the pathogenesis of human diseases and their treatments. High-throughput experiments are considered essential tools for characterizing enhancers; however, these methods are laborious, costly and time-consuming. Computational methods are considered alternative solutions for accurate and rapid identification of enhancers. Over the past years, numerous computational predictors have been devised for predicting enhancers and their strength. A comprehensive review and thorough assessment are indispensable to systematically compare the performance of sequence-based bioinformatics tools for enhancers. Given the increasing interest in this domain, we conducted a large-scale analysis and assessment of the state-of-the-art enhancer predictors to evaluate their scalability and generalization power. Additionally, we classified the existing approaches into three main groups: conventional machine-learning, ensemble and deep learning-based approaches. Furthermore, the study has focused on exploring the factors that are crucial for developing precise and reliable predictors, such as designing trusted benchmark/independent datasets, feature representation schemes, feature selection methods, classification strategies, evaluation metrics and webservers. Finally, the insights from this review are expected to provide important guidelines to the research community and pharmaceutical companies in general, and for high-throughput tools for the detection and characterization of enhancers in particular.
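The abstract lists evaluation metrics among the factors that matter when comparing predictors, but does not say which ones the review uses. As a minimal sketch, assuming the four metrics commonly reported for binary classifiers (accuracy, sensitivity, specificity and the Matthews correlation coefficient), they could be computed from 0/1 labels as follows. The function name `binary_metrics` and the label encoding (1 = enhancer, 0 = non-enhancer) are illustrative assumptions.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and MCC for 0/1 labels.

    Assumption: 1 marks an enhancer and 0 a non-enhancer.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty classes: report 0.0 instead of dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

Reporting MCC alongside accuracy is a common choice when the positive and negative classes are unbalanced, since accuracy alone can look high for a predictor that favors the majority class.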
Citations: 0
Multimodal Deep Learning for Cancer Survival Prediction: A Review
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-05-31, DOI: 10.2174/0115748936289033240424071522
Ge Zhang, Chenwei Ma, Chaokun Yan, Huimin Luo, Jianlin Wang, Wenjuan Liang, Junwei Luo
Background: Cancer has emerged as the "leading killer" of human health. Survival prediction is a crucial branch of cancer prognosis. It aims to estimate patients' survival risk based on their disease conditions. Accurate and efficient survival prediction is vital in cancer patients' treatment and clinical management, preventing unnecessary suffering and conserving precious medical resources. Deep learning has been extensively applied in cancer diagnosis, prognosis, and treatment management. The decreasing cost of next-generation sequencing, continuous development of related databases, and in-depth research on multimodal deep learning have provided opportunities for establishing more functionally rich and accurate survival prediction models. Objective: The field of cancer survival prediction still lacks a review of multimodal deep learning methods. Methods: We conducted a statistical analysis of the relevant research on multimodal deep learning for cancer survival prediction. We first filtered keywords from 6 known relevant papers. Then, we searched PubMed and Google Scholar for relevant publications from 2018 to 2022 using "Multimodal", "Deep Learning" and "Cancer Survival Prediction" as keywords. Next, we extended the search through backward and forward citation searches. Subsequently, we conducted a detailed analysis and review of these studies based on their datasets and methods. Results: We present a comprehensive systematic review of the multimodal deep learning research on cancer survival prediction from 2018 to 2022. Conclusion: Multimodal deep learning has demonstrated powerful data aggregation capabilities and excellent performance in improving cancer survival prediction. It has made a significant positive impact on facilitating the advancement of automated cancer diagnosis and precision oncology.
Citations: 0
Enhancing Drug Peptide Sequence Prediction Using Multi-view Feature Fusion Learning
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-05-27, DOI: 10.2174/0115748936294345240510112941
Junyu Zhang, Ronglin Lu, Hongmei Zhou, Xinbo Jiang
Background: Currently, various types of peptides have broad implications for human health and disease. Some drug peptides play significant roles in sensory science, drug research, and cancer biology. The prediction and classification of peptide sequences are of significant importance to various industries. However, predicting peptide sequences through biological experiments is a time-consuming and expensive process. Moreover, the task of protein sequence classification and prediction faces challenges due to the high dimensionality, nonlinearity, and irregularity of protein sequence data, along with the presence of numerous unknown or unlabeled protein sequences. Therefore, an accurate and efficient method for predicting peptide classification is necessary. Methods: In our work, we used two pre-trained models to extract sequence features, TextCNN (Convolutional Neural Networks for Text Classification) and Transformer. We extracted the overall semantic information of the sequences using Transformer Encoder and extracted the local semantic information between sequences using TextCNN and concatenated them into a new feature. Finally, we used the concatenated feature for classification prediction. To validate this approach, we conducted experiments on the BP dataset, THP dataset and DPP-IV dataset and compared them with some pre-trained models. Results: Since TextCNN and Transformer Encoder extract features from different perspectives, the concatenated feature contains multi-view information, which improves the accuracy of the peptide predictor. Conclusion: Ultimately, our model demonstrated superior metrics, highlighting its efficacy in peptide sequence prediction and classification.
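The fusion step described in the Methods (one global view, one local view, concatenated into a single feature) can be illustrated with a toy sketch. This is not the authors' model: the two extractors below are simple hand-written stand-ins for the Transformer Encoder and TextCNN, and the names `global_view`, `local_view` and `fuse` are hypothetical. Only the concatenation pattern is the point.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def global_view(seq):
    """Whole-sequence view: amino-acid composition (20 values).

    Stand-in for a global encoder; NOT a Transformer.
    """
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def local_view(seq, window=3):
    """Local view: per residue type, the highest density seen in any window.

    Stand-in for convolution plus max-pooling; NOT a TextCNN.
    """
    n_windows = max(1, len(seq) - window + 1)
    windows = [seq[i:i + window] for i in range(n_windows)]
    return [max(w.count(a) for w in windows) / window for a in AMINO_ACIDS]


def fuse(seq, window=3):
    """Concatenate the two views into one multi-view feature vector."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return global_view(seq) + local_view(seq, window)
```

A downstream classifier would then take the 40-value fused vector as input; in the paper's setting the two halves would instead be learned embeddings.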
Citations: 0
An Exploratory Review on Recent Computational Approaches Devised for MiRNA Disease Association Prediction
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-05-20, DOI: 10.2174/0115748936293219240426051148
S Sujamol, E R Vimina, U. Krishnakumar
Recent evidence has demonstrated the fundamental role of miRNAs as disease biomarkers and their role in disease progression and pathology. Identifying disease-related miRNAs using computational approaches has become one of the trending topics in health informatics. Many biological databases and online tools have been developed for uncovering novel disease-related miRNAs. Hence, a brief overview of disease biomarkers, miRNAs as disease biomarkers and their role in complex disorders is given here. Various methods for calculating miRNA and disease similarities are included, and the existing machine learning and network-based computational approaches for detecting disease-associated miRNAs are reviewed along with the benchmark datasets used. Finally, the performance metrics, validation measures and online tools used for miRNA Disease Association (MDA) predictions are also outlined.
Citations: 0
Validating the Distinctiveness of the Omicron Lineage within the SARSCov-2 based on Protein Language Models
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-04-30, DOI: 10.2174/0115748936291075240409080924
Ke Dong, Jingyang Gao
Introduction: Variants of concern were identified in severe acute respiratory syndrome coronavirus 2, namely Alpha, Beta, Gamma, Delta, and Omicron. This study explores the mutations of the Omicron lineage and its differences from other lineages through a protein language model. Methods: By inputting the severe acute respiratory syndrome coronavirus 2 wild-type sequence into the protein language model ESM-1v, this study obtained the score for each position mutating to other amino acids and calculated the overall trend of the mutation scores of new variants of concern. Objective: Analyze the differences in the number of Omicron amino acid mutations compared to the other four variants of concern (VOCs) using statistical methods, and use ESM-1v to analyze the specificity of Omicron amino acid mutations. Results: When the proportion of unobserved mutations to observed mutations is 4:15, Omicron still generates a large number of newly emerging mutations. The overall score and the overall ranking for the Omicron family are both low. Conclusion: Mutations in the Omicron lineage are different from amino acid mutations in other lineages. The findings of this paper deepen the understanding of the spatial distribution of spike protein amino acid mutations and overall trends of newly emerging mutations corresponding to different variants of concern. This also provides insights into simulating the evolution of the Omicron lineage.
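The per-position scoring in the Methods can be sketched without calling any model. A common way to score a substitution with a protein language model is the log-odds between the mutant and wild-type residue at that position; whether the paper uses exactly this form is an assumption, as is averaging the scores to summarize a variant. Here `log_probs` is a hypothetical table of model outputs (one dict per position, mapping amino acid to log-probability), and the function names are illustrative.

```python
def mutation_scores(wild_type, log_probs):
    """Score every single substitution as a log-odds against the wild type.

    log_probs[i][aa] is the (hypothetical) model log-probability of amino
    acid `aa` at position i. Keys in the result use 1-based positions,
    e.g. "M1K" means M at position 1 replaced by K.
    """
    if len(wild_type) != len(log_probs):
        raise ValueError("one log-probability table is needed per position")
    scores = {}
    for i, wt in enumerate(wild_type):
        for aa, lp in log_probs[i].items():
            if aa != wt:
                scores[f"{wt}{i + 1}{aa}"] = lp - log_probs[i][wt]
    return scores


def variant_score(mutations, scores):
    """Summarize a variant as the mean score of its mutations (an assumption)."""
    if not mutations:
        raise ValueError("variant must carry at least one mutation")
    return sum(scores[m] for m in mutations) / len(mutations)
```

Under this convention a negative score means the model considers the mutant residue less likely than the wild type at that position, so a variant with a low mean score is one whose mutations the model finds unexpected.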
Citations: 0
Comparative Analysis of Deep Generative Model for Industrial Enzyme Design
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-04-16, DOI: 10.2174/0115748936303223240404043202
Beibei Zhang, Qiaozhen Meng, Chengwei Ai, Guihua Duan, Ercheng Wang, Fei Guo
Although enzymes have the advantage of efficient catalysis, natural enzymes lack stability in industrial environments and may not even support the required catalytic reactions. This creates an urgent need to design new enzymes de novo. Computational design is a powerful tool, allowing rapid and efficient exploration of sequence space and facilitating the design of novel enzymes tailored to specific conditions and requirements. It is therefore beneficial to design industrial enzymes de novo using computational methods. Currently, the only tool explicitly designed for enzyme-only generation performs unsatisfactorily. We have selected several general protein sequence design tools and systematically evaluated their effectiveness when applied to specific industrial enzymes. We investigated the literature related to protein generation. We summarized the computational methods used for sequence generation into three categories: structure-conditional sequence generation, sequence generation without structural constraints, and co-generation of sequence and structure. To effectively evaluate the ability of six computational tools to generate enzyme sequences, we first constructed a luciferase dataset named Luc_64. Then we assessed the quality of enzyme sequences generated by these methods on this dataset, including amino acid distribution, EC number validation, etc. We also assessed sequences generated by structure-based methods on existing public datasets using sequence recovery rates and root-mean-square deviation (RMSD) from a sequence and structure perspective. On the functionality dataset Luc_64, ABACUS-R and ProteinMPNN stood out for producing sequences with amino acid distributions and functionalities closely matching those of naturally occurring luciferase enzymes, suggesting their effectiveness in preserving essential enzymatic characteristics. Across both benchmark datasets, ABACUS-R and ProteinMPNN also exhibited the highest sequence recovery rates, indicating their superior ability to generate sequences closely resembling the original enzyme structures. Our study provides a crucial reference for researchers selecting appropriate enzyme sequence design tools, highlighting the strengths and limitations of each tool in generating accurate and functional enzyme sequences. ProteinMPNN and ABACUS-R emerged as the most effective tools in our evaluation, offering high accuracy in sequence recovery and RMSD and maintaining the functional integrity of enzymes through accurate amino acid distribution. Meanwhile, the performance of general protein design tools when transferred to specific industrial enzymes was fairly evaluated on our industrial enzyme benchmark.
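The two structure-based measures named in the abstract, sequence recovery rate and RMSD, have simple standard definitions. The sketch below is a minimal illustration under stated assumptions: the sequences are already aligned and of equal length, and the two coordinate sets are already superimposed (no optimal alignment such as the Kabsch algorithm is performed). The function names are illustrative, not taken from the paper or from any of the evaluated tools.

```python
import math


def sequence_recovery(native, designed):
    """Fraction of positions where the designed residue equals the native one.

    Assumes both sequences are aligned and have the same length.
    """
    if len(native) != len(designed) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(native, designed) if a == b)
    return matches / len(native)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D points.

    Assumes the two structures are already superimposed; this function
    does not search for the optimal rotation or translation.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

For example, `sequence_recovery("MKTA", "MKSA")` gives 0.75 because three of the four positions match. Higher recovery and lower RMSD both indicate a design closer to the native enzyme.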
Citations: 0
An Effective Method to Identify Cooperation Driver Gene Sets
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-04-15, DOI: 10.2174/0115748936293238240313081211
Wei Zhang, Yifu Zeng, Bihai Zhao, Jie Xiong, Tuanfei Zhu, Jingjing Wang, Guiji Li, Lei Wang
Background: In cancer genomics research, identifying driver genes is a challenging task. Detecting cancer-driver genes can further our understanding of cancer risk factors and promote the development of personalized treatments. Gene mutations exhibit both mutual exclusivity and co-occurrence, and most of the existing methods focus on identifying driver pathways or driver gene sets through the study of mutual exclusivity, that is, functionally redundant gene sets. By contrast, less research has been conducted on cooperation genes with co-occurring mutations. Objective: We propose an effective method that combines two characteristics of genes, co-occurring mutations and the coordinated regulation of proliferation genes, to explore cooperation driver genes. Methods: This study is divided into three stages: (1) constructing a binary gene mutation matrix; (2) combining mutation co-occurrence characteristics to identify the candidate cooperation gene sets; and (3) constructing a gene regulation network to screen the cooperation gene sets that synergistically regulate proliferation. Results: The method's performance is evaluated on three TCGA cancer datasets, and the experiments showed that it can detect effective cooperation driver gene sets. In further investigations, it was determined that the discovered set of co-driver genes could be used to generate prognostic classifications, which could be biologically significant and provide complementary information to the cancer genome. Conclusion: Our approach is effective in identifying sets of cancer cooperation driver genes, and the results can be used as clinical markers to stratify patients.
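Stages (1) and (2) of the Methods can be sketched on toy data. This is a minimal illustration, not the authors' algorithm: it builds a binary patient-by-gene mutation matrix and counts how many patients carry both mutations of each gene pair. The simple count threshold `min_patients` is an assumption (the abstract does not state the co-occurrence criterion), all function names are hypothetical, and stage (3), the regulation-network screen, is not covered.

```python
from itertools import combinations


def build_mutation_matrix(patient_mutations, genes):
    """Stage 1: binary matrix, one row per patient and one column per gene.

    An entry is 1 if the patient carries a mutation in that gene, else 0.
    """
    return {
        patient: [1 if gene in mutated else 0 for gene in genes]
        for patient, mutated in patient_mutations.items()
    }


def cooccurrence_counts(matrix, genes):
    """Stage 2 (sketch): number of patients mutated in both genes of a pair."""
    counts = {}
    for (i, g1), (j, g2) in combinations(enumerate(genes), 2):
        counts[(g1, g2)] = sum(1 for row in matrix.values() if row[i] and row[j])
    return counts


def candidate_pairs(counts, min_patients):
    """Keep gene pairs co-mutated in at least `min_patients` patients.

    The threshold rule is an illustrative assumption.
    """
    return sorted(pair for pair, n in counts.items() if n >= min_patients)
```

A real analysis would likely replace the raw count with a statistical test of co-occurrence and extend pairs to larger gene sets, but the matrix layout stays the same.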
Citations: 0
Integrated Somatic Mutation Network Diffusion Model for Stratification of Breast Cancer into Different Metabolic Mutation Subtypes
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-04-15, DOI: 10.2174/0115748936298012240322091111
Dongqing Su, Honghao Li, Tao Wang, Min Zou, Haodong Wei, Yuqiang Xiong, Hongmei Sun, Shiyuan Wang, Qilemuge Xi, Yongchun Zuo, Lei Yang
Background: Mutations in metabolism-related genes in somatic cells can disrupt metabolic pathways, causing patients to exhibit different molecular and pathological features. Objective: In this study, we focused on somatic mutation data to investigate the significance of metabolic mutation typing in guiding the prognosis and treatment of breast cancer patients. Methods: The somatic mutation profile of breast cancer patients was analyzed and smoothed using a network diffusion model within the protein-protein interaction network to construct a comprehensive somatic mutation network diffusion profile. Subsequently, a deep clustering approach was employed to explore metabolic mutation typing in breast cancer based on integrated metabolic pathway information and the somatic mutation network diffusion profile. In addition, we employed deep neural networks and machine learning prediction models to assess the feasibility of predicting drug responses from somatic mutation network diffusion profiles. Results: Significant differences in prognosis and metabolic heterogeneity were observed among the metabolic mutation subtypes, which were characterized by distinct alterations in metabolic pathways and genetic mutations, and these mutational features offered potential targets for subtype-specific therapies. Furthermore, the predictions of the drug response model built on the somatic mutation network diffusion profile were strongly consistent with the observed drug responses. Conclusion: Metabolic mutation typing of cancer assists in guiding patient prognosis and treatment.
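The smoothing step can be illustrated with a generic network propagation sketch. The abstract does not specify the diffusion model, so this assumes a random-walk-with-restart style update, F(t+1) = alpha * F(t) * W + (1 - alpha) * F(0), where W is the degree-normalized adjacency of the interaction network. The function name, the value of alpha, and the three-node graph are hypothetical.

```python
def diffuse(profile, adjacency, alpha=0.7, tol=1e-6, max_iter=1000):
    """Smooth a binary mutation profile over a network by iterative propagation.

    profile:   {gene: 0 or 1} mutation status for one patient
    adjacency: {gene: [neighbour genes]} undirected interaction network
    alpha:     fraction of each score passed on to neighbours per step (assumed)
    """
    nodes = list(adjacency)
    f0 = {n: float(profile.get(n, 0)) for n in nodes}
    f = dict(f0)
    for _ in range(max_iter):
        # Restart term: every node keeps part of its original signal.
        nxt = {n: (1 - alpha) * f0[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                nxt[n] += alpha * f[n]  # isolated node keeps its own score
                continue
            share = alpha * f[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        change = sum(abs(nxt[n] - f[n]) for n in nodes)
        f = nxt
        if change < tol:
            break
    return f

# Toy path network A - B - C with a single mutation in A (hypothetical).
network = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
smoothed = diffuse({"A": 1}, network)
print(smoothed)
```

After propagation, unmutated genes close to a mutated gene receive non-zero scores, which is what makes sparse mutation profiles comparable across patients before clustering.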
Citations: 0
GB5mCPred: Cross-species 5mc Site Predictor Based on Bootstrap-based Stochastic Gradient Boosting Method for Poaceae
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-04-15, DOI: 10.2174/0115748936285544231221113226
Dipro Sinha, Tanwy Dasmandal, Md Yeasin, D.C Mishra, Anil Rai, Sunil Archak
Background: One of the most prevalent epigenetic alterations in all three kingdoms of life is 5mC, which plays a part in a wide range of biological functions. Although in-vitro techniques are more effective in detecting epigenetic alterations, they are time- and cost-intensive. Artificial intelligence-based in silico approaches have been used to overcome these obstacles. Aim: This study aimed to develop an ML-based predictor for the detection of 5mC sites in Poaceae. Objective: The objective of this study was to evaluate machine learning and deep learning models for the prediction of 5mC sites in rice. Method: DNA sequences were vectorized using three distinct feature sets: Oligo Nucleotide Frequencies (k = 2), Mono-nucleotide Binary Encoding, and Chemical Properties of Nucleotides. Two deep learning models, long short-term memory (LSTM) and Bidirectional LSTM (Bi-LSTM), as well as nine machine learning models (random forest, gradient boosting, naïve Bayes, regression tree, k-nearest neighbour, support vector machine, AdaBoost, multiple logistic regression, and artificial neural network) were investigated. Bootstrap resampling was also used to build more efficient models, along with a hybrid feature selection module for dimensionality reduction and removal of irrelevant features from the vector space. Result: Random Forest achieved the highest accuracy, specificity, and MCC (92.6%, 86.41%, and 0.84, respectively). Gradient Boosting achieved the highest sensitivity (96.85%). The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) showed that the best three models for accurate prediction of 5mC sites in rice were Random Forest, Gradient Boosting, and Support Vector Machine. We developed an R package, ‘GB5mCPred’, which is available on CRAN (https://cran.r-project.org/web/packages/GB5mcPred/index.html). A user-friendly prediction server based on this algorithm was also built (http://cabgrid.res.in:5474/). Conclusion: With nearly equal TOPSIS scores, Random Forest, Gradient Boosting, and Support Vector Machine ended up being the best three models. The major rationale may be found in their architectural design, since they are gradual learning models that can capture the 5mC sites more correctly than other learning models.
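Two of the three feature sets named in the Method, oligonucleotide frequencies with k = 2 and mono-nucleotide binary encoding, can be sketched as follows. This is an illustrative Python sketch and not the GB5mCPred R package's code: the function names, the A/C/G/T ordering, and the example sequence are assumptions, and the chemical-property features are omitted because the abstract does not define them.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible k-mer (16 values when k = 2)."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases are not counted
            counts[window] += 1
    return [counts[m] / total for m in kmers]

def binary_encoding(seq):
    """One-hot encode each base as four 0/1 values in A, C, G, T order."""
    return [int(base == n) for base in seq for n in NUCLEOTIDES]

# Hypothetical short sequence; real inputs would be fixed-length windows
# centred on a candidate cytosine.
seq = "ACGTAC"
features = kmer_frequencies(seq) + binary_encoding(seq)
print(len(features))
```

Concatenating the encodings gives one fixed-length vector per sequence, which is the form that classifiers such as random forest or gradient boosting expect.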
Citations: 0
DiffSeqMol: A Non-Autoregressive Diffusion-Based Approach for Molecular Sequence Generation and Optimization
IF 4, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2024-04-03, DOI: 10.2174/0115748936285493240307071916
Zixu Wang, Yangyang Chen, Xiulan Guo, Yayang Li, Pengyong Li, Chunyan Li, Xiucai Ye, Tetsuya Sakurai
Background: The application of deep generative models to molecular discovery has surged in recent years. Currently, molecular generation and molecular optimization are dominated by autoregressive models, regardless of how molecular data is represented. However, an emerging paradigm in the generation domain is diffusion models, which treat data non-autoregressively and have achieved significant breakthroughs in areas such as image generation. Methods: The potential and capability of diffusion models in molecular generation and optimization tasks remain largely unexplored. To investigate the applicability of diffusion models to molecular exploration, we proposed DiffSeqMol, a molecular sequence generation model underpinned by a diffusion process. Results & Discussion: DiffSeqMol distinguishes itself from traditional autoregressive methods by its capacity to draw samples from random noise and directly generate the entire molecule. Through experimental evaluations, we demonstrated that DiffSeqMol can match, and even surpass, the performance of established state-of-the-art models on unconditional generation tasks and molecular optimization tasks. Conclusion: Taken together, our results show that DiffSeqMol can be considered a promising molecular generation method. It opens new pathways to traverse the expansive chemical space and to discover novel molecules.
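For readers unfamiliar with diffusion models, the sketch below shows the generic Gaussian forward (noising) process that such models learn to reverse: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. It is not DiffSeqMol's formulation, which the abstract does not describe. The linear beta schedule, the step count, and the three-value "embedding" are assumptions chosen only for illustration.

```python
import math
import random

def alpha_bar_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / max(steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_sample(x0, alpha_bar, rng):
    """Sample x_t directly from x_0 for a given cumulative alpha_bar."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)            # fixed seed for reproducibility
alpha_bars = alpha_bar_schedule(1000)
x0 = [1.0, -1.0, 0.5]             # hypothetical embedding of one sequence token

x_early = noise_sample(x0, alpha_bars[0], rng)    # still close to x0
x_late = noise_sample(x0, alpha_bars[-1], rng)    # almost pure Gaussian noise
print(alpha_bars[0], alpha_bars[-1])
```

Generation runs this process in reverse: a trained network starts from random noise and denoises all positions of the sequence together, rather than emitting one token at a time as an autoregressive model does.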
Citations: 0