GB5mCPred: Cross-species 5mc Site Predictor Based on Bootstrap-based Stochastic Gradient Boosting Method for Poaceae

IF 2.4 | Zone 3 (Biology) | Q3, Biochemical Research Methods | Current Bioinformatics | Pub Date: 2024-04-15 | DOI: 10.2174/0115748936285544231221113226
Dipro Sinha, Tanwy Dasmandal, Md Yeasin, D.C Mishra, Anil Rai, Sunil Archak

Abstract

Background: 5-methylcytosine (5mC) is one of the most prevalent epigenetic alterations in all three kingdoms of life and plays a part in a wide range of biological functions. Although in vitro techniques are more effective at detecting epigenetic alterations, they are time- and cost-intensive. Artificial intelligence-based in silico approaches have been used to overcome these obstacles.

Aim: This study aimed to develop an ML-based predictor for the detection of 5mC sites in Poaceae.

Objective: The objective of this study was to evaluate machine learning and deep learning models for the prediction of 5mC sites in rice.

Method: DNA sequences were vectorized using three distinct feature sets: oligonucleotide frequencies (k = 2), mono-nucleotide binary encoding, and chemical properties of nucleotides. Two deep learning models, long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM), and nine machine learning models (random forest, gradient boosting, naïve Bayes, regression tree, k-nearest neighbour, support vector machine, AdaBoost, multiple logistic regression, and artificial neural network) were investigated. Bootstrap resampling was used to build more efficient models, along with a hybrid feature selection module for dimensionality reduction and removal of irrelevant features from the vector space.

Result: Random Forest achieved the highest accuracy, specificity, and Matthews correlation coefficient (MCC): 92.6%, 86.41%, and 0.84, respectively. Gradient Boosting achieved the highest sensitivity, 96.85%. The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) showed that the three best models for accurate prediction of 5mC sites in rice were Random Forest, Gradient Boosting, and Support Vector Machine. We developed an R package, 'GB5mCPred', which is available on CRAN (https://cran.r-project.org/web/packages/GB5mcPred/index.html). A user-friendly prediction server based on this algorithm is also available (http://cabgrid.res.in:5474/).

Conclusion: With nearly equal TOPSIS scores, Random Forest, Gradient Boosting, and Support Vector Machine were the three best models. The main reason may lie in their architectural design: they are gradual learning models that can capture 5mC sites more accurately than the other learning models.
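The three feature sets named in the Method can be illustrated with a minimal Python sketch. The abstract does not give the exact encoding tables, so the one-hot ordering (A, C, G, T) and the chemical-property triplets (ring structure, functional group, hydrogen-bond strength) below are assumptions based on commonly used conventions, and all function names are hypothetical rather than part of the GB5mCPred package.

```python
from itertools import product

BASES = "ACGT"
# All 16 dinucleotides (k = 2) in a fixed order: AA, AC, ..., TT.
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]

# Assumed one-hot scheme: A=1000, C=0100, G=0010, T=0001.
ONE_HOT = {b: [1 if i == j else 0 for j in range(4)] for i, b in enumerate(BASES)}

# Assumed chemical-property scheme: (purine?, amino group?, weak H-bond?).
CHEM = {"A": [1, 1, 1], "C": [0, 1, 0], "G": [1, 0, 0], "T": [0, 0, 1]}


def dinucleotide_frequencies(seq):
    """Relative frequency of each dinucleotide over overlapping windows."""
    seq = seq.upper()
    windows = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {d: 0 for d in DINUCS}
    for w in windows:
        if w in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[w] += 1
    total = len(windows)
    return [counts[d] / total if total else 0.0 for d in DINUCS]


def binary_encoding(seq):
    """Position-wise one-hot encoding; unknown bases become all zeros."""
    return [bit for b in seq.upper() for bit in ONE_HOT.get(b, [0, 0, 0, 0])]


def chemical_encoding(seq):
    """Position-wise chemical-property encoding; unknown bases become zeros."""
    return [bit for b in seq.upper() for bit in CHEM.get(b, [0, 0, 0])]


def encode(seq):
    """Concatenate the three feature sets into one feature vector."""
    return dinucleotide_frequencies(seq) + binary_encoding(seq) + chemical_encoding(seq)


if __name__ == "__main__":
    vector = encode("ACGTCG")  # toy sequence, not from the study's data
    print(len(vector), vector[:16])
```

For a sequence of length L, this yields 16 + 4L + 3L features, which is the kind of vector space a feature selection step would then reduce.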
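TOPSIS ranks alternatives by their closeness to an ideal solution across several criteria. The sketch below shows the standard procedure (vector normalization, weighting, distances to the ideal and anti-ideal points, closeness score) in Python. The abstract does not state which criteria or weights were used, so the equal weights, the four criteria, and the score matrix are illustrative assumptions; the numbers are toy values, not results from the study.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a TOPSIS closeness score in [0, 1] for each row of `matrix`.

    matrix  : list of rows, one per alternative, one column per criterion
    weights : one weight per criterion
    benefit : True where larger is better, False where smaller is better
    Assumes no all-zero column and at least two distinct rows.
    """
    n = len(weights)
    # Vector normalization: divide each column by its Euclidean norm.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]

    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]

    scores = []
    for row in weighted:
        d_plus = math.dist(row, ideal)   # distance to the ideal solution
        d_minus = math.dist(row, anti)   # distance to the anti-ideal solution
        scores.append(d_minus / (d_plus + d_minus))
    return scores


if __name__ == "__main__":
    # Toy values for three hypothetical models; columns are
    # accuracy, sensitivity, specificity, MCC.
    names = ["model_a", "model_b", "model_c"]
    perf = [
        [0.90, 0.95, 0.85, 0.80],
        [0.88, 0.97, 0.80, 0.78],
        [0.70, 0.75, 0.65, 0.40],
    ]
    scores = topsis(perf, weights=[0.25] * 4, benefit=[True] * 4)
    for name, s in sorted(zip(names, scores), key=lambda t: t[1], reverse=True):
        print(f"{name}: {s:.3f}")
```

Models with nearly equal closeness scores, as reported for the top three in the abstract, are effectively tied under this ranking.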
Source journal: Current Bioinformatics (Biology: Biochemical Research Methods)
CiteScore: 6.60
Self-citation rate: 2.50%
Articles published: 77
Review time: >12 weeks
Journal description: Current Bioinformatics aims to publish all the latest and outstanding developments in bioinformatics. Each issue contains a series of timely in-depth reviews and mini-reviews, research papers, and guest-edited thematic issues written by leaders in the field, covering a wide range of the integration of biology with computer and information science. The journal focuses on advances in computational molecular/structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications for key issues related to health care, medicine, genetic disorders, development of agricultural products, renewable energy, environmental protection, etc.