GB5mCPred: Cross-species 5mc Site Predictor Based on Bootstrap-based Stochastic Gradient Boosting Method for Poaceae

IF 2.9 3区生物学 Q3 BIOCHEMICAL RESEARCH METHODS Current Bioinformatics Pub Date : 2024-04-15 DOI:10.2174/0115748936285544231221113226

Dipro Sinha, Tanwy Dasmandal, Md Yeasin, D.C Mishra, Anil Rai, Sunil Archak

{"title":"GB5mCPred: Cross-species 5mc Site Predictor Based on Bootstrap-based Stochastic Gradient Boosting Method for Poaceae","authors":"Dipro Sinha, Tanwy Dasmandal, Md Yeasin, D.C Mishra, Anil Rai, Sunil Archak","doi":"10.2174/0115748936285544231221113226","DOIUrl":null,"url":null,"abstract":"Background: One of the most prevalent epigenetic alterations in all three kingdoms of life is 5mC, which plays a part in a wide range of biological functions. Although in-vitro techniques are more effective in detecting epigenetic alterations, they are time and money-intensive. Artificial intelligence-based in silico approaches have been used to overcome these obstacles. background: One of the most prevalent epigenetic alterations in all three kingdoms of life is 5mC, which plays a part in a wide range of biological functions. Although in-vitro techniques are more effective in detecting epigenetic alterations, they are time and money intensive. Artificial intelligence-based in silico approaches have been used to overcome these obstacles. Aim: This study aimed to develop an ML-based predictor for the detection of 5mC sites in Poaceae. Objective: The objective of this study was the evaluation of machine learning and deep learning models for the prediction of 5mC sites in rice. Method: In this study, the vectorization of DNA sequences has been performed using three distinct feature sets- Oligo Nucleotide Frequencies (k = 2), Mono-nucleotide Binary Encoding, and Chemical Properties of Nucleotides. Two deep learning models, long short-term memory (LSTM) and Bidirectional LSTM (Bi-LSTM), as well as nine machine learning models, including random forest, gradient boosting, naïve bayes, regression tree, k-Nearest neighbour, support vector machine, adaboost, multiple logistic regression, and artificial neural network, were investigated. Also, bootstrap resampling was used to build more efficient models along with a hybrid feature selection module for dimensional reduction and removal of irrelevant features of the vector space. Result: Random Forest gains the maximum accuracy, specificity and MCC, i.e., 92.6%, 86.41% and 0.84. Gradient Boosting obtained the maximum sensitivity, i.e., 96.85%. The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) technique showed that the best three models were Random Forest, Gradient Boosting, and Support Vector Machine in terms of accurate prediction of 5mC sites in rice. We developed an R-package, ‘GB5mCPred,’ and it is available in CRAN (https://cran.r-project.org/web/packages/GB5mcPred/index.html). Also, a user-friendly prediction server was made based on this algorithm (http://cabgrid.res.in:5474/). Conclusion: With nearly equal TOPSIS scores, Random Forest, Gradient Boosting, and Support Vector Machine ended up being the best three models. The major rationale may be found in their architectural design since they are gradual learning models that can capture the 5mC sites more correctly than other learning models.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"17 1","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.2174/0115748936285544231221113226","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Background: One of the most prevalent epigenetic alterations in all three kingdoms of life is 5mC, which plays a part in a wide range of biological functions. Although in-vitro techniques are more effective in detecting epigenetic alterations, they are time and money-intensive. Artificial intelligence-based in silico approaches have been used to overcome these obstacles. background: One of the most prevalent epigenetic alterations in all three kingdoms of life is 5mC, which plays a part in a wide range of biological functions. Although in-vitro techniques are more effective in detecting epigenetic alterations, they are time and money intensive. Artificial intelligence-based in silico approaches have been used to overcome these obstacles. Aim: This study aimed to develop an ML-based predictor for the detection of 5mC sites in Poaceae. Objective: The objective of this study was the evaluation of machine learning and deep learning models for the prediction of 5mC sites in rice. Method: In this study, the vectorization of DNA sequences has been performed using three distinct feature sets- Oligo Nucleotide Frequencies (k = 2), Mono-nucleotide Binary Encoding, and Chemical Properties of Nucleotides. Two deep learning models, long short-term memory (LSTM) and Bidirectional LSTM (Bi-LSTM), as well as nine machine learning models, including random forest, gradient boosting, naïve bayes, regression tree, k-Nearest neighbour, support vector machine, adaboost, multiple logistic regression, and artificial neural network, were investigated. Also, bootstrap resampling was used to build more efficient models along with a hybrid feature selection module for dimensional reduction and removal of irrelevant features of the vector space. Result: Random Forest gains the maximum accuracy, specificity and MCC, i.e., 92.6%, 86.41% and 0.84. Gradient Boosting obtained the maximum sensitivity, i.e., 96.85%. The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) technique showed that the best three models were Random Forest, Gradient Boosting, and Support Vector Machine in terms of accurate prediction of 5mC sites in rice. We developed an R-package, ‘GB5mCPred,’ and it is available in CRAN (https://cran.r-project.org/web/packages/GB5mcPred/index.html). Also, a user-friendly prediction server was made based on this algorithm (http://cabgrid.res.in:5474/). Conclusion: With nearly equal TOPSIS scores, Random Forest, Gradient Boosting, and Support Vector Machine ended up being the best three models. The major rationale may be found in their architectural design since they are gradual learning models that can capture the 5mC sites more correctly than other learning models.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

GB5mCPred：基于 Bootstrap 的随机梯度提升法的 Poaceae 跨物种 5mc 位点预测器

背景：5mC 是生命三大领域中最普遍的表观遗传改变之一，它在多种生物功能中发挥着作用。虽然体外技术能更有效地检测表观遗传学改变，但需要耗费大量时间和金钱。基于人工智能的硅学方法被用来克服这些障碍：5mC 是生命三大领域中最普遍的表观遗传学改变之一，它在广泛的生物功能中发挥着作用。虽然体外技术能更有效地检测表观遗传学改变，但耗时耗钱。基于人工智能的硅学方法被用来克服这些障碍。目的：本研究旨在开发一种基于 ML 的预测器，用于检测 Poaceae 中的 5mC 位点。目标本研究旨在评估用于预测水稻中 5mC 位点的机器学习和深度学习模型。研究方法在本研究中，使用三个不同的特征集对 DNA 序列进行了矢量化--寡核苷酸频率（k = 2）、单核苷酸二进制编码和核苷酸的化学性质。研究了两种深度学习模型--长短期记忆（LSTM）和双向 LSTM（Bi-LSTM），以及九种机器学习模型，包括随机森林、梯度提升、奈夫贝叶斯、回归树、k-近邻、支持向量机、adaboost、多元逻辑回归和人工神经网络。此外，还使用了引导重采样来建立更有效的模型，并使用混合特征选择模块来降低维度和去除向量空间中的无关特征。结果随机森林获得了最高的准确率、特异性和 MCC，即 92.6%、86.41% 和 0.84。梯度提升技术获得了最高灵敏度，即 96.85%。与理想解相似度排序技术（TOPSIS）显示，在准确预测水稻 5mC 位点方面，随机森林、梯度提升和支持向量机这三个模型最佳。我们开发了一个名为 "GB5mCPred "的 R 包，并将其发布在 CRAN (https://cran.r-project.org/web/packages/GB5mcPred/index.html) 上。此外，我们还基于该算法开发了一个用户友好型预测服务器 (http://cabgrid.res.in:5474/)。结论随机森林、梯度提升和支持向量机的 TOPSIS 分数几乎相等，最终成为最佳的三种模型。主要原因可能在于它们的架构设计，因为它们是渐进式学习模型，能比其他学习模型更正确地捕捉 5mC 位点。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Current Bioinformatics 生物-生化研究方法

CiteScore

6.60

自引率

2.50%

发文量

审稿时长

>12 weeks

期刊介绍： Current Bioinformatics aims to publish all the latest and outstanding developments in bioinformatics. Each issue contains a series of timely, in-depth/mini-reviews, research papers and guest edited thematic issues written by leaders in the field, covering a wide range of the integration of biology with computer and information science. The journal focuses on advances in computational molecular/structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications on key issues related to health care, medicine, genetic disorders, development of agricultural products, renewable energy, environmental protection, etc.