A hybrid approach for predicting transcription factors.

Frontiers in Bioinformatics (IF 3.9, Q2, Mathematical & Computational Biology). Pub Date: 2024-07-25; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1425419
Sumeet Patiyal, Palak Tiwari, Mohit Ghai, Aman Dhapola, Anjali Dhall, Gajendra P S Raghava
Volume 4, article 1425419 (Journal Article). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11306938/pdf/

Abstract

Transcription factors are essential DNA-binding proteins that regulate the transcription rate of several genes and thereby control gene expression inside a cell. Predicting transcription factors with high precision is important for understanding biological processes such as cell differentiation, intracellular signaling, and cell-cycle control. In this study, we developed a hybrid method that combines alignment-based and alignment-free methods to predict transcription factors with higher accuracy. All models were trained, tested, and evaluated on a large dataset containing 19,406 transcription factors and 523,560 non-transcription factor protein sequences. To avoid bias in evaluation, the dataset was divided into training and validation/independent sets: 80% of the data was used for training and the remaining 20% for external validation. For the alignment-free methods, models were developed using machine learning techniques and the composition-based features of a protein. Our best alignment-free model obtained an AUC of 0.97 on the independent dataset. For the alignment-based method, we used BLAST at different cut-offs to predict transcription factors. Although the alignment-based method demonstrated excellent performance, it could not cover all transcription factors because some queries returned no hits. To combine the strengths of both approaches, the hybrid method adds the scores of the alignment-free and alignment-based methods; it achieved a maximum AUC of 0.99 on the independent dataset. The method proposed in this study performs better than existing methods. We incorporated the best models in the webserver/Python Package Index/standalone package of "TransFacPred" (https://webs.iiitd.edu.in/raghava/transfacpred).
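
The two ingredients named in the abstract, composition-based features and the addition of alignment-free and alignment-based scores, can be illustrated with a minimal Python sketch. This is not the TransFacPred implementation. The function names, the "TF"/"non-TF" labels, the weight of 0.5 added or subtracted for a BLAST hit, and the decision threshold of 0.5 are all assumptions chosen for illustration. The abstract states only that the two scores were added and that some queries return no BLAST hit.

```python
from typing import Dict, Optional

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the percentage of each standard residue in a protein sequence.

    This is one simple composition-based feature; the abstract does not say
    which composition features the authors actually used.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return {aa: 100.0 * residues.count(aa) / total for aa in AMINO_ACIDS}


def blast_score(top_hit_label: Optional[str], weight: float = 0.5) -> float:
    """Map the label of the top BLAST hit to a score (hypothetical scheme).

    ASSUMPTION: +weight for a transcription-factor hit, -weight for a
    non-transcription-factor hit, and 0.0 when BLAST returns no hit.
    """
    if top_hit_label is None:
        return 0.0  # no hit: the alignment-based method abstains
    if top_hit_label == "TF":
        return weight
    if top_hit_label == "non-TF":
        return -weight
    raise ValueError(f"unknown hit label: {top_hit_label!r}")


def hybrid_score(ml_score: float, top_hit_label: Optional[str],
                 weight: float = 0.5) -> float:
    """Add the alignment-free (ML) score and the alignment-based score."""
    return ml_score + blast_score(top_hit_label, weight)


def predict(ml_score: float, top_hit_label: Optional[str],
            threshold: float = 0.5, weight: float = 0.5) -> bool:
    """Call a protein a transcription factor if the hybrid score reaches
    the (assumed) threshold."""
    return hybrid_score(ml_score, top_hit_label, weight) >= threshold


if __name__ == "__main__":
    features = amino_acid_composition("MKRRAAC")
    print({aa: round(pct, 2) for aa, pct in features.items() if pct > 0})
    # ml_score stands in for a classifier probability computed from features.
    print(predict(ml_score=0.4, top_hit_label="TF"))
    print(predict(ml_score=0.4, top_hit_label=None))
```

With no hit, the hybrid score falls back to the machine learning score alone, so every query still receives a prediction. That covers the gap the abstract attributes to the alignment-based method.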
