WSHNN: A Weakly Supervised Hybrid Neural Network for the Identification of DNA-Protein Binding Sites

Wenzheng Bao, Baitong Chen, Yue Zhang
Journal: Current Computer-Aided Drug Design. Published: 2024-02-12. DOI: 10.2174/0115734099277249240129114123 (https://doi.org/10.2174/0115734099277249240129114123)


Introduction: Transcription factors are vital biological components that control gene expression, and their primary biological function is to recognize DNA sequences. Ongoing research has shown that the specificity of DNA-protein binding plays a significant role in gene expression and regulation, and especially in gene therapy. Convolutional Neural Networks (CNNs) have become increasingly popular for predicting DNA-protein-specific binding sites, but their prediction accuracy still needs to be improved.

Methods: We proposed WSHNN, a framework that combines Multi-Instance Learning (MIL) with a hybrid neural network. First, we used sliding windows to split each DNA sequence into multiple overlapping instances, so that each sequence forms a bag containing multiple instances. Then, the instances were encoded using K-mer encoding. Afterward, a hybrid neural network scored each instance in the same bag separately.
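The windowing and encoding steps can be illustrated with a minimal sketch. The abstract does not give the window length, stride, value of K, or the exact encoding layout, so `window`, `stride`, `k`, and the one-hot layout below are assumptions chosen only for illustration; the function names are hypothetical and are not taken from the authors' repository.

```python
from itertools import product


def sliding_windows(seq, window, stride):
    """Split a sequence (the bag) into overlapping instances of length `window`."""
    if window > len(seq):
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def kmer_one_hot(instance, k):
    """One-hot encode every overlapping k-mer over the 4**k DNA vocabulary.

    This is one common form of k-mer encoding, assumed here for illustration.
    """
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    rows = []
    for i in range(len(instance) - k + 1):
        row = [0] * len(vocab)
        idx = vocab.get(instance[i:i + k])  # None if the k-mer has a non-ACGT base
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical toy sequence and parameters.
bag = sliding_windows("ACGTACGTAC", window=6, stride=2)
encoded = [kmer_one_hot(inst, k=2) for inst in bag]
```

With these toy settings the bag holds three instances, and each instance becomes a 5 x 16 matrix (five overlapping 2-mers over a 16-word vocabulary) that a scoring network could take as input.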

Results: Finally, a fully connected network was used to produce the final prediction for each bag. The framework achieved a precision (Pre) of 90.73%, a recall of 82.77%, an accuracy (Acc) of 87.17%, an F1-score of 0.8657, and a Matthews correlation coefficient (MCC) of 0.7462. In addition, we discussed the performance of K-mer encoding. Compared with other state-of-the-art methods, the model achieved better performance using sequence information.
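For reference, the reported metrics follow the standard confusion-matrix definitions, sketched below. The abstract does not report the underlying counts, so the `toy` counts are hypothetical; the only values taken from the text are the reported precision and recall, which are used to cross-check the reported F1-score.

```python
import math


def precision(tp, fp, tn, fn):
    return tp / (tp + fp)


def recall(tp, fp, tn, fn):
    return tp / (tp + fn)


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1_from(p, r):
    """F1-score as the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Cross-check: precision and recall as reported in the abstract.
reported_f1 = f1_from(0.9073, 0.8277)

# Hypothetical confusion-matrix counts, for illustration only.
toy = dict(tp=8, fp=2, tn=7, fn=3)
toy_metrics = {
    "Pre": precision(**toy),
    "Recall": recall(**toy),
    "Acc": accuracy(**toy),
    "MCC": mcc(**toy),
}
```

Rounded to four decimals, `reported_f1` agrees with the F1-score of 0.8657 stated in the abstract.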

Conclusion: The experimental results indicate that Bi-directional Long Short-Term Memory (Bi-LSTM) can better capture long-range relationships in DNA sequences. The code and data are available at https://github.com/baowz12345/Weak_ Super_Network.
