Authors: Wenzheng Bao, Baitong Chen, Yue Zhang
Journal: Current Computer-Aided Drug Design
Publication date: 2024-02-12
Publication type: Journal Article
DOI: https://doi.org/10.2174/0115734099277249240129114123
Citations: 0
Abstract
WSHNN: A Weakly Supervised Hybrid Neural Network for the Identification of DNA-Protein Binding Sites.
Introduction: Transcription factors are vital biological components that control gene expression, and their primary biological function is to recognize DNA sequences. As related research has progressed, the specificity of DNA-protein binding has been found to play a significant role in gene expression, regulation, and especially gene therapy. Convolutional Neural Networks (CNNs) have become increasingly popular for predicting DNA-protein-specific binding sites, but their prediction accuracy needs to be improved.
Methods: We proposed a framework, named WSHNN, that combines Multi-Instance Learning (MIL) with a hybrid neural network. First, we used sliding windows to split each DNA sequence into multiple overlapping instances, so that each sequence forms a bag containing multiple instances. Then, the instances were encoded using K-mer encoding. Afterward, the score of each instance in the same bag was calculated separately by a hybrid neural network.
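The sliding-window and K-mer steps can be illustrated with a minimal sketch. It treats each sequence as one bag of overlapping instances, the usual multi-instance learning convention. The abstract does not state the window length, the stride, the value of K, or the exact form of the encoding, so the function names (`sliding_windows`, `kmer_one_hot`), the parameter values, and the one-hot layout below are assumptions chosen for illustration, not the authors' implementation.

```python
from itertools import product


def sliding_windows(seq, window, stride):
    """Split a sequence into overlapping instances of length `window`.

    The instances returned for one sequence together form one bag.
    `window` and `stride` are illustrative parameters.
    """
    if len(seq) < window:
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def kmer_one_hot(instance, k):
    """Encode every overlapping k-mer of an instance as a one-hot vector.

    Returns a list of (len(instance) - k + 1) rows, each of length 4**k.
    K-mers containing symbols other than A, C, G, T stay all-zero.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    rows = []
    for i in range(len(instance) - k + 1):
        vec = [0] * (4 ** k)
        kmer = instance[i:i + k]
        if kmer in index:
            vec[index[kmer]] = 1
        rows.append(vec)
    return rows


# One toy sequence becomes one bag of three overlapping instances.
bag = sliding_windows("ACGTACGTAC", window=6, stride=2)
encoded = [kmer_one_hot(inst, k=2) for inst in bag]
```

In this toy setting each instance of length 6 yields five 2-mers, so each encoded instance is a 5 x 16 matrix that a downstream network could score.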
Results: Finally, a fully connected network produced the final prediction for each bag. The framework achieved 90.73% precision (Pre), 82.77% recall, 87.17% accuracy (Acc), an F1-score of 0.8657, and a Matthews correlation coefficient (MCC) of 0.7462. In addition, we discussed the performance of K-mer encoding. Compared with other state-of-the-art methods, the model performs better with sequence information.
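For reference, the reported metrics follow the standard binary-classification definitions, sketched below. The confusion-matrix counts are hypothetical (the abstract does not give them) and the helper name `binary_metrics` is illustrative. The last lines check that the reported F1-score agrees with the reported precision and recall.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics computed from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only (not taken from the paper).
m = binary_metrics(tp=8, fp=2, tn=7, fn=3)

# Consistency check on the reported figures: F1 is the harmonic mean
# of precision (0.9073) and recall (0.8277).
f1_from_reported = 2 * 0.9073 * 0.8277 / (0.9073 + 0.8277)
print(round(f1_from_reported, 4))  # 0.8657
```

The harmonic mean of the reported precision and recall rounds to 0.8657, matching the F1-score stated in the abstract.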
Conclusion: The experimental results indicate that Bi-directional Long Short-Term Memory (Bi-LSTM) better captures long-range relationships in DNA sequences. The code and data are available at https://github.com/baowz12345/Weak_ Super_Network.