Hua Zhang, Xiaoqi Yang, Pengliang Chen, Cheng Yang, Bi Chen, Bo Jiang, Guogen Shan
Journal: Expert Systems with Applications, volume 263, Article 125763 (JCR Q1, Computer Science, Artificial Intelligence; impact factor 7.5)
Publication date: 2024-11-10
DOI: 10.1016/j.eswa.2024.125763
URL: https://www.sciencedirect.com/science/article/pii/S0957417424026307
CoSEF-DBP: Convolution scope expanding fusion network for identifying DNA-binding proteins through bilingual representations
Abstract
Precisely recognizing DNA-binding proteins (DBPs) from sequences is crucial for a profound comprehension of the mechanisms governing protein-DNA interactions in various cellular processes. However, traditional in-silico methods for DBP identification encounter several challenges, such as time-consuming evolutionary modeling based on multiple sequence alignments, and intricate feature engineering associated with machine or deep learning approaches. In this paper, we introduce a novel end-to-end predictor for identifying DNA-binding proteins without intricate feature engineering, which innovatively enriches the semantics of amino acid sequences through the fusion of bilingual representations derived from distinct language models. We further design a convolution scope expanding (CoSE) module to widen the receptive fields of convolution kernels, thereby forming protein-level CoSE representation sequences. These representations are subsequently integrated via BiLSTM in conjunction with a simplified capsule network, enhancing the hierarchical feature extraction capability. Extensive experiments confirm that our model surpasses existing baselines across diverse benchmark datasets, notably achieving at least a 5.1% improvement in MCC value on the UniSwiss dataset.
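The abstract reports its headline result as a gain in MCC (Matthews correlation coefficient) but does not define it. For readers unfamiliar with the metric, the following is a minimal sketch of the standard MCC formula for binary classification, written in plain Python. The function names (`confusion_counts`, `mcc`) and the toy labels are illustrative assumptions, not taken from the paper, and this is not the authors' evaluation code.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = DNA-binding, 0 = non-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    By common convention, 0.0 is returned when the denominator is zero.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical toy labels, for illustration only.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 1]
    print(mcc(*confusion_counts(y_true, y_pred)))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and it uses all four confusion-matrix cells, which is why it is often preferred over accuracy on class-imbalanced protein datasets.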
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.