Authors: Sharzil Haris Khan, Hilal Tayara, Kil To Chong
Journal: Measurement, volume 251, Article 117227 (impact factor 6.1; JCR Q1, Engineering, Multidisciplinary)
Publication date: 2025-06-30 (Epub 2025-03-15)
DOI: 10.1016/j.measurement.2025.117227
URL: https://www.sciencedirect.com/science/article/pii/S026322412500586X
Citations: 0
TranP-B-site: A Transformer Enhanced Method for prediction of binding sites of Protein-protein interactions
Protein-protein interactions (PPIs) govern essential biological processes and rely on specific binding sites for the molecular machinery in cells. Identifying these binding sites is crucial, and computational methods are emerging as efficient alternatives to labor-intensive experimental approaches. While various techniques leverage both sequence and structural information of amino acids, the limited availability of protein structural data in databases makes sequence-based models more practical. The proposed model, named TranP-B-site, applies a convolutional neural network (CNN) to transformer-model embeddings of the amino acid sequence to predict the binding sites of PPIs. First, two types of features are extracted for each amino acid in a protein sequence: a one-hot encoding, which represents low-level features, and a transformer-based embedding, which carries information about the entire protein sequence. The one-hot encodings and the amino acid embeddings are each concatenated to form a matrix, giving two matrices. Then, two local feature sets are created by applying a windowing technique across these matrices. The amino acid embedding-based local feature set is fed into a CNN architecture, while the one-hot encoding-based local features are fed into a neural network. Finally, a sub-neural network classifies the concatenated output of the CNN and the neural network. On an independent dataset, the proposed model improves the Matthews correlation coefficient (MCC) by 3% and accuracy by 7% compared to the previous state-of-the-art sequence-based model. Additionally, a new test dataset was curated from recently published protein sequences in the PDB database, and the proposed model outperformed other state-of-the-art models on it.
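The feature-extraction steps described in the abstract (per-residue one-hot encoding, per-residue embeddings, and a windowing step that yields local features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the window size of 5, the zero padding at the sequence ends, the 8-dimensional random vectors standing in for transformer embeddings, and the names `one_hot_encode` and `sliding_windows` are all assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return an (L, 20) matrix; unknown residues become all-zero rows."""
    matrix = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            matrix[pos, idx] = 1.0
    return matrix


def sliding_windows(features: np.ndarray, window: int) -> np.ndarray:
    """Return an (L, window, D) array: one zero-padded window centred on each residue."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a centre residue")
    half = window // 2
    # Pad only along the sequence axis so edge residues still get full windows.
    padded = np.pad(features, ((half, half), (0, 0)))
    return np.stack([padded[i:i + window] for i in range(features.shape[0])])


sequence = "MKTAYIAKQR"  # hypothetical toy sequence
window = 5  # assumed window size, not taken from the paper
rng = np.random.default_rng(0)
# Random stand-in for transformer embeddings (assumed dimension 8).
embeddings = rng.normal(size=(len(sequence), 8)).astype(np.float32)

one_hot_local = sliding_windows(one_hot_encode(sequence), window)
embedding_local = sliding_windows(embeddings, window)
print(one_hot_local.shape, embedding_local.shape)
```

In a pipeline of the kind the abstract outlines, `embedding_local` would be the input to the CNN branch and `one_hot_local` the input to the plain neural-network branch, with one window (and one binding/non-binding label) per residue.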
About the journal:
Contributions are invited on novel achievements in all fields of measurement and instrumentation science and technology. Authors are encouraged to submit novel material, whose ultimate goal is an advancement in the state of the art of: measurement and metrology fundamentals, sensors, measurement instruments, measurement and estimation techniques, measurement data processing and fusion algorithms, evaluation procedures and methodologies for plants and industrial processes, performance analysis of systems, processes and algorithms, mathematical models for measurement-oriented purposes, distributed measurement systems in a connected world.