Predicting C- and S-linked Glycosylation sites from protein sequences using protein language models
Author: Md Muhaiminul Islam Nafi
Journal: Computers in Biology and Medicine, volume 189, Article 109956 (impact factor 6.3; JCR Q1, Biology)
Published: 2025-05-01 (Epub 2025-03-11)
DOI: 10.1016/j.compbiomed.2025.109956
URL: https://www.sciencedirect.com/science/article/pii/S0010482525003075
Predicting C-linked and S-linked glycosylation sites (glycosites) is an essential task in the study of post-translational modifications (PTMs), yet experimental techniques such as Capillary Electrophoresis (CE), Enzymatic Deglycosylation, and Mass Spectrometry (MS) are expensive. Computational techniques are therefore required to predict these glycosites. Here, different language model embeddings and sequential features were explored. Two feature selection methods, Recursive Feature Elimination (RFE) and Particle Swarm Optimization (PSO), were employed to identify the optimal feature set. The final models were chosen on the basis of cross-validation results. Three sampling strategies for handling imbalanced datasets were examined: random undersampling, Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN).
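To make the sampling strategies concrete, the following is a minimal standard-library sketch of random undersampling and a simplified SMOTE-style interpolation. It is not the author's implementation: the function names, the toy feature vectors, and the parameter `k` are illustrative assumptions, and the oversampler assumes binary labels and at least two minority samples. ADASYN, which additionally weights minority samples by how hard they are to learn, is not covered here.

```python
import random

def random_undersample(X, y, seed=0):
    """Drop majority-class samples at random until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]

def smote_like_oversample(X, y, minority_label=1, k=2, seed=0):
    """Add synthetic minority samples by interpolating toward a nearby
    minority neighbour (simplified SMOTE; binary labels assumed)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_needed = max(n_majority - len(minority), 0)
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of `base` (squared Euclidean distance).
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to neighbour
        X_new.append([a + t * (b - a) for a, b in zip(base, other)])
        y_new.append(minority_label)
    return X_new, y_new

# Hypothetical toy data: six negatives (non-glycosites), two positives (glycosites).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
     [0.2, 0.4], [0.4, 0.1], [1.0, 1.0], [1.2, 0.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_under, y_under = random_undersample(X, y)
X_over, y_over = smote_like_oversample(X, y)
print(len(y_under), sum(y_under))  # balanced by removing negatives
print(len(y_over), sum(y_over))    # balanced by adding synthetic positives
```

In practice these strategies are commonly taken from a library such as imbalanced-learn rather than written by hand. In either case, resampling is normally applied to the training folds only, so that cross-validation and independent-test scores reflect the original class distribution.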
In this study, two models, DeepCSEmbed-C and DeepCSEmbed-S, are proposed for C-linked and S-linked glycosylation prediction, respectively. DeepCSEmbed-C is a dual-branch deep learning model comprising a Feedforward Neural Network (FNN) branch and an Inception branch, coupled with a random undersampling strategy. DeepCSEmbed-S is a Categorical Boosting (CAT) model with the SMOTE oversampling strategy. DeepCSEmbed-C outperformed available state-of-the-art (SOTA) methods, achieving 92.9% sensitivity, 95.1% F1-score, and 90.6% Matthews correlation coefficient (MCC) on the independent dataset. Datasets and Python scripts for training and testing the models are freely accessible at https://github.com/nafcoder/DeepCSEmbed.
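For reference, the three reported metrics follow the standard definitions over a binary confusion matrix. The sketch below uses only those textbook formulas; the function name and the example counts are hypothetical and are not taken from the paper's results.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, F1-score, and MCC from confusion-matrix counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "f1": f1, "mcc": mcc}

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside sensitivity and F1-score on imbalanced data such as glycosite datasets.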
About the journal:
Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.