通过知识提炼和自然语言处理提高拟南芥泛素化位点预测能力

IF 4.2 3区 生物学 Q1 BIOCHEMICAL RESEARCH METHODS Methods Pub Date : 2024-10-22 DOI:10.1016/j.ymeth.2024.10.006
Van-Nui Nguyen , Thi-Xuan Tran , Thi-Tuyen Nguyen , Nguyen Quoc Khanh Le
{"title":"通过知识提炼和自然语言处理提高拟南芥泛素化位点预测能力","authors":"Van-Nui Nguyen ,&nbsp;Thi-Xuan Tran ,&nbsp;Thi-Tuyen Nguyen ,&nbsp;Nguyen Quoc Khanh Le","doi":"10.1016/j.ymeth.2024.10.006","DOIUrl":null,"url":null,"abstract":"<div><div>Protein ubiquitination is a critical post-translational modification (PTM) involved in diverse biological processes and plays a pivotal role in regulating physiological mechanisms and disease states. Despite various efforts to develop ubiquitination site prediction tools across species, these tools mainly rely on predefined sequence features and machine learning algorithms, with species-specific variations in ubiquitination patterns remaining poorly understood. This study introduces a novel approach for predicting <em>Arabidopsis thaliana</em> ubiquitination sites using a neural network model based on knowledge distillation and natural language processing (NLP) of protein sequences. Our framework employs a multi-species “Teacher model” to guide a more compact, species-specific “Student model”, with the “Teacher” generating pseudo-labels that enhance the “Student” learning and prediction robustness. Cross-validation results demonstrate that our model achieves superior performance, with an accuracy of 86.3 % and an area under the curve (AUC) of 0.926, while independent testing confirmed these results with an accuracy of 86.3 % and an AUC of 0.923. Comparative analysis with established predictors further highlights the model’s superiority, emphasizing the effectiveness of integrating knowledge distillation and NLP in ubiquitination prediction tasks. This study presents a promising and efficient approach for ubiquitination site prediction, offering valuable insights for researchers in related fields. The code and resources are available on GitHub: <span><span>https://github.com/nuinvtnu/KD_ArapUbi</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":390,"journal":{"name":"Methods","volume":null,"pages":null},"PeriodicalIF":4.2000,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing Arabidopsis thaliana ubiquitination site prediction through knowledge distillation and natural language processing\",\"authors\":\"Van-Nui Nguyen ,&nbsp;Thi-Xuan Tran ,&nbsp;Thi-Tuyen Nguyen ,&nbsp;Nguyen Quoc Khanh Le\",\"doi\":\"10.1016/j.ymeth.2024.10.006\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Protein ubiquitination is a critical post-translational modification (PTM) involved in diverse biological processes and plays a pivotal role in regulating physiological mechanisms and disease states. Despite various efforts to develop ubiquitination site prediction tools across species, these tools mainly rely on predefined sequence features and machine learning algorithms, with species-specific variations in ubiquitination patterns remaining poorly understood. This study introduces a novel approach for predicting <em>Arabidopsis thaliana</em> ubiquitination sites using a neural network model based on knowledge distillation and natural language processing (NLP) of protein sequences. Our framework employs a multi-species “Teacher model” to guide a more compact, species-specific “Student model”, with the “Teacher” generating pseudo-labels that enhance the “Student” learning and prediction robustness. Cross-validation results demonstrate that our model achieves superior performance, with an accuracy of 86.3 % and an area under the curve (AUC) of 0.926, while independent testing confirmed these results with an accuracy of 86.3 % and an AUC of 0.923. Comparative analysis with established predictors further highlights the model’s superiority, emphasizing the effectiveness of integrating knowledge distillation and NLP in ubiquitination prediction tasks. This study presents a promising and efficient approach for ubiquitination site prediction, offering valuable insights for researchers in related fields. The code and resources are available on GitHub: <span><span>https://github.com/nuinvtnu/KD_ArapUbi</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":390,\"journal\":{\"name\":\"Methods\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-10-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Methods\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1046202324002238\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Methods","FirstCategoryId":"99","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1046202324002238","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

摘要

蛋白质泛素化是一种关键的翻译后修饰(PTM),涉及多种生物过程,在调节生理机制和疾病状态方面起着关键作用。尽管人们在开发跨物种泛素化位点预测工具方面做出了各种努力,但这些工具主要依赖于预定义的序列特征和机器学习算法,对泛素化模式的物种特异性差异仍然知之甚少。本研究介绍了一种预测拟南芥泛素化位点的新方法,该方法使用基于蛋白质序列知识提炼和自然语言处理(NLP)的神经网络模型。我们的框架采用多物种 "教师模型 "来指导更紧凑、特定物种的 "学生模型","教师 "生成伪标签以增强 "学生 "的学习和预测鲁棒性。交叉验证结果表明,我们的模型性能优越,准确率达 86.3%,曲线下面积(AUC)为 0.926;独立测试证实了这些结果,准确率达 86.3%,曲线下面积(AUC)为 0.923。与已有预测工具的比较分析进一步凸显了该模型的优越性,强调了在泛素化预测任务中整合知识提炼和 NLP 的有效性。这项研究为泛素化位点预测提供了一种前景广阔的高效方法,为相关领域的研究人员提供了宝贵的见解。代码和资源可在 GitHub 上获取:https://github.com/nuinvtnu/KD_ArapUbi.
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Enhancing Arabidopsis thaliana ubiquitination site prediction through knowledge distillation and natural language processing
Protein ubiquitination is a critical post-translational modification (PTM) involved in diverse biological processes and plays a pivotal role in regulating physiological mechanisms and disease states. Despite various efforts to develop ubiquitination site prediction tools across species, these tools mainly rely on predefined sequence features and machine learning algorithms, with species-specific variations in ubiquitination patterns remaining poorly understood. This study introduces a novel approach for predicting Arabidopsis thaliana ubiquitination sites using a neural network model based on knowledge distillation and natural language processing (NLP) of protein sequences. Our framework employs a multi-species “Teacher model” to guide a more compact, species-specific “Student model”, with the “Teacher” generating pseudo-labels that enhance the “Student” learning and prediction robustness. Cross-validation results demonstrate that our model achieves superior performance, with an accuracy of 86.3 % and an area under the curve (AUC) of 0.926, while independent testing confirmed these results with an accuracy of 86.3 % and an AUC of 0.923. Comparative analysis with established predictors further highlights the model’s superiority, emphasizing the effectiveness of integrating knowledge distillation and NLP in ubiquitination prediction tasks. This study presents a promising and efficient approach for ubiquitination site prediction, offering valuable insights for researchers in related fields. The code and resources are available on GitHub: https://github.com/nuinvtnu/KD_ArapUbi.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Methods
Methods 生物-生化研究方法
CiteScore
9.80
自引率
2.10%
发文量
222
审稿时长
11.3 weeks
期刊介绍: Methods focuses on rapidly developing techniques in the experimental biological and medical sciences. Each topical issue, organized by a guest editor who is an expert in the area covered, consists solely of invited quality articles by specialist authors, many of them reviews. Issues are devoted to specific technical approaches with emphasis on clear detailed descriptions of protocols that allow them to be reproduced easily. The background information provided enables researchers to understand the principles underlying the methods; other helpful sections include comparisons of alternative methods giving the advantages and disadvantages of particular methods, guidance on avoiding potential pitfalls, and suggestions for troubleshooting.
期刊最新文献
Development and validation of a machine learning model for predicting drug-drug interactions with oral diabetes medications Development of novel digital PCR assays for the rapid quantification of Gram-negative bacteria biomarkers using RUCS algorithm Imaging flow cytometry reveals LPS-induced changes to intracellular intensity and distribution of α-synuclein in a TLR4-dependent manner in STC-1 cells. MLFA-UNet: A multi-level feature assembly UNet for medical image segmentation Editorial Board
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1